KR20150050211A

KR20150050211A - Anonymizing Device and Method using Information theory approach of Electronic Medical Records

Info

Publication number: KR20150050211A
Application number: KR1020130131737A
Authority: KR
Inventors: 이도헌; 유선용
Original assignee: 한국과학기술원
Priority date: 2013-10-31
Filing date: 2013-10-31
Publication date: 2015-05-08
Also published as: KR101519449B1

Abstract

A de-identification apparatus, method and recording medium using an information theory of an electronic medical record are disclosed. The present invention includes: a storage part storing electronic medical record (EMR) data; an entropy calculation part calculating entropy on each of attributes of the stored data; an attribute combination part combining each of the attributes to generate attribute combinations; a joint entropy calculation part calculating joint entropy on the generated attribute combinations; a dependence calculation part calculating dependence of the calculated joint entropy; an attribute combination selecting part selecting the attribute combinations with information loss amount less than a preset amount among the attribute combinations of the joint entropy and the dependence; and a k-anonymity part de-identifying data using a k-anonymity method based on the selected attribute combinations.

Description

Anonymizing apparatus and method using information theory of electronic medical records

The present invention relates to an anonymizing apparatus and method, and more particularly to an anonymizing apparatus and method that apply information theory to electronic medical records.

An electronic medical record is a medical record generated in clinical practice. Beyond its original purpose of supporting medical care, it has recently been recognized as an important tool for developing new medical technologies through the analysis of medical information and for building patient recommendation systems. However, because it contains patients' private information, its disclosure to researchers is restricted. There is therefore strong demand for anonymization techniques that protect private information without damaging the medical information itself.

Conventional studies have focused mainly on deleting or concealing the information items to be protected, which leaves a risk that private information will later be exposed through an unexpected path. To reduce this risk, re-identification prevention (anonymization) techniques have been studied continuously, and much current research is based on the k-anonymity condition.

Among the conventional approaches to anonymization, the de-identification technique and the re-identification prevention technique are described below.

FIG. 1(a) illustrates the conventional de-identification technique. De-identification performs anonymization by deleting or concealing personally identifying information. Information that intuitively identifies an individual, such as a name, resident registration number, or telephone number, counts as personally identifying information. That is, de-identification deletes the Name and SSN (Social Security Number) columns contained in table 110 and produces the anonymized table 130.

However, as shown in FIG. 1(b), although de-identification achieves some degree of anonymization by deleting personally identifying information, an individual can still be re-identified from a combination of the information that remains. When re-identification occurs, even though tables 150 and 160 have been anonymized, table 180 can be extracted by combining several pieces of the remaining information. This reveals who has which disease. If such information is misused, individuals can be disadvantaged in many areas, such as insurance, politics, and employment.
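The re-identification described above can be sketched as a join of two tables on the attributes they share. The snippet below is a minimal illustration only: the tables, column names, and the `link_records` helper are hypothetical and are not taken from the figures.

```python
# Hypothetical toy tables; names and values are illustrative assumptions.
medical = [
    {"zip": "13053", "age": 28, "disease": "flu"},
    {"zip": "13068", "age": 35, "disease": "cancer"},
]
public = [
    {"name": "Alice", "zip": "13053", "age": 28},
    {"name": "Bob", "zip": "13068", "age": 35},
    {"name": "Carol", "zip": "14850", "age": 50},
]


def link_records(released, external, keys):
    """Join two tables on shared columns and keep only unique matches."""
    index = {}
    for row in external:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in released:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # a single match re-identifies the record
            linked.append({**matches[0], **row})
    return linked


reidentified = link_records(medical, public, ["zip", "age"])
```

Each entry in `reidentified` pairs a name from the external table with a disease from the de-identified table, which is the kind of disclosure the passage warns about.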

FIG. 2 illustrates a technique for preventing re-identification. The re-identification prevention technique anonymizes data on which de-identification has already been performed. That is, it lowers the probability of re-identification by generalizing part of the information in tables 220 and 230 obtained after de-identification. In one embodiment, generalization can be performed by erasing part of the postal code in table 220. As a result, even when tables 210 and 220 are combined, re-identification is not possible, as shown in table 250.
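Generalizing a postal code by erasing part of it can be sketched as follows. This is an assumed illustration: the `generalize_zip` helper, the number of digits kept, and the `*` mask character are choices made for the example, not details given in the text.

```python
def generalize_zip(zip_code, keep=3, mask="*"):
    """Keep the first `keep` characters and mask the rest."""
    hidden = max(len(zip_code) - keep, 0)
    return zip_code[:keep] + mask * hidden


# Two distinct postal codes collapse into the same generalized value.
generalized = [generalize_zip(z) for z in ["13053", "13068"]]
```

After masking, the two records can no longer be told apart by postal code alone, which lowers the chance of a unique match against an external table.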

The core technique for preventing re-identification is k-anonymity. k-anonymity distinguishes three kinds of data: explicit identifiers (Explicit ID), sensitive information, and quasi-identifiers (Quasi-identifier or Quasi-ID).

An explicit identifier is information used directly to identify an individual, such as a name, resident registration number, or telephone number.

Sensitive information is data generated by a particular institution. It is closely tied to the individual and may be a sensitive attribute that must be protected. Examples are the prescription records or diagnosis records provided by a hospital.

Quasi-identifiers are the items, other than explicit identifiers and sensitive information, that could cause re-identification either by being linked with items in an external table or by being combined with one another. In FIG. 3, these are the items shaded red in tables 310 and 320: age (Age) and postal code (ZIP).

k-anonymity generalizes the data so that, over the items classified as quasi-identifiers, every record form occurs at least k times. In other words, it is the process of generalizing the data so that at least k records share the same quasi-identifier values.

As an example of k-anonymity, in FIG. 4, FIG. 4(a) is the original table and FIG. 4(b) is the table after k-anonymity has been applied with k = 4. The quasi-identifiers 410 and 430 in FIG. 4(a) and FIG. 4(b) are postal code, age, and nationality, and the value of k applied in FIG. 4(b) is 4.

With k = 4, k-anonymity generalizes the data so that at least four records have the same quasi-identifier values, as in data 450.
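The k-anonymity condition can be checked by grouping records on their quasi-identifier values and confirming that every group has at least k members. The sketch below assumes records are dictionaries. The `is_k_anonymous` helper and the sample rows are hypothetical and do not reproduce FIG. 4.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical generalized table with two groups of four records each.
rows = (
    [{"zip": "130**", "age": "<30", "nationality": "*", "condition": c}
     for c in ["heart disease", "viral infection", "cancer", "cancer"]]
    + [{"zip": "1485*", "age": ">=40", "nationality": "*", "condition": c}
       for c in ["cancer", "heart disease", "viral infection", "viral infection"]]
)

satisfies_4 = is_k_anonymous(rows, ["zip", "age", "nationality"], 4)
satisfies_5 = is_k_anonymous(rows, ["zip", "age", "nationality"], 5)
```

With these rows the table is 4-anonymous but not 5-anonymous, because each quasi-identifier group contains exactly four records.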

Because information is lost when data are generalized, reducing information loss is important for maximizing the usefulness of the data. To reduce information loss, conventional k-anonymity methods have sought ways to cluster records and data that are as similar as possible, as shown in FIG. 5. Table 510 is the original table, table 530 shows the records grouped by Work Class and Marital status for k = 3, and table 550 shows table 530 after k-anonymity has been applied. In particular, because k = 3, it can be seen that three records are identical.

However, when the conventional method searches for similar records over all combinations, the amount of computation grows exponentially, and information loss is not reduced effectively. Information loss is a very important measure for evaluating the performance of k-anonymity. In particular, the conventional method cannot guarantee consistent performance, and the exponential growth of its execution time is a problem that must be solved.

One technical problem to be solved by the present invention is to provide an anonymizing apparatus and method using information theory of electronic medical records that determine the attributes to be generalized by using joint entropy and dependency.

Another technical problem to be solved by the present invention is to provide an anonymizing apparatus and method using information theory of electronic medical records that effectively prevent information loss when k-anonymity is performed.

Yet another technical problem to be solved by the present invention is to provide an anonymizing apparatus and method using information theory of electronic medical records that reduce the amount of computation when k-anonymity is performed.

An anonymizing apparatus using information theory of electronic medical records may include: a storage unit storing electronic medical record (EMR) data; an entropy calculation unit calculating the entropy of each individual attribute of the stored data; an attribute combination unit generating attribute combinations by combining the individual attributes; a joint entropy calculation unit calculating the joint entropy of the generated attribute combinations; a dependency calculation unit calculating the dependency from the calculated joint entropy; an attribute combination selection unit selecting, from the attribute combinations evaluated by joint entropy and dependency, an attribute combination whose information loss is less than a preset amount; and a k-anonymity unit anonymizing the data using k-anonymity based on the selected attribute combination.

The apparatus may further include an input unit that receives new data of the electronic medical record and receives the variable k needed to generalize the data.

The attribute combination unit may sort the entropies or joint entropies in descending order and remove any attribute or attribute combination whose entropy or joint entropy ranks below a preset rank.

The attribute combination unit may sort the dependencies in ascending order and remove any attribute combination whose dependency is higher than that at a preset rank.

The joint entropy calculation unit may calculate the joint entropy of two different attributes according to the following equation.

[Equation]

$$H(X, Y) = -\sum_{i}\sum_{j} p_{ij} \log p_{ij}$$

Here, i and j denote the states of the individual attributes, and $p_{ij}$ denotes the probability of the combined state ij.

To calculate the dependency, the dependency calculation unit may calculate the conditional entropy from the joint entropy according to the following equation.

[Equation]

$$H(X \mid Y) = H(X, Y) - H(Y)$$

Here, $X \mid Y$ denotes the individual attribute x given the individual attribute y.

The dependency calculation unit may calculate the dependency of x on the individual attribute y according to the following equation.

[Equation]

Here, $H(Y)$ denotes the entropy of the individual attribute y.

When no attribute combination has an information loss less than the preset amount, the attribute combination selection unit may cause the operations of the attribute combination unit, the joint entropy calculation unit, and the dependency calculation unit to be repeated.

An anonymizing method using information theory of electronic medical records may include: loading electronic medical record (EMR) data; calculating the entropy of each individual attribute of the loaded data; generating attribute combinations by combining the individual attributes; calculating the joint entropy of the generated attribute combinations; calculating the dependency from the calculated joint entropy; selecting, from the attribute combinations evaluated by joint entropy and dependency, an attribute combination whose information loss is less than a preset amount; and anonymizing the data using k-anonymity based on the selected attribute combination.

The method may further include receiving new data of the electronic medical record and receiving the variable k needed to generalize the data.

The anonymizing apparatus and method using information theory of electronic medical records according to the present invention can determine the attributes to be generalized by using joint entropy and dependency.

They can also effectively prevent information loss when k-anonymity is performed.

In addition, by reducing the amount of computation required for k-anonymity, they can shorten the execution time.

FIG. 1(a) is a diagram illustrating the conventional de-identification technique.
FIG. 1(b) is a diagram illustrating re-identification in the prior art.
FIG. 2 is a diagram illustrating the conventional re-identification prevention technique.
FIG. 3 is a diagram illustrating the quasi-identifiers of conventional k-anonymity.
FIG. 4 is a diagram illustrating the process of applying conventional k-anonymity with k = 4.
FIG. 5 is a diagram illustrating the process of applying conventional k-anonymity by clustering records.
FIG. 6 is a block diagram illustrating the components of an anonymizing apparatus according to an embodiment of the present invention.
FIG. 7 is a block diagram illustrating the components of a control unit according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating entropy and joint entropy according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating a key-value pair structure based on a hash function according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating the proposed data structure based on a linked list according to an embodiment of the present invention.
FIG. 11 is a graph comparing the results of reducing unique patient information according to an embodiment of the present invention.
FIG. 12 is a graph comparing the execution times of the algorithm according to an embodiment of the present invention.
FIG. 13 is a flowchart illustrating the anonymizing method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related well-known configurations or functions are omitted where they would be obvious to those skilled in the art or could obscure the gist of the invention.

FIG. 6 is a block diagram illustrating the components of an anonymizing apparatus according to an embodiment of the present invention.

Referring to FIG. 6, the anonymizing apparatus 1 can anonymize electronic medical record data using k-anonymity. To minimize information loss, the anonymizing apparatus 1 performs the anonymization based on information theory. The anonymizing apparatus 1 may include an input unit 610, a control unit 630, an output unit 650, and a storage unit 670.

The input unit 610 may receive new data of the electronic medical record. The new data may be at least one of data on a new patient, data on corrections to an existing patient's record, and data on additions to an existing patient's record. The input unit 610 may also receive the variable k needed to anonymize the electronic medical record data. The variable k is the parameter required for generalization: the data are generalized so that at least k records have the same data values.

The control unit 630 may calculate the entropy of each individual attribute of the electronic medical record stored in the storage unit 670. The control unit 630 may generate attribute combinations by combining the individual attributes. The control unit 630 may calculate the joint entropy of the generated attribute combinations and calculate the dependency from the calculated joint entropy. From the attribute combinations evaluated by joint entropy and dependency, the control unit 630 may select an attribute combination whose information loss is less than a preset amount. The control unit 630 may then anonymize the data using k-anonymity based on the selected attribute combination.

The output unit 650 may output the original electronic medical record data stored in the storage unit 670. The output unit 650 may also output the anonymized version of the original data. The output unit 650 may include at least one of a monitor, a liquid crystal display, a projector, a TV, a head-up display, and printed matter.

The storage unit 670 may store the electronic medical record data. The storage unit 670 may store the attribute combinations formed from the individual attributes. The storage unit 670 may store the values calculated by the control unit 630, which may be at least one of entropy, joint entropy, and dependency. The storage unit 670 may also store the algorithm needed for anonymization.

FIG. 7 is a block diagram illustrating the components of the control unit according to an embodiment of the present invention, FIG. 8 is a diagram illustrating entropy and joint entropy according to an embodiment of the present invention, FIG. 9 is a diagram illustrating a key-value pair structure based on a hash function according to an embodiment of the present invention, and FIG. 10 is a diagram illustrating the proposed data structure based on a linked list according to an embodiment of the present invention.

Referring to FIGS. 7 to 10, the control unit 630 may use the properties of entropy to determine the individual attributes or attribute combinations that minimize the information loss that can occur during the generalization step of anonymization. To find cases in which combining attributes raises the entropy value, the control unit 630 may calculate a dependency that quantifies this. The control unit 630 may include an entropy calculation unit 710, an attribute combination unit 720, a joint entropy calculation unit 730, a dependency calculation unit 740, an attribute combination selection unit 750, and a k-anonymity unit 760.

The entropy calculation unit 710 may calculate the entropy of each individual attribute of the data. The data may be at least one of explicit identifiers, sensitive information, and quasi-identifiers. To calculate the entropy, the entropy calculation unit 710 first defines the entire population, over which each individual attribute can be expressed as in [Equation 1].

Here, j is the index of an attribute, r is the total number of attributes, and m is the number of values an attribute can take.

For example, an attribute may be race, in which case its m values are the possible race categories.

In particular, entropy is the sum, over the states i of an individual attribute in an ensemble with a probabilistic state distribution, of the probability $p_i$ weighted by its logarithm, and can be defined as in [Equation 2].

[Equation 2]

$$H = -\sum_{i} p_i \log p_i$$

Accordingly, the entropy calculation unit 710 can calculate the entropy of an individual attribute using [Equation 2]. The entropy calculation unit 710 can define the entropies 810 and 820 of two different attributes as in [Equation 3] and [Equation 4].

[Equation 3]

$$H(X) = -\sum_{i} p_i \log p_i$$

[Equation 4]

$$H(Y) = -\sum_{j} p_j \log p_j$$

Here, i and j denote the states of the individual attributes, and $p_i$ and $p_j$ denote the probabilities of i and j, respectively.
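The entropy of a single attribute can be estimated from the relative frequency of each value in the column. The sketch below is a minimal illustration under two assumptions that the text does not state: probabilities are taken as empirical frequencies, and the logarithm is base 2, so the result is in bits. The `entropy` helper name is hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    # Each state i contributes -p_i * log2(p_i), with p_i = count_i / n.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


uniform_column = ["A", "B", "C", "D"]   # four equally likely values
constant_column = ["A", "A", "A", "A"]  # one value only

h_uniform = entropy(uniform_column)
h_constant = entropy(constant_column)
```

A column whose values are evenly spread has high entropy, while a column with a single repeated value has zero entropy.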

Because the entropy calculation unit 710 must know the probability distribution of each attribute value, it may count the number of occurrences of each attribute value. For efficient calculation of the probability distribution, the entropy calculation unit 710 may store the attribute values in the storage unit 670 using a hash map and a linked list.

FIG. 9(a) is a block diagram of keys and their mapped values, and FIG. 9(b) is a table showing the result of one example corresponding to FIG. 9(a). The hash function used by the entropy calculation unit 710 is an algorithm that uses a key to map a large data set to a smaller data set. Because a hash map can look up data values quickly, it is well suited to the large volumes of electronic medical record data held by hospitals. If the value of the key 1010 is 0, the key may hold the structure information of the mapped values linked to it. In FIG. 9(b), the linked structure information may include the {Index, Name, Age} attributes.

FIG. 10 shows the mapped values linked using a linked list. In each node, the field that holds the address of the next node is usually called the next link. The next link may point to the address (1030, 1050, 1070) of a set having the same or a different mapped value. The remaining fields hold the linked mapped values. Such a structure allows elements to be inserted or removed easily without reallocating or reorganizing the entire structure.

The entropy calculation unit 710 can check the number of mapped values and the index of each mapped value. This information allows faster access to the data when entropy is calculated. In addition, after calculating the overall entropy, if some linked mapped values have large entropy, the entropy calculation unit 710 can easily access those values using the key and the indices of the mapped values linked to it.
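The counting structure described above can be approximated with a hash map whose keys are attribute values and whose mapped values are the indices of the records holding them. This sketch swaps in a Python `dict` of lists for the hash map and linked list of FIGS. 9 and 10. The helper names and sample records are hypothetical.

```python
from collections import defaultdict


def build_value_index(records, attribute):
    """Map each attribute value (key) to the indices of records that hold it."""
    index = defaultdict(list)
    for i, record in enumerate(records):
        index[record[attribute]].append(i)
    return index


def value_probabilities(index, total):
    """Turn per-value occurrence counts into an empirical distribution."""
    return {value: len(ids) / total for value, ids in index.items()}


records = [
    {"name": "A", "age": 28},
    {"name": "B", "age": 35},
    {"name": "C", "age": 28},
    {"name": "D", "age": 41},
]
age_index = build_value_index(records, "age")
age_probs = value_probabilities(age_index, len(records))
```

Keeping the record indices alongside the counts gives both the probabilities needed for entropy and direct access to the records behind any value.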

The attribute combination unit 720 may generate attribute combinations by combining the individual attributes of the data. The attribute combination unit 720 may sort the entropies or joint entropies in descending order and remove any attribute or attribute combination whose entropy or joint entropy ranks below a preset rank. The attribute combination unit 720 may also sort the dependencies in ascending order and remove any attribute combination whose dependency is higher than that at a preset rank.
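The rank-based pruning can be sketched as sorting candidates by score and keeping only those within a preset rank. The `prune_by_rank` helper, the `keep` parameter, and the scores below are hypothetical. How the preset rank is chosen is not specified in the text.

```python
def prune_by_rank(scores, keep, descending=True):
    """Sort candidates by score and keep only the first `keep` of them."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=descending)
    return [name for name, _ in ranked[:keep]]


# Entropy: sort in descending order and drop low-entropy candidates.
entropies = {"age": 3.2, "zip": 4.1, "sex": 1.0}
kept_by_entropy = prune_by_rank(entropies, keep=2, descending=True)

# Dependency: sort in ascending order and drop high-dependency candidates.
dependencies = {("age", "zip"): 0.1, ("age", "sex"): 0.9, ("zip", "sex"): 0.4}
kept_by_dependency = prune_by_rank(dependencies, keep=2, descending=False)
```

Pruning in this way shrinks the set of combinations that later steps have to evaluate.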

The joint entropy calculation unit 730 may calculate the joint entropy of the attribute combinations. Joint entropy can serve as a measure of the uncertainty of an attribute combination, and the joint entropy 830 of the two attributes can be defined as in [Equation 5].

[Equation 5]

$$H(X, Y) = -\sum_{i}\sum_{j} p_{ij} \log p_{ij}$$

Here, $p_{ij}$ denotes the probability of the combined state ij.

To calculate the joint entropy of different attribute values in the electronic medical record data, the joint entropy calculation unit 730 may merge the individual values into one using a delimiter. In one embodiment, when the joint entropy calculation unit 730 combines the age and occupation attributes, a joined value such as [example value] may be produced, and [example value] and [example value] may be recognized as the same value. Accordingly, the number of all possible attribute combinations that the joint entropy calculation unit 730 must consider may be as given in Equation 6.
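The delimiter-based merging described above can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: `joint_entropy` is a hypothetical helper, the sample data is invented, the logarithm is taken to base 2, and `||` is borrowed from the way attribute names are joined in Tables 2 and 3 (its use between values is a guess).

```python
import math
from collections import Counter


def joint_entropy(col_a, col_b, delimiter="||"):
    """Joint entropy in bits of two equally long columns.

    Each pair of values is merged into one string key, so the joint
    distribution can be counted like a single attribute.
    """
    joined = [f"{a}{delimiter}{b}" for a, b in zip(col_a, col_b)]
    n = len(joined)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joined).values())


# Hypothetical age and occupation columns; not data from the patent.
ages = [30, 30, 45, 45]
jobs = ["nurse", "clerk", "nurse", "clerk"]
h_joint = joint_entropy(ages, jobs)
```

A string key keeps the lookup structure identical to the single-attribute case. One caveat of this design is that the delimiter must not occur inside any value, otherwise distinct pairs could collide.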

Accordingly, the joint entropy calculation unit 730 may define the sum of all joint entropies as in Equation 7.

The dependency calculation unit 740 may calculate the dependency of the joint entropy. To calculate the dependency, the dependency calculation unit 740 may first calculate the conditional entropies 850 and 860. The conditional entropy 860 may be the entropy that remains when, of the individual attribute entropies 810 and 820, the other entropy 810 is subtracted from the one entropy 820. That is, the conditional entropy 860 may be the one entropy 820 minus the intersection entropy 880.

The dependency calculation unit 740 may calculate the conditional entropy using Equation 8. Here, the conditional term in Equation 8 denotes the individual attribute x with respect to the individual attribute y.

The dependency calculation unit 740 may calculate the dependency of x on the individual attribute y using Equation 9.

The dependency calculation unit 740 may produce a dependency value between 0 and 1. If the two attributes are unrelated, the dependency calculation unit 740 yields 0; conversely, if the two attributes are perfectly related, it yields 1.
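Equations 8 and 9 are not reproduced in this text, so the following is only a sketch of one formulation that behaves as described (0 for unrelated attributes, 1 for perfectly related ones). It assumes the standard identity H(x|y) = H(x, y) - H(y) for the conditional entropy and, as a guess at the normalization, divides the shared information H(x) - H(x|y) by H(y). The function names and the base-2 logarithm are likewise assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x|y) = H(x, y) - H(y), in bits (standard identity)."""
    return entropy(list(zip(x, y))) - entropy(y)


def dependency(x, y):
    """Assumed normalization: (H(x) - H(x|y)) / H(y), which lies in [0, 1]."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0  # a constant attribute shares no information
    return (entropy(x) - conditional_entropy(x, y)) / h_y


# Hypothetical columns: the first pair is independent, the second identical.
unrelated = dependency([0, 0, 1, 1], [0, 1, 0, 1])
identical = dependency([0, 0, 1, 1], [0, 0, 1, 1])
```

With these toy columns the independent pair gives 0 and the identical pair gives 1, matching the two extremes described above.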

The attribute combination selection unit 750 may select, from among the attribute combinations for which the joint entropy and the dependency have been calculated, an attribute combination whose information loss is less than a preset amount. The attribute combination selection unit 750 may select attribute combinations with a high joint entropy. However, it should select combinations whose joint entropy rises because of the combination itself, not combinations whose joint entropy is high merely because an individual attribute already has a high entropy. For this reason, it may take the dependency into account when selecting an attribute combination.

To obtain a better attribute combination, the attribute combination selection unit 750 may repeat the operations of the attribute combination unit 720, the joint entropy calculation unit 730, and the dependency calculation unit 740. That is, when no attribute combination has an information loss below the preset amount, the attribute combination selection unit 750 may repeat the operations of the attribute combination unit 720, the joint entropy calculation unit 730, and the dependency calculation unit 740.

The k-anonymity unit 760 may anonymize the electronic medical record data using the k-anonymity technique based on the selected attribute combination. Based on the value of k entered through the input unit 610, the k-anonymity unit 760 may generalize the data so that, for the items classified as quasi-identifiers, each record pattern occurs at least k times. In other words, the k-anonymity unit 760 may generalize the data so that at least k records share the same quasi-identifier values.
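The k-anonymity condition described above can be illustrated with a small check. This is a sketch under assumptions: `generalize_age`, `is_k_anonymous`, the ten-year age bands, and the sample records are all hypothetical and do not come from the patent.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical records; not data from the patent.
records = [
    {"age": 31, "sex": "F"},
    {"age": 36, "sex": "F"},
    {"age": 42, "sex": "M"},
    {"age": 47, "sex": "M"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

before = is_k_anonymous(records, ["age", "sex"], k=2)
after = is_k_anonymous(generalized, ["age", "sex"], k=2)
```

In this toy data every raw record is unique, so the check fails for k = 2 before generalization and passes once the ages are replaced by ten-year bands.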

FIG. 11 is a graph comparing the results of reducing unique patient information according to an embodiment of the present invention, and FIG. 12 is a graph comparing the execution times of the algorithm according to an embodiment of the present invention.

Referring to FIG. 11 or FIG. 12, the performance of the anonymizing device 1 may be tested using data from the NHANES database, which was designed to measure the health and nutritional status of US residents. The NHANES database has undergone the review and approval required for research by the National Center. The performance experiment used the NHANES 2009-2010 dataset, which includes a variety of demographic information such as race, ethnicity, wages, age, nutritional status, and medical history. The experimental data consisted of 43 attributes and 10,538 patient records, and the performance experiment used 41 attributes, excluding the sequence number.

Attribute   Entropy    Description
WTMEC2YR    12.27819   Full Sample 2 Year MEC Exam Weight
WTINT2YR    11.63055   Full Sample 2 Year Interview Weight
RIDAGEMN    9.536705   Age in Months at Screening - Recode
RIDAGEEX    9.408224   Age in Months at Exam - Recode
INDFMPIR    7.163948   Ratio of family income to poverty
RIDAGEYR    6.164925   Age at Screening Adjudicated - Recode
DMDHRAGE    5.803402   HH Ref Person Age

Table 1 lists the attributes with high entropy values, sorted in descending order by the anonymizing device 1. This result is the criterion by which the anonymizing device 1 judges which attributes should be generalized to keep information loss small, because an attribute with higher entropy is more likely to cause information loss. However, even when the entropy of a single attribute is low, combining attributes can increase the entropy substantially. To address this problem, the anonymizing device 1 calculates the entropy of all possible attribute combinations.

Attribute            Joint entropy   Dependency
RIDAGEMN||WTMEC2YR   13.2899         0.694408
RIDAGEMN||WTINT2YR   13.25515        0.680377
RIDAGEEX||WTINT2YR   13.25303        0.669517
RIDAGEYR||WTMEC2YR   13.11552        0.433983
INDFMPIR||WTMEC2YR   13.10915        0.515882
RIDAGEEX||WTMEC2YR   13.0928         0.699998
DMDHRAGE||WTMEC2YR   13.08687        0.406862

Table 2 shows the joint entropy and dependency that the anonymizing device 1 calculated for all possible attribute combinations. As Table 2 shows, every attribute combination has a joint entropy larger than the entropy of each of its attributes. This is because the joint entropy of the data is always greater than or equal to the entropy of each attribute.
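The inequality stated above follows from the chain rule for entropy, a standard information-theory identity rather than anything specific to this patent. The sketch below writes it out and checks it against the first row of Table 2 using the Table 1 entropies. The last line is only a suggestive observation: the patent's dependency formula is not reproduced in this text, so dividing the shared information by the entropy of the second attribute is an assumption.

```latex
% Chain rule: conditional entropy is never negative.
H(x, y) = H(x) + H(y \mid x) \ge H(x), \qquad H(y \mid x) \ge 0

% First row of Table 2, with x = RIDAGEMN and y = WTMEC2YR:
H(x, y) = 13.2899 \ge \max\{H(x), H(y)\} = \max\{9.536705,\ 12.27819\}

% Shared (mutual) information implied by the two tables:
I(x; y) = H(x) + H(y) - H(x, y) = 9.536705 + 12.27819 - 13.2899 = 8.524995

% Assumed normalization by H(y), close to the listed dependency 0.694408:
I(x; y) / H(y) = 8.524995 / 12.27819 \approx 0.694
```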

Attribute            Joint entropy   Dependency
RIDEXMON||RIDAGEMN   10.57111        0.012251
RIDAGEMN||SDMVPSU    10.56547        0.106382
RIDAGEMN||DMDHRGND   10.46379        0.072598
RIDAGEEX||SDMVPSU    10.43877        0.104853
RIDAGEYR||SDMVSTRA   9.934301        0.025412
DMDHRAGE||SDMVSTRA   9.461573        0.054165
DMDEDUC2||INDFMPIR   9.147959        0.047965

Table 3 shows the attribute combinations for which the anonymizing device 1 found a high joint entropy and a low dependency. The anonymizing device 1 can find attribute combinations that have a high joint entropy value together with a low dependency. As a result, the anonymizing device 1 incurs relatively little information loss during generalization.

FIG. 11 shows the number of unique patients remaining when attribute combinations are removed using the algorithm proposed for the anonymizing device 1 and using a random selection algorithm.

The anonymizing device 1 selected the attribute combinations to be removed over three iterations. For the three iterations, the anonymizing device 1 recommended five attributes or attribute combinations, and it performed an additional iteration whenever the same attribute was selected again.

Accordingly, the results in FIG. 11 confirm that the anonymizing device 1 reduces unique patient information faster and more efficiently than the random selection algorithm.

FIG. 12 shows the execution time of the algorithm proposed for the anonymizing device 1. The performance experiment was run on a computer with a 2.4 GHz CPU and 3.5 GB of main memory.

The results in FIG. 12 show that calculating the joint entropy takes more time than merging the data. The proposed hash-map-based structure is well suited to quickly finding and accessing the values it needs. In each iteration, the whole process involved 5,000 lookups and accesses. In short, the anonymizing device 1 reduces computation time efficiently.

FIG. 13 is a flowchart illustrating the steps of an anonymization method according to an embodiment of the present invention.

Referring to FIG. 13, the anonymizing device 1 may anonymize the electronic medical record data using the k-anonymity technique. To minimize information loss, the anonymizing device 1 may perform the anonymization on the basis of information theory.

The anonymizing device 1 selects high-ranking entropies (S100). The anonymizing device 1 may sort the calculated entropies in descending order and select the attributes with high entropy.

The anonymizing device 1 determines whether the result for the attribute combinations is good enough (S110). The anonymizing device 1 may determine whether an attribute combination performs well. If there is an attribute combination that performs well, the anonymizing device 1 moves to step S170; if there is none, it moves to step S120.

The anonymizing device 1 removes attributes with low joint entropy (S120). The anonymizing device 1 may remove the attributes that rank low among the joint entropies sorted in descending order. In the first iteration, the joint entropies of the attribute combinations have not yet been calculated, so attributes with low entropy may be removed instead.

The anonymizing device 1 removes attributes with high dependency (S130). The anonymizing device 1 may remove the attributes that have a high dependency among the dependencies sorted in ascending order. In the first iteration, the dependencies have not yet been calculated, so step S130 may be skipped.

The anonymizing device 1 combines attributes (S140). The anonymizing device 1 may combine the individual attributes to generate the various possible attribute combinations.

The anonymizing device 1 calculates the joint entropy (S150). The anonymizing device 1 may calculate the joint entropy of the generated attribute combinations. The joint entropy may reflect the characteristics of an attribute combination. The value of the joint entropy may also depend strongly on the entropies of the individual attributes that make up the combination.

The anonymizing device 1 calculates the dependency (S160). The anonymizing device 1 may calculate the dependency of the calculated joint entropy. Before calculating the dependency, the anonymizing device 1 may calculate the conditional entropy for the joint entropy.

The anonymizing device 1 selects an attribute combination (S170). The anonymizing device 1 may select an attribute combination that can deliver optimized performance. The optimized attribute combination may be an attribute or attribute combination with a high joint entropy and a low dependency.

The anonymizing device 1 performs the k-anonymity technique (S180). The anonymizing device 1 may anonymize the selected attribute or attribute combination using the k-anonymity technique. In doing so, the anonymizing device 1 can minimize the information loss caused by anonymization.
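One pass through steps S100 to S170 can be sketched roughly as follows. This is a simplified illustration, not the patented procedure: the quality check and loop of step S110 and the k-anonymity step S180 are omitted, only pairs of attributes are combined, the dependency normalization (shared information divided by the entropy of the second attribute) is an assumption, and `rank_pairs`, `keep_top`, and the sample table are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def rank_pairs(table, keep_top=3):
    """Rank attribute pairs by high joint entropy, then low dependency.

    table maps an attribute name to its list of values (equal lengths).
    """
    # S100/S120: keep only the attributes with the highest entropy.
    h = {name: entropy(col) for name, col in table.items()}
    kept = sorted(h, key=h.get, reverse=True)[:keep_top]

    scored = []
    # S140: combine the remaining attributes pairwise.
    for a, b in combinations(kept, 2):
        # S150: joint entropy of the pair.
        h_ab = entropy(list(zip(table[a], table[b])))
        # S160: assumed dependency, shared information over H(b).
        dep = (h[a] + h[b] - h_ab) / h[b] if h[b] > 0 else 0.0
        scored.append((f"{a}||{b}", h_ab, dep))

    # S170: prefer high joint entropy, break ties by low dependency.
    return sorted(scored, key=lambda s: (-s[1], s[2]))


# Hypothetical table; not data from the patent.
table = {
    "age": [30, 30, 45, 45],
    "age_months": [360, 360, 540, 540],
    "job": ["nurse", "clerk", "nurse", "clerk"],
    "site": ["A", "A", "A", "A"],
}
ranking = rank_pairs(table)
```

In this toy table the constant `site` column is pruned first, and the redundant pair `age||age_months` ranks last because combining the two adds no joint entropy and their dependency is 1.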

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer device. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over networked computer devices so that the computer-readable code is stored and executed in a distributed manner.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific preferred embodiments described. Anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications fall within the scope of the claims.

1: anonymizing device
610: input unit 630: control unit
650: output unit 670: storage unit
710: entropy calculation unit 720: attribute combination unit
730: joint entropy calculation unit 740: dependency calculation unit
750: attribute combination selection unit 760: k-anonymity unit

Claims

An anonymizing device using information theory of electronic medical records, comprising:
a storage unit for storing data of an electronic medical record (EMR);
an entropy calculation unit for calculating an entropy for individual attributes of the stored data;
an attribute combination unit for combining the individual attributes to generate attribute combinations;
a joint entropy calculation unit for calculating a joint entropy for the generated attribute combinations;
a dependency calculation unit for calculating a dependency of the calculated joint entropy;
an attribute combination selection unit for selecting, from among the attribute combinations of the joint entropy and the dependency, an attribute combination whose information loss is less than a preset amount; and
a k-anonymity unit for anonymizing the data using a k-anonymity technique based on the selected attribute combination.

The anonymizing device according to claim 1,
further comprising an input unit for receiving new data of the electronic medical record and receiving the variable k required for generalization of the data.

The anonymizing device according to claim 1,
wherein the attribute combination unit
sorts the entropy or joint entropy in descending order and removes an attribute or attribute combination whose entropy or joint entropy ranks below a preset rank.

The anonymizing device according to claim 1,
wherein the attribute combination unit
sorts the dependency in ascending order and removes an attribute combination whose dependency is higher than a preset rank.

The anonymizing device according to claim 1,
wherein the joint entropy calculation unit
calculates the joint entropy of two different attributes by the following expression:

[Mathematical Expression]
$$H(x, y) = -\sum_{i} \sum_{j} p_{ij} \log p_{ij}$$

Here, i and j represent the states of the individual attributes, and $p_{ij}$ is the probability of ij, the combined state of the attributes.

The anonymizing device according to claim 1,
wherein the dependency calculation unit
calculates the conditional entropy for the joint entropy by the following expression:

[Mathematical Expression]

Here, the conditional term in the expression denotes the individual attribute x with respect to the individual attribute y.

7. The anonymizing device according to claim 1 or 6,
wherein the dependency calculation unit
calculates the dependency of x on the individual attribute y by the following expression:

[Mathematical Expression]

Here, the entropy term in the expression is the entropy of the individual attribute y.

The anonymizing device according to claim 1,
wherein the attribute combination selection unit
repeats the operations of the attribute combination unit, the joint entropy calculation unit, and the dependency calculation unit when no attribute combination has an information loss less than the preset amount.

An anonymizing method using information theory of electronic medical records, comprising:
loading data of an electronic medical record (EMR);
calculating an entropy for individual attributes of the loaded data;
combining the individual attributes to generate attribute combinations;
calculating a joint entropy for the generated attribute combinations;
calculating a dependency of the calculated joint entropy;
selecting, from among the attribute combinations of the joint entropy and the dependency, an attribute combination whose information loss is less than a preset amount; and
anonymizing the data using a k-anonymity technique based on the selected attribute combination.