KR20200026559A

KR20200026559A - Dataset De-identification Method and Apparatus Using K-anonymity Model

Info

Publication number: KR20200026559A
Application number: KR1020180104660A
Authority: KR
Inventors: 박성규
Original assignee: (주)아이알컴퍼니
Priority date: 2018-09-03
Filing date: 2018-09-03
Publication date: 2020-03-11
Also published as: KR102126386B1

Abstract

The present invention relates to a dataset de-identification method and apparatus using a K-anonymity model. The method comprises: identifying the attribute value distribution of each column corresponding to a quasi-identifier attribute in a dataset to be de-identified; a primary processing step of de-identifying each column of the dataset in consideration of the per-column attribute value distribution; grouping records into a record group when K or more records in the primarily processed dataset have identical attribute values in one or more columns; and, within the record group, de-identifying the columns whose attribute values are not identical, in consideration of the attribute value distribution of those columns, so that the K-anonymity requirement is satisfied. The invention thereby minimizes data degradation caused by over-generalization while satisfying the K-anonymity requirement, and increases de-identification speed.

Description

Dataset De-identification Method and Apparatus Using K-anonymity Model

The present invention relates to a dataset de-identification method and apparatus, and more particularly to a dataset de-identification method and apparatus using a K-anonymity model.

Data containing personal information is collected and used by data collectors (for example, organizations such as companies, hospitals, and government agencies). Data collectors gather vast amounts of personal information about customers or users in order to provide, among other things, personalized services. A data collector may also consolidate the collected data and commission a third party (for example, a data analysis institution) to analyze it. Because the collected personal information includes sensitive information about the data subjects, it may be misused for various crimes if it is leaked.

In general, statistically collected data is classified into identifiers, quasi-identifiers (QI), and sensitive attributes (SA). An identifier, such as a resident registration number, explicitly reveals an individual's identity. A quasi-identifier is an attribute that describes a characteristic of an individual, such as date of birth, sex, or postal code; it does not identify a person directly, but it can identify a person indirectly when combined with other attributes. A sensitive attribute represents the sensitive personal information that the data table is intended to provide.

In general, to provide information about sensitive attributes, privacy is protected by removing identifiers and anonymizing quasi-identifiers. The K-anonymity technique is used for this purpose; it is one of the privacy protection models designed to prevent linkage attacks, in which sensitive information is uncovered by linking publicly available information.

The K-anonymity model is defined as requiring that, in a released dataset, at least K records exist that share the same quasi-identifier attribute values, such as age and region of residence. K-anonymity is required in order to prevent re-identification of de-identified personal information.
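The definition above can be checked mechanically. The following is a minimal sketch, not part of the patent disclosure; the function name `is_k_anonymous`, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return all(n >= k for n in counts.values())

# Hypothetical sample rows; "diagnosis" plays the role of a sensitive attribute.
rows = [
    {"age": "20-29", "region": "Seoul", "diagnosis": "flu"},
    {"age": "20-29", "region": "Seoul", "diagnosis": "asthma"},
    {"age": "20-29", "region": "Seoul", "diagnosis": "flu"},
    {"age": "30-39", "region": "Gyeonggi-do", "diagnosis": "flu"},
]

# The fourth row is alone in its (age, region) class, so K=3 fails for the full table.
print(is_k_anonymous(rows, ["age", "region"], 3))
# The first three rows share one class of size 3, so K=3 holds for them.
print(is_k_anonymous(rows[:3], ["age", "region"], 3))
```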

FIG. 1 is a diagram provided to explain a de-identification process using a conventional K-anonymity model.

Referring to FIG. 1, when K is set to 3 for quasi-identifiers such as 'name', 'age', 'resident registration number', and 'address', the 'name' and 'resident registration number' columns, for which K=3 is difficult to satisfy, are deleted; 'age' is categorized in 5-year bands; and 'address' undergoes partial deletion that removes the 'dong' (neighborhood) unit (Round 1). Because K=3 is not satisfied in Round 1, 'age' is categorized in 10-year bands and partial deletion is applied to 'address' one level further (Round 2). Because K=3 is still not satisfied in Round 2, 'age' is further categorized as '20-60', and K=3 is finally satisfied.
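The round-by-round widening of the 'age' bands can be illustrated with a short sketch. It is a simplified illustration under stated assumptions: the ages are invented (they are not the values shown in FIG. 1), only the 'age' column is considered, and the helper names `bin_age` and `smallest_class` are hypothetical.

```python
from collections import Counter

def bin_age(age, width):
    """Map an age to a band label such as '20-24' for the given band width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_class(labels):
    """Size of the smallest group of identical labels."""
    return min(Counter(labels).values())

K = 3
ages = [21, 24, 33, 38, 45, 52]  # hypothetical ages, not the values in FIG. 1

# Round 1 uses 5-year bands, Round 2 uses 10-year bands.
for width in (5, 10):
    labels = [bin_age(a, width) for a in ages]
    print(width, labels, smallest_class(labels) >= K)

# Final round: every age collapses into one band, which satisfies K but
# leaves the column with almost no analytical value.
final = ["20-60"] * len(ages)
print(final, smallest_class(final) >= K)
```

With these sample ages neither the 5-year nor the 10-year bands reach K=3, so the whole column has to be collapsed, which is the kind of degradation the following paragraph describes.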

As described above, when a hierarchical algorithm is applied directly to satisfy conventional K-anonymity, raising the generalization level step by step along a hierarchy lattice formed by combining the per-attribute generalization hierarchy trees, columns that could have been used for analysis are deleted excessively, or the data is so heavily degraded after de-identification that it becomes difficult to use as analysis data.

Korean Patent Application Publication No. 10-2012-0063050 (publication date: 2012-06-15)
Korean Patent No. 1,652,328 (registration date: 2016-08-24)

Accordingly, the technical problem to be solved by the present invention is to provide a dataset de-identification method and apparatus using a K-anonymity model that can minimize data degradation.

To solve the above technical problem, a data de-identification method with an improved K-anonymity algorithm according to the present invention includes: identifying, for a dataset to be de-identified, the attribute value distribution of each column corresponding to a quasi-identifier attribute; a primary processing step of de-identifying each column of the dataset in consideration of the per-column attribute value distribution; grouping, when K or more records in the primarily processed dataset have identical attribute values in one or more columns, those records into a record group; and de-identifying, within the record group, the columns whose attribute values are not identical, in consideration of the attribute value distribution of those columns, so that the K-anonymity requirement is satisfied.

The de-identification processing may be one of pseudonymization, aggregation, data deletion, data categorization, and data masking.

The method may further include deleting, from the dataset to be de-identified, records that do not satisfy the K-anonymity requirement.

In the primary processing step, the de-identification processing may be performed such that the proportion of records grouped into record groups, out of all records in the dataset to be de-identified, is equal to or greater than a predetermined threshold.

To solve the above technical problem, a data de-identification apparatus with an improved K-anonymity algorithm according to the present invention includes: a distribution checking unit that identifies, for a dataset to be de-identified, the attribute value distribution of each column corresponding to a quasi-identifier attribute; a primary processing unit that de-identifies each column of the dataset in consideration of the per-column attribute value distribution; a grouping unit that, when K or more records in the primarily processed dataset have identical attribute values in one or more columns, groups those records into a record group; and a data de-identification unit that de-identifies, within the record group, the columns whose attribute values are not identical, in consideration of the attribute value distribution of those columns, so that the K-anonymity requirement is satisfied.

The apparatus may further include a data deletion unit that deletes, from the dataset to be de-identified, records that do not satisfy the K-anonymity requirement.

The primary processing unit may perform the de-identification processing such that the proportion of records grouped into record groups, out of all records in the dataset to be de-identified, is equal to or greater than a predetermined threshold.

The invention may also include a computer-readable recording medium on which a program for causing a computer to execute the method is recorded.

According to the present invention, data degradation can be minimized while the K-anonymity requirement is satisfied, and the de-identification processing speed is improved.

FIG. 1 is a diagram provided to explain a de-identification process using a conventional K-anonymity model.
FIG. 2 is a block diagram showing the configuration of a dataset de-identification apparatus using a K-anonymity model according to an embodiment of the present invention.
FIG. 3 is a diagram provided to explain a de-identification process using a K-anonymity model according to an embodiment of the present invention.
FIG. 4 is a flowchart provided to explain the operation of a dataset de-identification apparatus using a K-anonymity model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it.

FIG. 2 is a block diagram showing the configuration of a dataset de-identification apparatus using a K-anonymity model according to an embodiment of the present invention, and FIG. 3 is a diagram provided to explain a de-identification process using a K-anonymity model according to an embodiment of the present invention.

Referring to FIG. 2, the dataset de-identification apparatus (100) using a K-anonymity model according to an embodiment of the present invention may include a distribution checking unit (110), a primary processing unit (120), a grouping unit (130), a data de-identification unit (140), and a data deletion unit (150).

The distribution checking unit (110) may identify, for the dataset to be de-identified, the attribute value distribution of each column corresponding to a quasi-identifier attribute.

When the dataset to be de-identified (10) consists of a plurality of columns (11, 12, 13, 14) corresponding to quasi-identifier attributes such as name, age, resident registration number, and address, as illustrated in FIG. 3, the distribution checking unit (110) may identify the attribute value distribution of each column.
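One simple reading of "attribute value distribution for each column" is a per-column frequency count. The sketch below assumes that reading; the patent does not specify how the distribution is computed, and the function name `column_distributions` and the sample rows are hypothetical.

```python
from collections import Counter

def column_distributions(records, columns):
    """Count how often each value occurs in each quasi-identifier column."""
    return {col: Counter(rec[col] for rec in records) for col in columns}

# Hypothetical rows; FIG. 3 is not reproduced here.
rows = [
    {"age": 27, "address": "Seoul"},
    {"age": 27, "address": "Gyeonggi-do"},
    {"age": 41, "address": "Seoul"},
]

dist = column_distributions(rows, ["age", "address"])
# A column dominated by rare values (many counts of 1) will need stronger
# generalization than a column whose values already repeat.
for col, counter in dist.items():
    print(col, counter.most_common())
```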

The primary processing unit (120) may de-identify each column of the dataset to be de-identified (10) in consideration of the per-column attribute value distribution identified by the distribution checking unit (110).

For example, if the last two characters of each name in the name column (11) are masked and the age column (12) is categorized in 10-year bands, records (R1, R4, R7) and records (R2, R3, R6) can be made to satisfy the K-anonymity requirement partially, that is, with respect to the name column (11) and the age column (12). Here K is assumed to be 3. For the resident registration number column (13) and the address column (14), on the other hand, K=3 cannot be satisfied given their attribute value distributions, so these columns may be de-identified up to the stage just before the data would be completely degraded. For example, the resident registration number column (13) may be aggregated, based on the leading digit of the resident registration number, into birth cohorts such as born in the '60s, '70s, '80s, and '90s, and the address column (14) may be partially deleted so that only the top-level administrative unit, such as Gyeonggi-do or Seoul, remains.
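The four column-level transformations in this example can be sketched as small functions. This is an illustrative sketch only: the function names, the label formats, the assumption that an address string starts with its top-level administrative unit, and the sample record (including the obviously fake registration number) are hypothetical.

```python
def mask_name(name):
    """Mask the last two characters of a name, always keeping at least the first."""
    keep = max(len(name) - 2, 1)
    return name[:keep] + "*" * (len(name) - keep)

def age_band(age):
    """Categorize an age into a 10-year band such as '20-29'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def birth_decade(rrn):
    """Aggregate a resident registration number to a birth decade using its leading digit."""
    return f"{rrn[0]}0s"

def top_level_region(address):
    """Keep only the first (top-level) administrative unit of an address."""
    return address.split()[0]

def primary_process(record):
    """Apply one transformation per quasi-identifier column."""
    return {
        "name": mask_name(record["name"]),
        "age": age_band(record["age"]),
        "rrn": birth_decade(record["rrn"]),
        "address": top_level_region(record["address"]),
    }

# Hypothetical record with a deliberately fake registration number.
sample = {"name": "홍길동", "age": 27, "rrn": "900101-0000000",
          "address": "Seoul Gangnam-gu Yeoksam-dong"}
print(primary_process(sample))
```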

For the de-identification processing performed by the primary processing unit (120), any of the 17 generally known methods may be applied in consideration of the attributes of the column concerned: heuristic pseudonymization, encryption, and swapping, which are pseudonymization techniques; aggregation (sum or average), partial aggregation, rounding, and rearrangement, which are aggregation techniques; identifier deletion, partial identifier deletion, record deletion, and deletion of all identifying elements, which are data deletion techniques; hiding, random rounding, the range method, and controlled rounding, which are data categorization techniques; and random noise addition and blanking-and-replacement, which are data masking techniques.

When K or more records (here, K=3) in the primarily processed dataset (20) have identical attribute values in one or more columns, the grouping unit (130) may group the records having the identical attribute values into a record group. In this example, the grouping unit (130) may group records (R1, R4, R7) into one group and records (R2, R3, R6) into another.
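The grouping step can be sketched as bucketing records by their values in the columns that already match and keeping only buckets of size K or more. The record contents below are invented so that the resulting groups mirror the (R1, R4, R7) and (R2, R3, R6) groups named in the text; the function name `group_records` is hypothetical.

```python
from collections import defaultdict

def group_records(records, columns, k):
    """Return lists of record ids whose values in `columns` are identical, for groups of size >= k."""
    buckets = defaultdict(list)
    for rid, rec in records.items():
        buckets[tuple(rec[c] for c in columns)].append(rid)
    return [ids for ids in buckets.values() if len(ids) >= k]

# Hypothetical primarily processed records (masked name, 10-year age band).
processed = {
    "R1": {"name": "김**", "age": "20-29"},
    "R2": {"name": "이**", "age": "30-39"},
    "R3": {"name": "이**", "age": "30-39"},
    "R4": {"name": "김**", "age": "20-29"},
    "R5": {"name": "박**", "age": "50-59"},
    "R6": {"name": "이**", "age": "30-39"},
    "R7": {"name": "김**", "age": "20-29"},
}

groups = group_records(processed, ["name", "age"], k=3)
print(groups)  # R5 matches no other record and is left ungrouped
```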

The data de-identification unit (140) may, for each record group, de-identify the columns whose attribute values are not identical, in consideration of the attribute value distribution of those columns, so that the K-anonymity requirement is satisfied. For example, for the record group (R1, R4, R7), in consideration of the attribute value distribution of the columns (13, 14) whose attribute values are not identical, column (13) may be de-identified to "born in the '90s" and column (14) to "Seoul-Gyeonggi-do". Likewise, for the record group (R2, R3, R6), in consideration of the attribute value distribution of the columns (13, 14) whose attribute values are not identical, column (13) may be de-identified to "born in the '70s-'80s" and column (14) to "Seoul-Gyeonggi-do".

In the case of the resident registration number column (13), the record group (R1, R4, R7) already satisfies the K-anonymity requirement with "born in the '90s", so no further de-identification is performed, whereas the record group (R2, R3, R6) contains people born in both the '70s and the '80s, so de-identification to "born in the '70s-'80s" may be performed. In this way, the degree of generalization can differ from column to column. In this example, the record group (R2, R3, R6) is generalized further than the record group (R1, R4, R7).
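The per-group generalization described above can be sketched as follows: within one record group, a column whose values already agree is left alone, and a column with differing values is replaced by a label that merges them. Joining the distinct values with a hyphen is an assumption chosen to reproduce labels such as 'Seoul-Gyeonggi-do'; the patent does not prescribe a merging rule, and the function name and sample values are hypothetical.

```python
def generalize_group(records, group_ids, columns):
    """Within one record group, merge differing values of each column into a single label (in place)."""
    for col in columns:
        # dict.fromkeys keeps first-seen order while removing duplicates.
        distinct = list(dict.fromkeys(records[rid][col] for rid in group_ids))
        if len(distinct) > 1:
            merged = "-".join(distinct)
            for rid in group_ids:
                records[rid][col] = merged

# Hypothetical values for the two columns that differ inside each group.
records = {
    "R1": {"birth": "90s", "region": "Seoul"},
    "R4": {"birth": "90s", "region": "Gyeonggi-do"},
    "R7": {"birth": "90s", "region": "Seoul"},
    "R2": {"birth": "70s", "region": "Seoul"},
    "R3": {"birth": "80s", "region": "Gyeonggi-do"},
    "R6": {"birth": "70s", "region": "Seoul"},
}

for group in (["R1", "R4", "R7"], ["R2", "R3", "R6"]):
    generalize_group(records, group, ["birth", "region"])

print(records["R1"])  # birth is untouched, region is merged
print(records["R2"])  # both columns are merged
```

Because each group is generalized only as far as its own values require, the first group keeps the narrower birth label while the second receives the wider one.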

The data deletion unit (150) may delete the record (R5) that does not satisfy the K-anonymity requirement from the dataset (30) on which de-identification has been performed by the data de-identification unit (140). Satisfying the K-anonymity requirement for every record in the original dataset to be de-identified (10) could aggravate data degradation, so the record (R5) is deleted instead, which finally completes the de-identification of the dataset.
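Record deletion can be sketched as a filter that keeps only records whose quasi-identifier combination occurs at least K times. The rows below are hypothetical stand-ins for the dataset (30); only the record ids follow the text, and the function name is illustrative.

```python
from collections import Counter

def drop_non_k_anonymous(records, quasi_identifiers, k):
    """Keep only records whose quasi-identifier combination occurs at least k times."""
    def key(rec):
        return tuple(rec[c] for c in quasi_identifiers)
    counts = Counter(key(rec) for rec in records)
    return [rec for rec in records if counts[key(rec)] >= k]

# Hypothetical fully processed rows; R5 matches no other record.
qi = ["name", "age", "birth", "region"]
rows = [
    {"id": "R1", "name": "김**", "age": "20-29", "birth": "90s", "region": "Seoul-Gyeonggi-do"},
    {"id": "R2", "name": "이**", "age": "30-39", "birth": "70s-80s", "region": "Seoul-Gyeonggi-do"},
    {"id": "R3", "name": "이**", "age": "30-39", "birth": "70s-80s", "region": "Seoul-Gyeonggi-do"},
    {"id": "R4", "name": "김**", "age": "20-29", "birth": "90s", "region": "Seoul-Gyeonggi-do"},
    {"id": "R5", "name": "박**", "age": "50-59", "birth": "60s", "region": "Seoul"},
    {"id": "R6", "name": "이**", "age": "30-39", "birth": "70s-80s", "region": "Seoul-Gyeonggi-do"},
    {"id": "R7", "name": "김**", "age": "20-29", "birth": "90s", "region": "Seoul-Gyeonggi-do"},
]

kept = drop_non_k_anonymous(rows, qi, k=3)
print([rec["id"] for rec in kept])
```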

FIG. 4 is a flowchart provided to explain the operation of a dataset de-identification apparatus using a K-anonymity model according to an embodiment of the present invention.

Referring to FIG. 4, the distribution checking unit (110) may first identify, for the dataset to be de-identified, the attribute value distribution of each column corresponding to a quasi-identifier attribute (S410).

Next, the primary processing unit (120) may de-identify each column of the dataset to be de-identified (10) in consideration of the per-column attribute value distribution identified by the distribution checking unit (110) (S420).

In step S420, the primary processing unit (120) may perform the de-identification processing such that the proportion of records grouped into record groups, out of all records in the dataset to be de-identified, is equal to or greater than a predetermined threshold.
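One way to read this criterion is as a search for the least aggressive generalization whose grouped-record proportion reaches the threshold. The sketch below applies that idea to a single age column; the candidate band widths, the 0.8 threshold, the sample ages, and the function names are assumptions made for illustration and are not specified in the patent.

```python
from collections import Counter

def grouped_ratio(labels, k):
    """Fraction of records that fall into a class of identical labels with at least k members."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    grouped = sum(n for n in counts.values() if n >= k)
    return grouped / len(labels)

def pick_band_width(ages, k, min_ratio, widths=(5, 10, 20)):
    """Return the narrowest candidate band width whose grouped ratio reaches min_ratio, else None."""
    for width in widths:
        labels = [(age // width) * width for age in ages]
        if grouped_ratio(labels, k) >= min_ratio:
            return width
    return None

ages = [21, 23, 27, 28, 29, 34, 36, 38, 55]  # hypothetical ages
# 5-year bands group only 3 of 9 records; 10-year bands group 8 of 9.
print(pick_band_width(ages, k=3, min_ratio=0.8))
```

Records left ungrouped at the chosen width, such as the single 55-year-old here, correspond to the records that the later deletion step may remove.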

Thereafter, when K or more records in the primarily processed dataset (20) have identical attribute values in one or more columns, the grouping unit (130) may group the records having the identical attribute values into a record group (S430).

Next, the data de-identification unit (140) may, for each record group, de-identify the columns whose attribute values are not identical, in consideration of the attribute value distribution of those columns, so that the K-anonymity requirement is satisfied (S440).

Finally, the data deletion unit (150) may delete the record (R5) that does not satisfy the K-anonymity requirement from the dataset (30) on which de-identification has been performed by the data de-identification unit (140) (S450). Satisfying the K-anonymity requirement for every record in the original dataset to be de-identified (10) could aggravate data degradation, so the record (R5) is deleted instead, which finally completes the de-identification of the dataset.

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program for executing the method described above. The medium may contain program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although preferred embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

110: distribution checking unit
120: primary processing unit
130: grouping unit
140: data de-identification unit
150: data deletion unit

Claims

1. A data de-identification method with an improved K-anonymity algorithm, comprising:
identifying, for a dataset to be de-identified, an attribute value distribution for each column corresponding to a quasi-identifier attribute;
a primary processing step of de-identifying each column of the dataset to be de-identified in consideration of the attribute value distribution for each column;
grouping, when K or more records in the primarily processed dataset have identical attribute values in one or more columns, the records having the identical attribute values into a record group; and
de-identifying, in the record group, a column whose attribute values are not identical, in consideration of the attribute value distribution of that column, so as to satisfy a K-anonymity requirement.

2. The method of claim 1,
wherein the de-identification processing is one of pseudonymization, aggregation, data deletion, data categorization, and data masking.

3. The method of claim 1,
further comprising deleting, from the dataset to be de-identified, records that do not satisfy the K-anonymity requirement.

4. The method of claim 1,
wherein the de-identification processing in the primary processing step is performed such that the proportion of records grouped into record groups, among all records of the dataset to be de-identified, is equal to or greater than a predetermined threshold.

5. A data de-identification apparatus with an improved K-anonymity algorithm, comprising:
a distribution checking unit that identifies, for a dataset to be de-identified, an attribute value distribution for each column corresponding to a quasi-identifier attribute;
a primary processing unit that de-identifies each column of the dataset to be de-identified in consideration of the attribute value distribution for each column;
a grouping unit that, when K or more records in the primarily processed dataset have identical attribute values in one or more columns, groups the records having the identical attribute values into a record group; and
a data de-identification unit that de-identifies, in the record group, a column whose attribute values are not identical, in consideration of the attribute value distribution of that column, so as to satisfy a K-anonymity requirement.

6. The apparatus of claim 5,
wherein the de-identification processing is one of pseudonymization, aggregation, data deletion, data categorization, and data masking.

7. The apparatus of claim 6,
further comprising a data deletion unit that deletes, from the dataset to be de-identified, records that do not satisfy the K-anonymity requirement.

8. The apparatus of claim 5,
wherein the primary processing unit performs the de-identification processing such that the proportion of records grouped into record groups, among all records of the dataset to be de-identified, is equal to or greater than a predetermined threshold.

9. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the method according to any one of claims 1 to 4.