KR102218374B1

KR102218374B1 - Method and Apparatus for Measuring Quality of De-identified Data for Unstructured Transaction

Info

Publication number: KR102218374B1
Application number: KR1020190045149A
Authority: KR
Inventors: 이원석
Original assignee: 연세대학교 산학협력단
Priority date: 2019-04-17
Filing date: 2019-04-17
Publication date: 2021-02-19
Also published as: KR20200122195A

Abstract

The present embodiments provide a method and apparatus for measuring the quality of de-identified personal data. Re-identifiability is verified using a personal redundancy model, which checks whether identical item sets in a transaction database holding personal information on multiple individuals appear in the transactions of a minimum number of different individuals, so that no specific individual can be identified. Similarity to the original is measured using a utilization quality indicator that numerically measures the statistical similarity between original records and de-identified records. Together these yield an indicator of the similarity between de-identified transaction data and the original data and a metric for the possibility of re-identification.

Description

{Method and Apparatus for Measuring Quality of De-identified Data for Unstructured Transaction}

The technical field of the present embodiments relates to a method and apparatus for measuring the quality of de-identified unstructured transaction data.

The content described in this section merely provides background information on the present embodiments and does not constitute prior art.

In recent years, the rapid development and explosive growth of information and communication technology have increased the volume of data, and technologies for collecting, managing, and analyzing large volumes of data on advanced distributed processing systems have matured accordingly.

Among the large volumes of data in circulation, transaction data is unstructured data that carries varied information, such as individuals' supermarket purchases and hospital diagnosis histories, and appears as item sets of various forms. Because transaction data helps uncover a great deal of new knowledge in areas such as marketing and pharmaceuticals, research on and distribution of transaction data are essential.

However, distributing transaction data carries the risk of privacy infringement, because individual records in the data can be linked to identity information. Personal information protection laws have been introduced to prevent such infringement, and under current regulations de-identification is mandatory when using data that contains personally identifiable information, so that leaks of private information do not occur.

Korean Laid-open Patent Publication No. 10-2018-0119104 (2018.11.01)

A main object of the embodiments of the present invention is to provide an indicator of the similarity between de-identified data and original data and a metric for the possibility of re-identification.

Other objects of the present invention that are not explicitly stated may additionally be considered within the scope that can be readily inferred from the following detailed description and its effects.

According to one aspect of the present embodiments, there is provided a method, performed by a computing device, of measuring the quality of de-identified personal data, the method comprising: verifying re-identifiability using a personal redundancy model, which is a model that verifies whether identical item sets in a transaction database holding personal information on multiple individuals appear in the transactions of a minimum number of different individuals so that a specific individual cannot be identified; and measuring similarity to the original using a utilization quality indicator that numerically measures the statistical similarity between original records and de-identified records.

According to another aspect of the present embodiments, there is provided an apparatus for measuring the quality of de-identified personal data, comprising one or more processors and a memory storing one or more programs executed by the one or more processors, wherein the processor comprises: a re-identification verification unit that verifies re-identifiability using a personal redundancy model, which is a model that verifies whether identical item sets in a transaction database holding personal information on multiple individuals appear in the transactions of a minimum number of different individuals so that a specific individual cannot be identified; and an original similarity measurement unit that measures similarity to the original using a utilization quality indicator that numerically measures the statistical similarity between original records and de-identified records.

As described above, according to the embodiments of the present invention, re-identification risk and personal estimation possibility are verified with the 'personal redundancy (p)' model, which checks whether identical item sets in a transaction database holding personal information on several individuals appear for at least p different individuals so that a specific individual cannot be identified. Original similarity quantitatively evaluates how similar the de-identified transaction data is to the original transaction data. This makes it possible to avoid using data that is so similar to the original that a specific individual can be re-identified, or so different from the original that its statistical similarity is poor.

Effects not explicitly mentioned here, including the effects described in the following specification that are expected from the technical features of the present invention and their provisional effects, are treated as if they were described in the specification of the present invention.

FIG. 1 is a block diagram illustrating an apparatus for measuring the quality of de-identified personal data according to an embodiment of the present invention.
FIG. 2 is a flowchart of the de-identification risk quality indicator according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the retention-rate original similarity according to an embodiment of the present invention.
FIG. 4 is a flowchart of the item-based original similarity according to an embodiment of the present invention.
FIG. 5 is a diagram showing the item hierarchy information of Table 2 according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of measuring the quality of de-identified personal data according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of measuring the quality of de-identified personal data according to an embodiment of the present invention.
FIG. 8 is a block diagram illustrating a computing environment including a computing device suitable for use in the embodiments.

Hereinafter, in describing the present invention, detailed descriptions of related known functions are omitted where they are obvious to those skilled in the art and could unnecessarily obscure the gist of the invention, and some embodiments of the present invention are described in detail with reference to exemplary drawings.

FIG. 1 is a block diagram illustrating an apparatus for measuring the quality of de-identified personal data. As shown in FIG. 1, the quality measurement apparatus 10 includes a re-identification verification unit 100 and an original similarity measurement unit 200. The apparatus 10 may omit some of the components illustrated in FIG. 1 or may additionally include other components.

The quality measurement apparatus 10 presents quality evaluation indicators for a de-identified data set based on the re-identification risk and personal estimation possibility of the de-identified transaction data and on its similarity to the original.

Transaction data is a form of unstructured data that has no predefined data model or is not structured. It carries varied information, such as individuals' supermarket purchases and hospital diagnosis histories, and appears as item sets of various forms. Such data can be useful in areas such as marketing and pharmaceuticals.

De-identification is the process of transforming some or all of the personal information so that a specific individual cannot be identified; personally identifying data is converted to or replaced with other values.

The re-identification verification unit 100 includes a personal-redundancy-based table generation unit 110, a re-identification risk calculation unit 120, and a personal estimation possibility risk calculation unit 130.

The re-identification verification unit 100 verifies re-identifiability using the personal redundancy model, which is a model that verifies whether identical item sets in a transaction database holding personal information on multiple individuals appear in the transactions of a minimum number of different individuals so that a specific individual cannot be identified.

The personal-redundancy-based table generation unit 110 uses the personal support, that is, the total number of individuals who generated transactions in which an item of personal information appears, as the notion of support. With it, the unit finds the personal-redundancy-based frequent item sets and generates the personal-redundancy-based table.

When the personal redundancy p is 1, the re-identification risk calculation unit 120 calculates the re-identification risk by checking for transaction records whose items occur for only one specific individual, so that the record can be identified as that individual's transaction.

When the personal redundancy p is 2 or greater, the personal estimation possibility risk calculation unit 130 evaluates, as a probability, the possibility of inferring a specific individual and calculates the personal estimation possibility risk for the transaction database.

The original similarity measurement unit 200 includes a retention rate calculation unit 210 and an original similarity calculation unit 220.

The original similarity measurement unit 200 measures similarity to the original using a utilization quality indicator that numerically measures the statistical similarity between original records and de-identified records.

In converting original data into de-identified data, records that violate the de-identification treatment are removed, so the de-identified records are fewer than the original records. The retention rate calculation unit 210 calculates the retention rate from the number of de-identified records and the number of original records.

The original similarity calculation unit 220 forms a 1-item original similarity for each item of a transaction record and calculates the original similarity of the transaction record by comparing the per-item original similarities against the original data set. The 1-item original similarity is calculated from the number of child nodes in the item hierarchy relative to the domain size of all items; original similarity and information loss are inversely related. For the original similarity of a transaction record, each item in the transaction is weighted by the total size of the items, and the weights are applied to the 1-item original similarities.

The original similarity of transaction records expresses the similarity between the original record set and the de-identified result record set. The similarity to the original record is calculated for each record in the de-identified result set, and these similarities are averaged to obtain the result similarity, which serves as an indicator of the statistical similarity and utilization quality of the de-identification treatment.

The following describes the re-identification verification technique that the quality measurement apparatus 10 applies to unstructured transactions.

Personal redundancy (p-redundancy) requires that, in a transaction database of items holding personal information on several individuals, identical items appear in the transactions of at least P different individuals (P being a natural number), so that a specific individual cannot be identified.

p-redundancy in a transaction database holding personal information means that, for any given piece of lifestyle information, at least P different individuals must have that same information.

Table 1 below is the personal profile table, and Table 2 is the original per-individual transaction table.

| USER_ID | NAME | INCOME |
|---|---|---|
| UID1 | John | 3,400,000 |
| UID2 | Alice | 1,500,000 |
| UID3 | Bob | 5,640,000 |

| TRANSACTION | USER_ID | ITEMS |
|---|---|---|
| TID1 | UID1 | milk, eggs, butter, bread |
| TID2 | UID2 | milk, eggs, bread |
| TID3 | UID3 | milk, butter |
| TID4 | UID1 | milk, eggs, coffee |
| TID5 | UID2 | bread, ramen |
| TID6 | UID1 | bread, butter |

Given Tables 1 and 2, suppose an attacker knows that John bought eggs and butter. Searching the per-individual transaction database in Table 2, the attacker finds that TID1 is John's purchase record and thereby learns that John also bought milk and bread. This shows that information about an individual's purchased items is at risk of being leaked.
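The linkage attack just described can be sketched in a few lines. This is only an illustration on the Table 2 data; the function name `matching_transactions` and the tuple layout are assumptions made for the example, not part of the patent.

```python
# Table 2 from the text: (transaction id, user id, purchased items).
TABLE_2 = [
    ("TID1", "UID1", {"milk", "eggs", "butter", "bread"}),
    ("TID2", "UID2", {"milk", "eggs", "bread"}),
    ("TID3", "UID3", {"milk", "butter"}),
    ("TID4", "UID1", {"milk", "eggs", "coffee"}),
    ("TID5", "UID2", {"bread", "ramen"}),
    ("TID6", "UID1", {"bread", "butter"}),
]


def matching_transactions(table, known_items):
    """Return (tid, items) for every transaction containing all known items."""
    known = set(known_items)
    return [(tid, items) for tid, _uid, items in table if known <= items]


# The attacker's background knowledge: John bought eggs and butter.
known = {"eggs", "butter"}
matches = matching_transactions(TABLE_2, known)
for tid, items in matches:
    # Everything beyond the known items is newly disclosed information.
    print(tid, sorted(items - known))
```

Only one transaction matches, so the remaining items in it are disclosed along with the owner's identity.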

Accordingly, the risk quality evaluation indicators for transaction de-identification apply the personal redundancy (p-redundancy) model to the de-identified transaction data set and measure its re-identification risk and personal estimation possibility.

Personal redundancy is verified through the personal-redundancy-based frequent item sets. These are obtained in a manner similar to the Apriori method, except that the support is the personal support, that is, the total number of individuals who generated transactions in which the item set appears.

Table 3 below shows the per-individual purchase history derived from Table 2.

| USER_ID | NAME | ITEM SET |
|---|---|---|
| UID1 | John | milk, bread, butter, eggs, ramen |
| UID2 | Alice | milk, bread, eggs, coffee |
| UID3 | Bob | milk, butter |

Table 4 below gives the personal-support-based support of every item set in Table 3; it is the personal-support-based table for the transaction table.

| 1-ITEM | PS | ITEM SET | PS | ITEM SET | PS | ITEM SET | PS |
|---|---|---|---|---|---|---|---|
| milk | 3 | milk, eggs | 2 | eggs, coffee | 1 | milk, eggs, coffee | 1 |
| eggs | 2 | milk, butter | 2 | bread, butter | 1 | milk, butter, bread | 1 |
| butter | 2 | milk, bread | 2 | bread, coffee | 1 | milk, eggs, butter, bread | 1 |
| bread | 2 | milk, coffee | 1 | bread, ramen | 1 | | |
| coffee | 1 | eggs, butter | 1 | milk, eggs, bread | 2 | | |
| ramen | 1 | eggs, bread | 2 | milk, eggs, butter | 1 | | |

Referring to Table 4, the personal support of each item (set) shows in how many of the three customers' (John, Alice, and Bob) transactions the item set appears.
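As a rough sketch of how personal support differs from ordinary transaction support, the following counts distinct users per item set over the Table 2 transactions. It enumerates subsets by brute force rather than using Apriori's level-wise pruning, and the function name `personal_support_table` is hypothetical. For the three item sets printed at the end, the counts agree with Table 4 (3, 2, and 1).

```python
from itertools import combinations

# Table 2 from the text: (transaction id, user id, purchased items).
TABLE_2 = [
    ("TID1", "UID1", {"milk", "eggs", "butter", "bread"}),
    ("TID2", "UID2", {"milk", "eggs", "bread"}),
    ("TID3", "UID3", {"milk", "butter"}),
    ("TID4", "UID1", {"milk", "eggs", "coffee"}),
    ("TID5", "UID2", {"bread", "ramen"}),
    ("TID6", "UID1", {"bread", "butter"}),
]


def personal_support_table(table):
    """Map each item set occurring in some transaction to its personal support.

    Personal support = number of distinct users with at least one
    transaction that contains the whole item set.
    """
    users_by_itemset = {}
    for _tid, uid, items in table:
        ordered = sorted(items)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                users_by_itemset.setdefault(frozenset(subset), set()).add(uid)
    return {itemset: len(users) for itemset, users in users_by_itemset.items()}


ps = personal_support_table(TABLE_2)
print(ps[frozenset({"milk"})])
print(ps[frozenset({"milk", "eggs"})])
print(ps[frozenset({"bread", "butter"})])
```

Note that <bread, butter> occurs in two transactions (TID1 and TID6) but both belong to UID1, so its personal support is 1 rather than 2.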

The de-identification risk quality evaluation indicators use the records found by personal redundancy verification, which is run on the generated personal-redundancy-based frequent item set table and the corresponding de-identified transaction table.

Personal redundancy verification targets one of two indicators depending on the personal redundancy model, that is, the value of p. With p = 1 it yields the re-identification risk; with p of 2 or greater it yields the personal estimation possibility. Verification is performed after the personal-redundancy-based table for the de-identified transactions has been generated. The personal redundancy p is taken as the minimum personal support, and the number of transaction records containing an item set whose support does not exceed the minimum personal support is counted.

So that they can serve as standard measures, the re-identification risk and the personal estimation possibility are computed as ratios relative to the size of the whole data set, so that both always take values in [0,1].

The following describes the use of the re-identification risk and the personal estimation possibility risk as de-identification risk quality indicators. FIG. 2 is a flowchart of the de-identification risk quality indicator.

The value of the personal redundancy (p) can be chosen according to the nature of the data being analyzed. The larger p is, the higher the degree of de-identification; if p were infinite, nothing in the database could be distinguished, whereas if p is 1 the data has the same form as the existing database and every value can be distinguished. An appropriate value is therefore set according to the nature of the data being analyzed.

Personal redundancy verification is divided into the cases p = 1 and p ≥ 2: when p = 1 the re-identification risk is calculated, and when p ≥ 2 the personal estimation possibility is calculated.

1-redundancy verification checks for transaction records holding personal information whose items occur for only one specific individual, so that the record can be identified as that individual's transaction. In other words, it finds the transaction records that carry a re-identification risk. The verification sets p, the minimum personal support, to 1 and counts the transaction records containing an item set whose support does not exceed it.

The records identified by 1-redundancy verification have a unique combination of items, so they can be defined as records at risk of re-identification. So that it can serve as a standard measure, the re-identification risk is defined as follows.

Referring to Equation 1, for a personal redundancy of 1 the re-identification risk is the ratio of the number of transaction records at risk of re-identification to the total number of records in the de-identified transactions, so Equation 1 defines the re-identification risk as a value in [0,1]. Here T is the de-identified transaction set under examination.
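Written out with assumed symbols, the ratio described for Equation 1 might look as follows. The names $T_{p=1}$ (the records of T flagged by 1-redundancy verification) and $\mathrm{Risk}_{\mathrm{reid}}$ are illustrative notation chosen here and are not taken from the patent.

```latex
\mathrm{Risk}_{\mathrm{reid}}(T) = \frac{\lvert T_{p=1} \rvert}{\lvert T \rvert},
\qquad 0 \le \mathrm{Risk}_{\mathrm{reid}}(T) \le 1
```

Because the flagged records are a subset of T, the value is 0 when no record is flagged and 1 when every record is.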

When the personal redundancy (p) is 2 or greater, transaction data that has passed p-redundancy verification contains no item set that is unique, so no data remains from which a specific individual can be identified through items holding personal information. Although a specific individual cannot be re-identified, verifying with a personal redundancy of 2 or greater makes it possible to evaluate how likely it is that a specific individual can be inferred.

Table 5 below is the de-identified transaction database obtained by generalizing the original transaction database of Table 2 using the item hierarchy; it is the de-identified version of the Table 2 transaction table.

| TRANSACTION | USER_ID | ITEMS |
|---|---|---|
| TID1 | UID1 | milk, eggs, bread |
| TID2 | UID2 | milk, eggs, bread |
| TID3 | UID3 | eggs, bread |
| TID4 | UID1 | milk, eggs |
| TID5 | UID2 | bread, food |
| TID6 | UID1 | bread, food |

None of the transactions in Table 5 is found by 1-redundancy verification. 2-redundancy verification, however, finds four records (TID1, TID2, TID5, TID6). This is because the only transactions with the item set <milk, eggs, bread> are (TID1, UID1) and (TID2, UID2), and the only transactions with the item set <bread, food> are (TID5, UID2) and (TID6, UID1).

TID3 is not found because its item set <eggs, bread> also occurs in the records (TID1, UID1) and (TID2, UID2), giving a personal support of 3. Thus records that carry no re-identification risk but are found by personal redundancy verification at a low p value are records from which a specific individual is highly likely to be inferred.
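The Table 5 walk-through can be approximated with the sketch below. It assumes each record is tested on its full item set (a superset never has higher support than its subsets, so this is the least-supported item set in the record) and flags a record when that support does not exceed p. The function names and the `count_users` switch are hypothetical. When containing records are counted, p = 2 yields the four records named in the text; when distinct users are counted, TID4 is flagged as well, because <milk, eggs> occurs only for UID1 and UID2.

```python
# Table 5 from the text: (transaction id, user id, generalized items).
TABLE_5 = [
    ("TID1", "UID1", {"milk", "eggs", "bread"}),
    ("TID2", "UID2", {"milk", "eggs", "bread"}),
    ("TID3", "UID3", {"eggs", "bread"}),
    ("TID4", "UID1", {"milk", "eggs"}),
    ("TID5", "UID2", {"bread", "food"}),
    ("TID6", "UID1", {"bread", "food"}),
]


def flagged_records(table, p, count_users=True):
    """Return ids of records whose item set has support not exceeding p.

    Support of a record's item set is taken over all records that contain
    it: distinct users if count_users is True, otherwise containing records.
    """
    flagged = []
    for tid, _uid, items in table:
        holders = [(t, u) for t, u, other in table if items <= other]
        support = len({u for _t, u in holders}) if count_users else len(holders)
        if support <= p:
            flagged.append(tid)
    return flagged


def risk_ratio(table, p, count_users=True):
    """Share of records flagged at redundancy p; always within [0, 1]."""
    if not table:
        return 0.0
    return len(flagged_records(table, p, count_users)) / len(table)


print(flagged_records(TABLE_5, p=1))                     # 1-redundancy check
print(flagged_records(TABLE_5, p=2, count_users=False))  # counting records
print(flagged_records(TABLE_5, p=2, count_users=True))   # counting users
```

`risk_ratio` expresses the flagged share relative to the whole data set, in line with the [0,1] normalization described earlier.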

Accordingly, the personal estimation possibility is the probability that a specific individual can be inferred. When data has passed personal redundancy verification, identical items appear in the transactions of at least p different individuals in the transaction database, so the personal estimation possibility is 1/p.

Referring to Equation 2, for a personal redundancy of p the personal estimation possibility risk is defined as a value in [0,1] in terms of the transaction records for which personal estimation is possible and the total records of the de-identified transactions. Here T is the de-identified transaction set under examination.

Accordingly, calculating the personal estimation possibility of Table 5 with Equation 2 under 2-redundancy verification shows that 4/6 of the entire transaction database carries a personal estimation possibility risk with probability 1/2.

As verification measures for the de-identification model, the re-identification risk and the personal estimation possibility are risk quality indicators for the distribution of records. In contrast, original similarity is a utilization quality indicator that numerically measures the statistical similarity between the original record set and the result record set; that is, it quantitatively evaluates the usefulness of the generated data. Original similarity always takes a value in [0,1], so it can serve as a general quality indicator that supports quantitative evaluation.

The utilization quality indicators compute and evaluate original similarity in two ways. The first is the retention rate, which evaluates the usefulness of the de-identified data for distribution by using the proportion of data deleted during de-identification as an indicator. The second is the item-based original similarity.

The following describes the process of forming de-identified data by removing records that violate the applicable de-identification privacy model. FIG. 3 is a flowchart of the retention-rate original similarity.

During de-identification, records that violate the de-identification requirement are removed, so the result holds fewer records than the original. The more records are removed, the lower the utility of the data set. The retention rate is therefore defined as the ratio of the number of result records to the number of original records and is used as an index of data utility. FIG. 3 illustrates computing the retention rate as an original similarity.

Referring to Equation 3 above, the retention rate of a transaction data set T is obtained by dividing the number of de-identified records by the number of original records. The retention rate grows with the number of de-identified records; in other words, the fewer records are removed for violating the de-identification requirement, the higher the retention rate. Therefore, a higher retention rate means higher utility of the data set.
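The ratio described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `retention_rate` and its parameter names are assumptions made here and do not appear in the patent, and Equation 3 itself is not reproduced in this text.

```python
def retention_rate(original_count: int, deidentified_count: int) -> float:
    """Ratio of records that survive de-identification (hypothetical helper).

    original_count:     number of records in the original data set
    deidentified_count: number of records left after violating records are removed
    """
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    if not 0 <= deidentified_count <= original_count:
        raise ValueError("deidentified_count must lie in [0, original_count]")
    # The result always lies in [0, 1]; 1.0 means no record was removed.
    return deidentified_count / original_count
```

For instance, keeping 3 of 6 records gives a rate of 0.5, and keeping all 6 gives 1.0.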

The following describes the process of obtaining the item-based original similarity for the whole de-identified transaction data. It first defines the item similarity for each 1-item of a transaction record and finally arrives at the original similarity of the entire transaction data set. FIG. 4 is a flowchart of the item-based original similarity.

As the flowchart shows, the item-based original similarity is computed by defining the item similarity for each 1-item of a transaction record and then weighting each item by the total number of items in that record to define the original similarity of the record. The similarity values of all transaction records are then averaged to define the original similarity of the entire de-identified transaction data set.

Because transaction data takes the form of item sets, its original similarity can be obtained item by item through comparison with the original data set.

The data that make up an item include hierarchy information for that item, and this hierarchy information is used when the transaction data is de-identified. Items that risk a privacy breach by linking an individual's records to identity information are generalized using the hierarchy information. Generalization means mapping values from an initial domain to another domain so that different values of the initial domain map to a single value of the target domain. The amount of information loss is computed from how much generalization has taken place.

A 1-item is the unit that makes up transaction data, and the 1-item set $I = \{i_1, i_2, \ldots, i_n\}$ can be defined as the domain of all transactions ($|I| = n$, where $|I|$ is the total number of elements of the set $I$). The original transactions composed of 1-items are written $T = \{t_1, t_2, \ldots, t_t\}$ and the de-identified transactions $P = \{p_1, p_2, \ldots, p_k\}$; the items of a de-identified transaction correspond one-to-one with those of the original for each transaction.

To quantify how far a generalized, de-identified 1-item differs from the original, a nearest upper node is defined. For example, suppose u is an ancestor of a1, ..., al and has no leaf nodes beneath it other than a1, ..., al; then u is called the nearest upper node of a1, ..., al. In other words, in the hierarchy tree, the upper node that has only the leaf nodes a1, a2, ..., al as its child nodes is the nearest upper node for S.

The nearest upper node is used in computing the 1-item original similarity (the size of the node equals the total number of leaf nodes a1, ..., al, that is, l). Writing the nearest upper item of an original transaction item as $u_x$, the 1-item original similarity is computed in one of two cases.

The following describes the hierarchy information for the items that make up the transactions of Table 2, which are de-identified into Table 5. FIG. 5 shows the item hierarchy information of Table 2.

In the first case, the original similarity is 0. This happens when the de-identified item has been generalized all the way up to the top-level node (root node) of the hierarchy, so that all of its information is lost. FIG. 5 illustrates this.

As shown in FIG. 5 and Tables 2 and 5, the item 'butter' of (TID1, UID1) was generalized to the root node '*'. Transaction TID1 did not satisfy the de-identification requirement, so 'butter' had to be generalized; but generalizing it to 'meat', its nearest upper node, still did not satisfy the requirement. In the same way, generalizing 'meat' to its nearest upper node 'food' still did not satisfy the requirement, so the item was finally generalized to the root node '*'. When generalization reaches the root node (*), the item is pruned, and the de-identified transaction for TID1 becomes {milk, egg, bread}. Because all information about the item is lost, the original similarity of the item 'butter' in TID1 is computed as 0.

Pruning can reduce the search space by deciding whether part of a rule may be cut off or ignored.

In the item set {milk, egg, coffee} of TID3, the 1-item 'coffee' is generalized step by step to its nearest upper nodes in the same way. Because none of the upper-level items satisfies the de-identification requirement, it is finally generalized to the root node and pruned, and its original similarity becomes 0.

In the second case, the original similarity is computed from the size of a nearest upper node. An example is the 1-item 'ramen' in the item set {bread, ramen} of TID5, which is generalized to 'food'. This is the calculation used when an original 1-item is generalized to an item above its nearest upper item $u_x$ but below the top-level item. 'Snack' is the nearest upper node of 'ramen', so the size of 'snack' is 1. Because the item was ultimately generalized from 'snack' to 'food', the size of 'food' must be computed. The size of 'food' is 4, because 'food' has the 1-items {egg, butter, bread, ramen} as its leaf nodes.

The 1-item original similarity is computed relative to the size of the whole item domain, so it always lies in [0,1]. When an item has been generalized, its original similarity is computed against the baseline value of 1 for an item that has not been generalized, by measuring how many items, relative to the item domain size, the generalization covers and therefore how much information is lost.

Therefore, for (ramen -> food), the size of the whole item domain ($|I| = 6$) is the denominator and the size of the nearest upper node of the item 'snack' (4) is the numerator, so the information loss is $\frac{4}{6}$. The original similarity of (ramen -> food) in transaction TID5 is thus $1 - \frac{4}{6} = \frac{2}{6}$. In the same way, the original similarity of (butter -> food) in transaction TID6 is, like (ramen -> food), $1 - \frac{4}{6} = \frac{2}{6}$.

Tables 6 and 7 below show an example in which items are generalized to nearest upper items of different sizes. Table 6 is a second per-individual original transaction table, and Table 7 is the de-identified transaction table for Table 6.

Table 6
TRANSACTION | USER_ID | ITEMS
TID1 | UID1 | milk, egg, butter, bread
TID2 | UID2 | milk, egg, bread
TID3 | UID3 | egg, milk
TID4 | UID1 | milk, egg, coffee
TID5 | UID2 | bread, ramen
TID6 | UID1 | bread, butter

Table 7
TRANSACTION | USER_ID | ITEMS
TID1 | UID1 | milk, egg, bread
TID2 | UID2 | milk, egg, bread
TID3 | UID3 | egg, beverage
TID4 | UID1 | milk, egg
TID5 | UID2 | bread, food
TID6 | UID1 | bread, food

According to Tables 6 and 7 above, the 1-items {milk, coffee} of TID3 and TID4 were each generalized to the same upper item two levels up, 'beverage', which made the transactions satisfy the de-identification requirement. The items {ramen, butter} of TID5 and TID6 were each generalized to the same upper item two levels up, 'food', which made them de-identified transactions.

To compute the original similarity of the item 'milk' in TID3, the size of the whole item domain ($|I| = 6$) is the denominator and the size of the upper node two levels above 'milk' (2) is the numerator, so the information loss is $\frac{2}{6}$. The original similarity of the item (milk -> beverage) in TID3 is thus $1 - \frac{2}{6} = \frac{4}{6}$. By contrast, the original similarity of the item (ramen -> food) in TID5 is, as described above, $1 - \frac{4}{6} = \frac{2}{6}$.

Even when 1-items with different nearest upper nodes are generalized by the same number of steps, the information loss of the resulting 1-item differs if the sizes of the nearest upper nodes differ.

To summarize the calculation: the 1-item original similarity measures how far an item has been generalized, and therefore how much information has been lost, as the number of child leaf nodes relative to the domain size of all items, and then uses the fact that original similarity is inversely related to information loss. This quantity is called the original similarity (SIMitem) and is defined as follows.

Referring to Equation 4 above, the 1-item original similarity (1-item Similarity) computes how much information an original 1-item has lost through de-identification and measures the original similarity of the 1-item as inversely related to that amount of information loss.
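The worked examples above suggest a similarity of the form 1 minus (leaf count of the generalized node divided by the domain size). The Python sketch below illustrates that reading. It rests on assumptions: Equation 4 and FIG. 5 are not reproduced in this text, so the `PARENT` map is a hypothetical hierarchy laid out only to match the node sizes quoted above (food = 4 leaves, beverage = 2, snack = 1), and the function names are invented for illustration.

```python
from fractions import Fraction

# Hypothetical child -> parent map; an assumed stand-in for the FIG. 5 hierarchy.
PARENT = {
    "milk": "beverage", "coffee": "beverage",
    "egg": "food", "bread": "food",
    "butter": "meat", "ramen": "snack",
    "meat": "food", "snack": "food",
    "food": "*", "beverage": "*",
}
LEAVES = {"milk", "coffee", "egg", "bread", "butter", "ramen"}  # item domain, size 6


def has_ancestor(item, node):
    """True if `node` is `item` itself or lies on the path from `item` to the root."""
    while item is not None:
        if item == node:
            return True
        item = PARENT.get(item)
    return False


def leaf_count(node):
    """Size of a node: the number of leaf items at or below it."""
    return sum(1 for leaf in LEAVES if has_ancestor(leaf, node))


def item_similarity(original, generalized):
    """1-item similarity under the assumed form 1 - leaf_count / domain size."""
    if generalized == original:
        return Fraction(1)  # not generalized: nothing is lost
    if not has_ancestor(original, generalized):
        raise ValueError("generalized item is not an ancestor of the original")
    return 1 - Fraction(leaf_count(generalized), len(LEAVES))
```

Under these assumptions, `item_similarity("ramen", "food")` evaluates to 2/6, `item_similarity("milk", "beverage")` to 4/6, and an item generalized to the root `"*"` to 0, in line with the two cases described in the text.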

The transaction record original similarity is the similarity between a specific original record and its de-identified transaction record. It is computed from the similarity between each item of the de-identified transaction record and the corresponding original item. Every item in a transaction carries the same weight. FIG. 6 is a flowchart of the transaction original similarity.

For example, if the item set of the i-th transaction Ti is {a, b, c, d, e, f}, each item has a weight of $\frac{1}{6}$. Using Equation 5, the equal weight of each item is multiplied by its 1-item similarity to give that item's contribution within the transaction item set, and the sum of these contributions is the original similarity of the transaction record.

Computing the original similarity of transaction (TID4, UID1) from Tables 6 and 7, each item has the same weight, $\frac{1}{3}$, based on the size of the transaction's item set, 3 (milk, egg, coffee). The items 'milk' and 'egg' are not generalized and keep their original form, so their 1-item original similarity is 1. The item 'coffee' is generalized to the upper item 'beverage', so its original similarity is $\frac{4}{6}$. The record original similarity of transaction (TID4, UID1) is therefore $\frac{1}{3}\left(1 + 1 + \frac{4}{6}\right) = \frac{8}{9}$. Following this calculation, the transaction record original similarity can be expressed as Equation 5 below.

Referring to Equation 5 above, I is the whole domain and T is the size of the transaction's item set. Equation 5 applies the same weight to each 1-item original similarity and sums the weighted values over all items.

In conclusion, the original similarity of the whole de-identified transaction table is the similarity between the original record set and the de-identified result record set. For each record in the de-identified result record set, the record similarity with its original record is computed, and these values are averaged. The result is used as a quality index of statistical similarity and utility for the de-identification.
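The equal-weight sum per record and the average over records can be sketched as follows. This is a minimal illustration, not the patent's Equation 5: the function names are hypothetical, and the 1-item similarities are taken as given inputs (for example 4/6 for an item generalized to a node covering 2 of 6 leaf items).

```python
from fractions import Fraction


def record_similarity(item_similarities):
    """Equal-weight sum of the 1-item similarities of one transaction record.

    Each item of the original record gets weight 1 / (number of items), so a
    pruned item simply contributes a similarity of 0.
    """
    if not item_similarities:
        raise ValueError("a record needs at least one item")
    weight = Fraction(1, len(item_similarities))
    return sum(weight * s for s in item_similarities)


def dataset_similarity(record_similarities):
    """Average of the record similarities over the de-identified record set."""
    if not record_similarities:
        raise ValueError("the record set must not be empty")
    return sum(record_similarities) / len(record_similarities)


# Three items: two kept as-is (similarity 1) and one generalized (similarity 4/6).
example = record_similarity([1, 1, Fraction(4, 6)])
```

With those inputs, `example` equals 8/9, matching a record in which one of three items is generalized to a two-leaf node in a six-item domain. Using `Fraction` keeps the arithmetic exact; plain floats would work equally well if exactness is not needed.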

FIG. 7 is a flowchart illustrating a method of measuring the quality of de-identified personal information data according to another embodiment of the present invention. The method may be performed by a computing device, and descriptions that duplicate the detailed description of the operations performed by the apparatus for measuring the quality of de-identified personal information data are omitted.

In step S710, the computing device verifies whether re-identification is possible using the individual redundancy model. This model verifies whether, in a transaction database holding personal information for a plurality of individuals, a specific individual cannot be identified when the same item sets appear in the transactions of at least a minimum number of different individuals.

Step S710 uses the individual redundancy count, which is the total number of individuals who generated transactions containing a given item set of personal information, as the notion of support. It finds the individual-redundancy-based frequent item sets, generates an individual-redundancy-based table, and verifies the individual redundancy against that table according to the individual redundancy model.

Counting the records flagged by individual redundancy verification can be expressed in pseudocode as follows.

for each item set i_n in T
    RC.count ← 0
    if (for all record of PFI.ps p)
        for n = 0; n < T.item.size; n++ do
            if i_n … PFI of the record then
            end if
            if i_n … PFI of the record then
                RC.count++ // Uniqueness record count
            end if
        end for
return RC.count

In the individual redundancy verification algorithm, T is the transaction table under inspection, PFI is the individual-redundancy-based frequent item set table, and p is the individual minimum support. The algorithm takes T, PFI, and p as input and outputs the p-redundancy verification record count (RC).
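One way to read the algorithm is sketched below in Python. It is an interpretation under stated assumptions, not the patent's procedure: the relation symbols in the pseudocode are not recoverable here, so the sketch assumes that a record is counted when any of its item sets is shared by fewer than p distinct individuals. It also enumerates every non-empty subset of each transaction, which is exponential and suitable only for small examples. All names (`personal_support`, `count_violating_records`, `TABLE_6`) are hypothetical.

```python
from itertools import combinations

# (user_id, items) pairs, taken from the example original transaction table (Table 6).
TABLE_6 = [
    ("UID1", {"milk", "egg", "butter", "bread"}),
    ("UID2", {"milk", "egg", "bread"}),
    ("UID3", {"egg", "milk"}),
    ("UID1", {"milk", "egg", "coffee"}),
    ("UID2", {"bread", "ramen"}),
    ("UID1", {"bread", "butter"}),
]


def item_subsets(items):
    """Yield every non-empty subset of a transaction's items as a frozenset."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            yield frozenset(subset)


def personal_support(transactions):
    """Map each item set to the set of distinct users whose transactions contain it."""
    users_by_itemset = {}
    for user_id, items in transactions:
        for subset in item_subsets(items):
            users_by_itemset.setdefault(subset, set()).add(user_id)
    return users_by_itemset


def count_violating_records(transactions, p):
    """Count records holding an item set shared by fewer than p distinct users (assumed rule)."""
    support = personal_support(transactions)
    return sum(
        1
        for _user_id, items in transactions
        if any(len(support[subset]) < p for subset in item_subsets(items))
    )
```

With p = 2, the sketch flags the four records that contain 'butter', 'coffee', or 'ramen', since each of those items appears in the transactions of only one individual. Counting distinct users rather than transactions is what separates this support from ordinary frequent-itemset support: 'butter' occurs in two transactions but both belong to UID1.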

In individual redundancy verification, when the individual redundancy model is 1, the individual minimum support is set to 1 and the re-identification risk is computed from the records at risk of re-identification, namely those containing an item set that does not exceed the individual minimum support. When the individual redundancy model is 2 or more, the possibility of inferring a specific individual is evaluated as a probability, and the individual inference possibility is computed from the individual-inference risk over the transaction database.

In step S720, the computing device measures the original similarity using a utility quality index that numerically measures the statistical similarity between the original records and the de-identified records.

The original similarity evaluates how similar the de-identified transaction data is to the original transaction data. If the two are too similar, a specific individual's information may be re-identified; if they differ too much, the data being used has poor statistical similarity to the original.

The step of measuring the original similarity using the utility quality index (S720) removes records that violate the de-identification requirement while converting the original data into de-identified data, which yields fewer de-identified records than original records, and computes the retention rate from the number of de-identified records and the number of original records.

The step of measuring the original similarity using the utility quality index (S720) also forms a 1-item original similarity for each item of a transaction record and computes the original similarity of the transaction record by comparing the per-item original similarity with the original data set. The 1-item original similarity is computed as the number of child nodes in the item hierarchy information relative to the domain size of all items, and the original similarity is inversely related to the information loss.

For the original similarity of a transaction record, each item in the transaction is weighted according to the total number of items, and the weight is applied to its 1-item original similarity. The original similarity of the transaction records represents the similarity between the original record set and the de-identified result record set: the similarity with the original record is computed for each record of the de-identified result record set, the similarities are averaged to yield the resulting similarity, and the resulting similarity is used as a quality index of statistical similarity and utility for the de-identification.

Although FIG. 7 describes the processes as being executed sequentially, this is only an example. A person skilled in the art could apply various modifications and variations without departing from the essential characteristics of the embodiments of the present invention, for example by changing the order shown in FIG. 7, executing one or more processes in parallel, or adding other processes.

FIG. 8 is a block diagram illustrating a computing environment that includes a computing device suitable for use in the example embodiments. In the illustrated embodiment, each component may have functions and capabilities other than those described below, and additional components beyond those described below may be included.

The illustrated computing environment includes an apparatus 10 for measuring the quality of de-identified personal information data. In one embodiment, the apparatus 10 may be any type of computing device that exchanges signals with other terminals.

The apparatus 10 for measuring the quality of de-identified personal information data includes at least one processor 810, a computer-readable storage medium 820, and a communication bus 860. The processor 810 may cause the apparatus 10 to operate according to the example embodiments described above. For example, the processor 810 may execute one or more programs stored in the computer-readable storage medium 820. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 810, cause the apparatus 10 to perform operations according to the example embodiments.

The computer-readable storage medium 820 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 830 stored in the computer-readable storage medium 820 includes a set of instructions executable by the processor 810. In one embodiment, the computer-readable storage medium 820 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination of the two), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the apparatus 10 and can store the desired information, or a suitable combination of these.

The communication bus 860 interconnects the various other components of the apparatus 10 for measuring the quality of de-identified personal information data, including the processor 810 and the computer-readable storage medium 820.

The apparatus 10 for measuring the quality of de-identified personal information data may also include one or more input/output interfaces 840, which provide an interface for one or more input/output devices (not shown), and one or more communication interfaces 850. The input/output interface 840 and the communication interface 850 are connected to the communication bus 860. An input/output device (not shown) may be connected to the other components of the apparatus 10 through the input/output interface 840. Example input/output devices include input devices such as a pointing device (a mouse or trackpad), a keyboard, a touch input device (a touchpad or touchscreen), a voice or sound input device, various kinds of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An example input/output device (not shown) may be included inside the apparatus 10 as one of its components, or may be connected to the computing device as a separate device distinct from the apparatus 10.

Operations according to the present embodiments may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. A computer-readable medium is any medium that participates in providing instructions to a processor for execution. The computer-readable medium may include program instructions, data files, data structures, or a combination of these; examples include magnetic media, optical recording media, and memory. A computer program may also be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments can be readily inferred by programmers in the technical field to which the embodiments belong.

The present embodiments are intended to illustrate the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present embodiment.

10: Apparatus for measuring the quality of de-identified personal information data
100: Re-identification verification unit
200: Original similarity measurement unit

Claims

A method of measuring the quality of de-identified personal information data, performed by a computing device, the method comprising:
verifying a risk of re-identification using an individual redundancy model, which is a model that verifies whether a specific individual can be identified when the same item set appears in the transactions of at least a minimum number of different individuals in a transaction database containing personal information on a plurality of individuals; and
measuring an original similarity using a utilization quality index that numerically measures the statistical similarity with respect to the difference between original records and de-identified records,
wherein measuring the original similarity using the utilization quality index includes forming a 1-item original similarity for each item of a transaction record and calculating the original similarity of the transaction record by comparing the per-item original similarity with the original data set, and
wherein the 1-item original similarity is calculated from the number of child nodes representing item hierarchy information relative to the domain size of all items, and the original similarity and the information loss are inversely related.

The method of claim 1,
wherein verifying the risk of re-identification using the individual redundancy model comprises:
generating an individual-redundancy-based table by finding frequent item sets based on individual redundancy, using as the support measure the individual redundancy count, which is the total number of individuals who generated transactions in which the personal information items appear; and
verifying the individual redundancy according to the individual redundancy model based on the individual-redundancy-based table.
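The table-generation step above can be illustrated with a minimal Python sketch. It assumes each transaction is a `(person_id, items)` pair and counts, for every item set up to a small size, the number of distinct individuals whose transactions contain it. The function name `individual_redundancy_table`, the `max_size` limit, and the brute-force enumeration are illustrative assumptions; the patent does not specify the mining algorithm, and a real implementation would likely use an Apriori-style search.

```python
from collections import defaultdict
from itertools import combinations


def individual_redundancy_table(transactions, min_support, max_size=2):
    """Map each item set to the number of distinct individuals that produced it.

    transactions: iterable of (person_id, items) pairs (assumed layout).
    min_support:  minimum number of distinct individuals for an item set to be kept.
    max_size:     largest item-set size enumerated (hypothetical limit for brevity).
    """
    owners = defaultdict(set)  # item set -> set of person ids
    for person_id, items in transactions:
        unique_items = sorted(set(items))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique_items, size):
                owners[itemset].add(person_id)
    # Support is the count of individuals, not the count of transactions.
    return {
        itemset: len(people)
        for itemset, people in owners.items()
        if len(people) >= min_support
    }


transactions = [
    ("p1", ["milk", "bread"]),
    ("p1", ["milk"]),
    ("p2", ["milk", "bread"]),
    ("p3", ["bread", "eggs"]),
]
table = individual_redundancy_table(transactions, min_support=2)
```

Note that `p1` contributes only once to the support of `("milk",)` even though the item appears in two of that person's transactions, which is the difference between individual redundancy and ordinary transaction-count support.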

The method of claim 2,
wherein verifying the individual redundancy further comprises:
when the individual redundancy model is 1, setting the minimum individual support to 1 and calculating a re-identification risk based on records at risk of re-identification, namely records having an item set that does not exceed the minimum individual support; and
when the individual redundancy model is 2 or more, calculating an individual inference possibility through the individual inference risk for the transaction database by evaluating the probability that a specific individual can be inferred.
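For the case where the model value is 1, the following Python sketch shows one way the at-risk records could be identified. It is a simplification under stated assumptions: only single items are checked rather than arbitrary item sets, and the risk is reported as the fraction of records that contain an item owned by no more than `min_individual_support` individuals. The function name and the ratio definition are hypothetical; the claim does not give the exact formula.

```python
from collections import defaultdict


def reidentification_risk(transactions, min_individual_support=1):
    """Return (risk, at_risk_records) for a list of (person_id, items) pairs.

    A record is treated as at risk when it contains an item whose individual
    redundancy does not exceed min_individual_support (assumed reading).
    """
    if not transactions:
        raise ValueError("transactions must not be empty")
    owners = defaultdict(set)  # item -> set of person ids
    for person_id, items in transactions:
        for item in set(items):
            owners[item].add(person_id)
    at_risk = [
        (person_id, items)
        for person_id, items in transactions
        if any(len(owners[item]) <= min_individual_support for item in set(items))
    ]
    # Assumed definition: share of records that could single out an individual.
    return len(at_risk) / len(transactions), at_risk


transactions = [
    ("p1", ["milk", "bread"]),
    ("p1", ["milk"]),
    ("p2", ["milk", "bread"]),
    ("p3", ["bread", "eggs"]),
]
risk, at_risk = reidentification_risk(transactions)
```

In this toy data set only the record containing `eggs` is flagged, because that item appears in the transactions of a single individual.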

The method of claim 1,
wherein measuring the original similarity using the utilization quality index includes calculating a residual rate from the number of de-identified records and the number of original records, where records that violate the de-identification process are removed while the original data is converted into de-identified data, so that the number of de-identified records is smaller than the number of original records.
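As a small illustration, the residual rate can be sketched in Python as the ratio of surviving records to original records. The ratio form and the function name `residual_rate` are assumptions; the claim states only that the rate is calculated from the two counts.

```python
def residual_rate(original_count, deidentified_count):
    """Fraction of original records that survive de-identification (assumed ratio)."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    if not 0 <= deidentified_count <= original_count:
        raise ValueError("deidentified_count must be between 0 and original_count")
    return deidentified_count / original_count


# Hypothetical counts: 60 of 1000 records were removed during de-identification.
rate = residual_rate(original_count=1000, deidentified_count=940)
```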

(Deleted)

The method of claim 1,
wherein the original similarity of the transaction record is calculated by assigning to the 1-item original similarity of each item in the transaction a weight based on the total number of items.

The method of claim 7,
wherein the original similarity of the transaction record represents the similarity between the original record set and the de-identified result record set, and a result similarity is calculated by computing, for each record of the de-identified result record set, its similarity to the original record and averaging these similarities, and
wherein the result similarity is used as the statistical similarity for the de-identification measure and as the utilization quality index.
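The similarity calculation described in these claims can be sketched in Python as three nested averages. The sketch rests on several assumptions that are not confirmed by the claims: each published item is represented by the number of child nodes beneath it in the item hierarchy (0 for an unmodified item), the 1-item original similarity is taken as `1 - child_count / domain_size`, and each item in a record receives an equal weight of one over the number of items. All function names are hypothetical.

```python
def one_item_similarity(child_count, domain_size):
    """Similarity of one published item to its original (assumed formula).

    child_count: child nodes under the published node; 0 means the item is unchanged.
    domain_size: total number of items in the domain.
    """
    if domain_size <= 0 or not 0 <= child_count <= domain_size:
        raise ValueError("require 0 <= child_count <= domain_size and domain_size > 0")
    return 1.0 - child_count / domain_size


def record_similarity(child_counts, domain_size):
    """Equal-weight average of the 1-item similarities in one transaction record."""
    if not child_counts:
        raise ValueError("a record must contain at least one item")
    weight = 1.0 / len(child_counts)  # assumed weight: 1 / number of items
    return sum(weight * one_item_similarity(c, domain_size) for c in child_counts)


def result_similarity(records, domain_size):
    """Average record similarity over the de-identified result record set."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(record_similarity(r, domain_size) for r in records) / len(records)


# Hypothetical data: a domain of 8 items and three de-identified records.
records = [[0, 0], [2, 0], [8, 4]]
score = result_similarity(records, domain_size=8)
```

Under this reading, a record whose items are all unchanged scores 1.0, and a record generalized all the way to the hierarchy root scores 0.0, so higher scores indicate less information loss.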

An apparatus for measuring the quality of de-identified personal information data, the apparatus comprising:
a re-identification verification unit that verifies a risk of re-identification using an individual redundancy model, which is a model that verifies whether a specific individual can be identified when the same item set appears in the transactions of at least a minimum number of different individuals in a transaction database containing personal information on a plurality of individuals; and
an original similarity measurement unit that measures an original similarity using a utilization quality index that numerically measures the statistical similarity with respect to the difference between original records and de-identified records,
wherein the original similarity measurement unit includes an original similarity calculator that forms a 1-item original similarity for each item of a transaction record and calculates the original similarity of the transaction record by comparing the per-item original similarity with the original data set, and
wherein the 1-item original similarity is calculated from the number of child nodes representing item hierarchy information relative to the domain size of all items, and the original similarity and the information loss are inversely related.

The apparatus of claim 9,
wherein the original similarity measurement unit includes a residual rate calculation unit that calculates a residual rate from the number of de-identified records and the number of original records, where records that violate the de-identification process are removed while the original data is converted into de-identified data, so that the number of de-identified records is smaller than the number of original records.

(Deleted)

The apparatus of claim 9,
wherein the original similarity of the transaction record is calculated by assigning to the 1-item original similarity of each item in the transaction a weight based on the total number of items.

The apparatus of claim 9,
wherein the original similarity of the transaction record represents the similarity between the original record set and the de-identified result record set, and a result similarity is calculated by computing, for each record of the de-identified result record set, its similarity to the original record and averaging these similarities, and
wherein the result similarity is used as the statistical similarity for the de-identification measure and as the utilization quality index.