KR20210109299A - Method for distributed de-identification of large graph data

Info

Publication number: KR20210109299A
Application number: KR1020200024401A
Authority: KR
Inventors: 임동혁; 전민혁
Original assignee: 호서대학교 산학협력단
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2021-09-06
Also published as: KR102405084B1

Abstract

The present invention relates to distributed de-identification of large-scale graph data. More specifically, it provides a distributed-processing-based de-identification method that prevents personal information from being learned from the large RDF graph data used for public data, personal medical data, and social relationship data.

Description

{Method for distributed de-identification of large graph data}

The present invention relates to distributed de-identification of large-scale graph data and, more particularly, to a distributed-processing-based de-identification method that prevents the personal information contained in large RDF graph data, as used for public data, personal medical data, and social relationship data, from being known.

Large volumes of data in RDF format are increasingly being created and distributed on the web. RDF is widely used as a format for government public data, personal medical data, social relationship data, and the like, but such data contains personal information and therefore cannot be distributed as it is. Distribution thus requires a de-identification process that removes individuals' identities from the personal information contained in the RDF data.

Models such as k-anonymity, l-diversity, and t-closeness have been studied for de-identification. For RDF graph data, however, k-anonymity is the only de-identification model that has been applied. The present invention therefore focuses on the l-diversity model, which is stronger than k-anonymity, to protect personal information. Among the algorithms that implement de-identification models, the Anatomy algorithm is widely used to implement l-diversity and, because it involves no generalization or categorization, preserves high data utility. However, Anatomy is difficult to apply to RDF data, whose data structure is different, and it takes a very long time to process large-scale data.

Prior art document: KR 10-2015-0120443 A

The present invention was devised to solve these problems. Its object is to provide a distributed-processing-based de-identification method for large-scale graph data by applying the l-diversity model, implemented with the Anatomy algorithm, to RDF data.

To achieve this object, a distributed de-identification method for large-scale graph data according to the present invention comprises the steps of: (a) converting RDF data into an attribute table of a relational database (RDB); (b) classifying the attributes of the table converted in step (a) into identifiers (ID), quasi-identifiers (QI), and sensitive attributes (SA), and then removing the identifier (ID) attribute table; (c) classifying the quasi-identifier attribute data into sensitive attribute buckets (SABuckets) using attribute keys that have the l-diversity values of the sensitive attribute (SA) classified in step (b), and assigning the bucketed data to groups; and (d) merging the groups designated in step (c) with the quasi-identifier (QI) table and with the sensitive attribute table, respectively, and then combining the results.

The method further includes, before step (a), dividing the RDF data into one or more partitions.

The group designation in step (c) comprises the steps of: (c1) moving quasi-identifier attribute data from the sensitive attribute buckets (SABuckets) into a group sequentially, one item at a time; (c2) determining whether the quasi-identifier attribute data moved in step (c1) amounts to at least the l-diversity value; (c3) creating the next group when the determination shows that the quasi-identifier attribute data amounts to at least the l-diversity value; and (c4) after steps (c1) to (c2) have been repeated, if fewer quasi-identifier attribute data items than the l-diversity value remain in the sensitive attribute buckets, adding those remaining items to arbitrary groups.

The quasi-identifier (QI) is an attribute that can identify a specific individual when combined with two or more other attribute values.

The sensitive attribute (SA) is a specific attribute value that can become the target of an attack or of identification.

Another aspect of the present invention for achieving this object is a distributed de-identification apparatus for large-scale graph data, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions stored in the at least one memory cause the at least one processor to execute the steps of: (a) converting RDF data into an attribute table of an RDB; (b) classifying the attributes of the table converted in step (a) into identifiers (ID), quasi-identifiers (QI), and sensitive attributes (SA), and then removing the identifier (ID) attribute table; (c) classifying the quasi-identifier attribute data into sensitive attribute buckets (SABuckets) using attribute keys that have the l-diversity values of the sensitive attribute (SA) classified in step (b), and then assigning the bucketed data to groups; and (d) merging the groups designated in step (c) with the quasi-identifier (QI) table and with the sensitive attribute table, respectively, and then combining the results.

Yet another aspect of the present invention for achieving this object is a computer program for distributed de-identification of large-scale graph data, stored in a non-transitory storage medium and comprising instructions that cause a processor to execute the steps of: (a) converting RDF data into an attribute table of an RDB; (b) classifying the attributes of the table converted in step (a) into identifiers (ID), quasi-identifiers (QI), and sensitive attributes (SA), and then removing the identifier (ID) attribute table; (c) classifying the quasi-identifier attribute data into sensitive attribute buckets (SABuckets) using attribute keys that have the l-diversity values of the sensitive attribute (SA) classified in step (b), and then assigning the bucketed data to groups; and (d) merging the groups designated in step (c) with the quasi-identifier (QI) table and with the sensitive attribute table, respectively, and then combining the results.

According to the present invention, the Anatomy algorithm is applied to RDF data, whose data structure is different, and the time required to process large-scale data is reduced.

FIG. 1 is a block diagram showing a distributed de-identification apparatus for large-scale graph data according to the present invention.
FIG. 2 is a flowchart showing a distributed de-identification method for large-scale graph data according to the present invention.
FIG. 3 is a flowchart showing how groups are designated in the distributed de-identification method of FIG. 2.
FIG. 4 is a diagram for explaining step S110 of the distributed de-identification method of FIG. 2.
FIG. 5 is a diagram for explaining steps S120 to S130 of the distributed de-identification method of FIG. 2.
FIG. 6 is a diagram for explaining step S140 of the distributed de-identification method of FIG. 2.
FIG. 7 is a diagram for explaining step S150 of the distributed de-identification method of FIG. 2.
FIGS. 8 to 10 are diagrams for explaining the experimental evaluation that verifies the distributed de-identification of large-scale graphs according to the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical idea, and it should be understood that various equivalents and modifications that could replace them may exist at the time of filing.

FIG. 1 is a block diagram showing a distributed de-identification apparatus for large-scale graph data according to the present invention. The distributed de-identification apparatus (100) comprises a processor (110), a non-volatile storage unit (120) that stores programs and data, a volatile memory (130) that stores running programs, a communication unit (140) for communicating with other devices, and a bus that serves as the internal communication path between these components. Running programs may include device drivers, an operating system, and various applications; in the present invention, the apparatus is a computer device on which a de-identification application (310) runs. Although not shown, the computer device also includes a power supply such as a battery.

FIG. 2 is a flowchart showing a distributed de-identification method for large-scale graph data according to the present invention.

First, the RDF data is divided into one or more partitions for distributed processing (S100).

The RDF data divided into one or more partitions in step S100 is then converted into an attribute table of an RDB (S110). RDF is fundamentally described by a triple model of subject, predicate, and object. The subject is the data to be expressed, and the predicate describes the subject or expresses the relationship between the subject and the object. The object is the content or value of the predicate. Each of these elements can be described by a URI.

FIG. 4 is a diagram for explaining step S110 of FIG. 2. The left side shows the RDF data, and the right side shows the converted RDB attribute table. As shown in FIG. 4, in the RDF data on the left the subject is 'patient'; the predicates are 'name', 'resident registration number', 'gender', 'address', and 'disease name'; and the objects are 'Moon Chul-sik', '630128-0050352', 'male', 'Seongnam-si, Gyeonggi-do', and 'diabetes'. The right side shows this RDF data converted into an attribute table. The conversion of RDF data into an RDB attribute table can be performed with the Apache Jena framework; in the present invention, Jena assists in converting RDF triples into an RDB-like attribute table.
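For illustration only, the pivot performed in step S110 can be sketched in plain Python. This is not the Jena-based conversion of the embodiment; the subject key `patient1`, the English predicate names, and the function `triples_to_property_table` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical triples mirroring the FIG. 4 example; the subject key and
# predicate names are illustrative assumptions, not taken from the figure.
triples = [
    ("patient1", "name", "Moon Chul-sik"),
    ("patient1", "residentRegistrationNumber", "630128-0050352"),
    ("patient1", "gender", "male"),
    ("patient1", "address", "Seongnam-si, Gyeonggi-do"),
    ("patient1", "diseaseName", "diabetes"),
]


def triples_to_property_table(triples):
    """Pivot (subject, predicate, object) triples into one row per subject."""
    rows = defaultdict(dict)
    for subject, predicate, obj in triples:
        # Each predicate becomes a column; each object becomes the cell value.
        rows[subject][predicate] = obj
    return dict(rows)


table = triples_to_property_table(triples)
```

Each subject becomes one row whose columns are the predicates, which is the shape the later steps (classification into ID, QI, and SA) operate on.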

Next, the attribute table converted in step S110 is divided into identifier (ID), quasi-identifier (QI), and sensitive attribute (SA) tables (S120).

Of the identifier (ID), quasi-identifier (QI), and sensitive attribute (SA) tables separated in step S120, the identifier (ID) table is then removed (S130). FIG. 5 is a diagram to help explain steps S120 to S130. It shows that the attributes 'name' and 'resident registration number' are classified as identifiers (ID), 'gender' and 'address' as quasi-identifiers (QI), and 'disease name' as a sensitive attribute (SA), and that the identifiers (ID) are removed from the attribute table. An identifier (ID) is any attribute that can identify an individual, that is, any information with a one-to-one correspondence to a person. Examples include resident registration number, phone number, email address, name, account number, NRI photos, and genetic information. Encrypted values are also classified as identifiers, and identifiers must always be deleted during de-identification. A quasi-identifier (QI) is not an identifier by itself, but is an attribute that can be used to infer a specific individual indirectly when combined with other data. Examples include city of residence, weight, blood type, and gender. Finally, a sensitive attribute (SA) is an attribute that can reveal an individual's private life, such as disease name, deposit balance, or card payment amount. It is usually the target attribute measured in data analysis, and most modern de-identification techniques preserve its data values; it is a specific attribute value that can become the target of an attack or of identification.
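Steps S120 and S130 can be sketched as a simple role lookup followed by a filter. The sketch below is a minimal illustration under stated assumptions: the `ROLES` mapping follows the FIG. 5 example, while the English attribute names and the function `drop_identifiers` are hypothetical.

```python
# Hypothetical role assignment following the FIG. 5 example.
ROLES = {
    "name": "ID",
    "residentRegistrationNumber": "ID",
    "gender": "QI",
    "address": "QI",
    "diseaseName": "SA",
}


def drop_identifiers(row, roles=ROLES):
    """Split one attribute-table row into QI and SA parts, discarding ID attributes."""
    qi = {k: v for k, v in row.items() if roles.get(k) == "QI"}
    sa = {k: v for k, v in row.items() if roles.get(k) == "SA"}
    return qi, sa


row = {
    "name": "Moon Chul-sik",
    "residentRegistrationNumber": "630128-0050352",
    "gender": "male",
    "address": "Seongnam-si, Gyeonggi-do",
    "diseaseName": "diabetes",
}
qi, sa = drop_identifiers(row)
```

Attributes without a role are dropped as well in this sketch, which errs on the side of not releasing unclassified data; that choice is an assumption, not something the description specifies.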

Next, the quasi-identifier attribute data is classified into sensitive attribute buckets (SABuckets) using attribute keys that have the l-diversity values of the sensitive attribute (SA) classified in step S120, and the bucketed data is assigned to groups (S140). FIG. 6 is a diagram to help explain step S140. As shown, the attribute keys having the l-diversity values are 'AIDS', 'osteoporosis', 'cancer', and 'diabetes'. The quasi-identifier attribute data is divided into SABuckets according to these attribute keys, and the quasi-identifier attribute data in the SABuckets is then moved into group 1 and group 2. The group designation process of step S140 is described again below with reference to FIG. 3.
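The bucketing half of step S140 amounts to grouping records by their sensitive value. The following is a minimal single-machine sketch; the sample records, the key name `disease`, and the function `build_sa_buckets` are assumptions for illustration and are not taken from FIG. 6.

```python
from collections import defaultdict


def build_sa_buckets(records, sa_key):
    """Group the QI part of each record under its sensitive attribute value."""
    buckets = defaultdict(list)
    for record in records:
        # Keep only the quasi-identifier columns inside the bucket.
        qi_part = {k: v for k, v in record.items() if k != sa_key}
        buckets[record[sa_key]].append(qi_part)
    return dict(buckets)


# Hypothetical records with identifiers already removed.
records = [
    {"gender": "male", "address": "Seongnam", "disease": "diabetes"},
    {"gender": "female", "address": "Seoul", "disease": "cancer"},
    {"gender": "male", "address": "Busan", "disease": "cancer"},
    {"gender": "female", "address": "Asan", "disease": "AIDS"},
    {"gender": "female", "address": "Daejeon", "disease": "osteoporosis"},
]
buckets = build_sa_buckets(records, "disease")
```

Here every distinct disease name becomes one bucket key, so the number of buckets equals the number of distinct sensitive values in the data.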

Next, the groups designated in step S140 are merged with the quasi-identifier table and with the sensitive attribute table of step S120, respectively (yielding the QIT and the ST), and the results are then combined (S150). Here the groups are called equivalence classes (EC). FIG. 7 is a diagram for explaining step S150. Referring to FIG. 7, the equivalence classes and their group numbers are merged into a single quasi-identifier table (QIT; QI Table). Likewise, the group numbers, the sensitive attributes, and the number of sensitive attribute occurrences in each group are combined into a sensitive attribute table (ST; SA Table). In the resulting QIT and ST, only the identifiers (ID) of the RDF data from step S110 are lost; unlike other de-identification models, the QIT is not generalized or categorized and is identical to the original. A specific individual cannot be identified from the QIT and the ST. As in FIG. 6 described above, group 1 contains two distinct sensitive attribute values and group 2 contains three, so the condition of the l-diversity model is satisfied. As shown at the bottom of FIG. 7, the QIT and the ST are distributed in Turtle format using blank nodes via the Apache Jena framework, so each group can be identified and referenced when the data is used later.
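The construction of the QIT and the ST in step S150, together with the l-diversity check on the result, can be sketched as follows. This is an illustrative sketch only; the input shape (a list of groups, each a list of `(qi_record, sa_value)` pairs), the column names `group`, `sa`, and `count`, and both function names are assumptions.

```python
from collections import Counter


def build_qit_and_st(groups):
    """Build the QI table and the SA table from groups of (qi_record, sa_value) pairs."""
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        for qi_record, _ in group:
            # QIT row: the untouched QI values plus the group number.
            qit.append({**qi_record, "group": group_id})
        for sa_value, count in Counter(sa for _, sa in group).items():
            # ST row: group number, sensitive value, and its count in the group.
            st.append({"group": group_id, "sa": sa_value, "count": count})
    return qit, st


def is_l_diverse(st, l):
    """Check that every group has at least l distinct sensitive values."""
    distinct_per_group = Counter(row["group"] for row in st)
    return all(n >= l for n in distinct_per_group.values())


groups = [
    [({"gender": "male"}, "cancer"), ({"gender": "female"}, "AIDS")],
    [({"gender": "female"}, "osteoporosis"), ({"gender": "male"}, "cancer"),
     ({"gender": "male"}, "diabetes")],
]
qit, st = build_qit_and_st(groups)
```

Because the QIT keeps the quasi-identifier values unchanged and the ST only links sensitive values to a group number, the two tables can be joined on the group number but not on an individual record.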

FIG. 3 is a flowchart showing the group designation process that follows the sensitive attribute bucket (SABucket) classification in step S140 of FIG. 2.

Referring to FIG. 3, quasi-identifier attribute data is first moved from the sensitive attribute buckets (SABuckets) into a group sequentially, one item at a time (S141). It is then determined whether the quasi-identifier attribute data moved in step S141 amounts to at least the l-diversity value (S142).

If the determination (S142) shows that the quasi-identifier attribute data amounts to at least the l-diversity value, the next group is created (S143).

After steps S141 to S143 have been repeated, if fewer quasi-identifier attribute data items than the l-diversity value remain in the sensitive attribute buckets, those remaining items are added to arbitrary groups (S144).
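Steps S141 to S144 can be sketched on a single machine as shown below. This is a minimal sketch, not the Spark implementation of the embodiment: drawing from the l largest buckets follows the sorting step of the pseudocode in the Example, while the function name `assign_groups`, the `seed` parameter, and the error raised when no group can be formed are assumptions.

```python
import random


def assign_groups(buckets, l, seed=0):
    """Form groups of l records with distinct sensitive values, then place leftovers.

    buckets maps each sensitive value to a list of QI records.
    Returns a list of groups, each a list of (qi_record, sa_value) pairs.
    """
    rng = random.Random(seed)
    work = {sa: list(rows) for sa, rows in buckets.items()}  # do not mutate input
    groups = []
    # S141-S143: while l non-empty buckets remain, take one record from each
    # of the l largest buckets to build the next group.
    while sum(1 for rows in work.values() if rows) >= l:
        non_empty = [sa for sa in work if work[sa]]
        largest = sorted(non_empty, key=lambda sa: len(work[sa]), reverse=True)[:l]
        groups.append([(work[sa].pop(0), sa) for sa in largest])
    if not groups:
        raise ValueError("fewer than l distinct sensitive values; no group formed")
    # S144: records left over are added to arbitrary (random) groups.
    for sa, rows in work.items():
        for record in rows:
            rng.choice(groups).append((record, sa))
    return groups


buckets = {
    "AIDS": [{"address": "Asan"}],
    "osteoporosis": [{"address": "Daejeon"}],
    "cancer": [{"address": "Seoul"}, {"address": "Busan"}],
    "diabetes": [{"address": "Seongnam"}],
}
groups = assign_groups(buckets, l=2)
```

With these sample buckets and l = 2, two groups are formed from four records and the one remaining record is appended to a randomly chosen group, so every record ends up in exactly one group.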

The following is an [Example] of the present invention. The [Example] uses Spark. Building the Anatomy algorithm as in the present invention is similar to building the conventional Anatomy algorithm, but additionally requires the use of RDDs (Resilient Distributed Datasets). The [Example] gives the pseudocode of the algorithm implemented with RDDs.

[Example]

Data: RDF file on HDFS, value L

Result: de-identified RDF file

// Lines 1-6 convert the RDF into tuples.
1. Divide the RDF into at least as many partitions as there are Executors.
2. tripleList = list of tuples converted from the RDF triples.
3. tripleRDD = parallelized tripleList.
4. mapTripleRDD = tripleRDD mapped to {SA, SA's tuple}.
5. SASet = set of the SA values in mapTripleRDD.
6. SABucket; bucketCnt = 0; groupCnt = 0.

// Lines 7-9 create the buckets, one per SA value.
7. For each loopSA in SASet
8.     SABucket[bucketCnt] = values of mapTripleRDD, filtered to those whose key equals loopSA.
9.     bucketCnt = bucketCnt + 1.

// Lines 10-19 create the groups, which are filled with tuples.
10. While there are at least L non-empty SABuckets
11.     Sort the SABuckets by number of members.
12.     new groupBucket
13.     for idx = 1 to L
14.         tuple = value at index idx of SABucket.
15.         Add the first value of tuple, together with groupCnt, to groupBucket.
16.         Remove the value added to groupBucket from SABucket.
17.     groupCnt = groupCnt + 1.
18. For each bucket in the non-empty SABuckets
19.     Add the first value of bucket, together with a random number within the range of groupCnt, to groupBucket.

// Lines 20-22 split the result into the QIT and the ST.
20. allTuples = parallelized groupBucket.
21. QIT = extract the required attributes from allTuples to create QI partitions, then merge all partitions.
22. ST = extract the required attributes from allTuples to create ST partitions, then merge all partitions.

// Lines 23-24 write the final RDF, consisting of the QIT and the ST.
23. For each cnt in groupCnt
24.     Add the QIT[cnt] and ST[cnt] triples through Jena.

As in the [Example] above, line 1 divides the file into multiple partitions so that multiple Executors can each access a part of it. In lines 2-4, for each "Professor" subject, the objects are combined into a single tuple with the predicates serving as attribute names, creating tripleRDD. Then mapTripleRDD is created, with the SA of each tuple in tripleRDD as the key and the tuple itself as the value. In lines 5-9, SASet is built by removing duplicates from all SA values, and a loop runs over every element of SASet. In this loop, each SA in SASet is called loopSA, and the values whose key in mapTripleRDD matches loopSA are stored as a list at index bucketCnt of SABucket. bucketCnt increases by 1 on each iteration until the loop ends. Lines 10-19 then divide the values in SABucket into groups. Lines 10-17 repeat until the total number of tuples falls below L; in this way all tuples in SABucket are placed into groups in turn and stored under the groupCnt index. Each group contains L tuples. In lines 18-19, the tuples that could not be divided are assigned to random groups. In lines 20-22, the groups formed in this way are split into the QIT and the ST on the basis of quasi-identifiers, sensitive attributes, and group numbers, and each is merged into a single table. Finally, in lines 23-24, Jena is used to write the result as RDF with blank nodes, which ends the process.
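The final write-out with blank nodes can be illustrated by emitting Turtle text directly. The sketch below uses plain string formatting and is not the Jena API; the `ex:` prefix, its namespace URI, the property names `groupNumber`, `hasQI`, `hasSA`, `value`, and `count`, and both function names are hypothetical.

```python
def turtle_literal(value):
    """Render a Python value as a Turtle literal (integers stay unquoted)."""
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def group_to_turtle(group_id, qi_rows, sa_counts):
    """Serialize one group as a labeled blank node with nested anonymous nodes."""
    node = f"_:g{group_id}"
    lines = [f"{node} ex:groupNumber {group_id} ."]
    for row in qi_rows:
        props = " ; ".join(f"ex:{k} {turtle_literal(v)}" for k, v in row.items())
        lines.append(f"{node} ex:hasQI [ {props} ] .")
    for sa_value, count in sa_counts.items():
        lines.append(
            f"{node} ex:hasSA [ ex:value {turtle_literal(sa_value)} ; ex:count {count} ] ."
        )
    return "\n".join(lines)


# Hypothetical namespace; a real deployment would use its own vocabulary.
doc = "@prefix ex: <http://example.org/> .\n" + group_to_turtle(
    1,
    [{"gender": "male", "address": "Seongnam"}],
    {"diabetes": 1, "cancer": 1},
)
```

The labeled blank node `_:g1` plays the role of the group, so the quasi-identifier rows and the sensitive value counts can both be reached from it without any subject URI that would name an individual.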

FIGS. 8 to 10 are diagrams for explaining the experimental evaluation that verifies the distributed de-identification of large-scale graphs according to the present invention.

The experimental evaluation ran the Anatomy algorithm on large-scale RDF data using two setups, InMemory and Spark, and compared their execution times.

The RDF used for the experiment was created with the LUBM data generator, which produces benchmark data. The generator lets the user specify the number of universities to include in the LUBM data. For this experiment, LUBM datasets containing 10, 50, 100, 300, and 700 universities were generated in Turtle format. The sizes and triple counts of these input datasets are shown in FIG. 8.

The hardware was a cluster of five Workers in total, each with 24 GB of memory and an Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10 GHz. For the experiment, only data of the professor types, namely {fullProfessor, AssociateProfessor, assistantProfessor}, was used. The professors' IDs were deleted during de-identification, and the attributes actually used were {name, researchInterest, undergraduateDegreeFrom, masterDegreeFrom, doctorDegreeFrom}.

The experiment compared de-identification in Java InMemory with de-identification on a Java Spark distributed cluster. When the LUBM input was small, InMemory finished faster than Spark. As the LUBM size grew, however, InMemory's computation speed dropped sharply, and from LUBM-300 onward it could not process the data at all because of insufficient hardware resources. The Spark cluster was therefore used to overcome the resource shortage and slowdown of a single machine. Spark consists of a Driver, which manages the whole process, and Executors, which perform the actual tasks; these roles are assigned automatically among the Workers when Spark is launched. The Driver splits every task that arises during execution and assigns the pieces to the Executors. For small data, this allocation time puts Spark at a disadvantage relative to InMemory, but as the data grows, the gain in processing efficiency outweighs the cost of splitting.

FIG. 9 shows the options passed to spark-submit, including the number of CPU cores and the amount of memory to use. As the RDF file grows, only one Executor is run per Worker so that the memory allocated to each Executor is as large as possible, and the Driver uses one entire Worker. When Spark processes the data, the files are first stored in HDFS, the storage system of Hadoop, an HDD-based distributed processing platform, and then accessed. The InMemory system, by contrast, accesses ordinary files, which takes comparatively longer, so to keep the comparison fair the access time is excluded from the result graphs.
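The resource layout described here (one Executor per Worker, with one Worker given over to the Driver) could be expressed roughly as follows. This is only a sketch: the helper `build_spark_submit`, the jar name, the class name, the input path, and the memory and core values are hypothetical placeholders rather than the contents of FIG. 9, and `--num-executors` assumes a cluster manager that honors it, such as YARN.

```python
def build_spark_submit(num_workers, executor_cores, executor_memory,
                       driver_memory, main_class, app_jar, input_path):
    """Assemble a spark-submit argument list (hypothetical helper).

    Assumption: one Worker is reserved for the Driver, so the remaining
    Workers each host exactly one Executor.
    """
    num_executors = num_workers - 1
    return [
        "spark-submit",
        "--class", main_class,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        app_jar,
        input_path,
    ]


# Placeholder values for a five-Worker cluster; not taken from FIG. 9.
command = build_spark_submit(
    num_workers=5,
    executor_cores=4,
    executor_memory="20g",
    driver_memory="20g",
    main_class="example.Deidentify",
    app_jar="deid.jar",
    input_path="hdfs:///data/lubm.nt",
)
print(" ".join(command))
```

Giving each Executor most of a Worker's memory trades parallelism within a machine for headroom per Executor, which matches the stated goal of handling larger RDF files.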

FIG. 10(a) compares InMemory with Spark running on four Workers. As the graph shows, InMemory performs better on small data but cannot process large files even with more than 10 GB of Java heap memory, while Spark does not give stable results on small amounts of data. In (b), where the data is processed with Spark, processing of large data becomes faster as Workers are added. The processing time does not fall in direct proportion to the number of Workers, but the cluster exceeds the performance of a single-CPU system and can scale horizontally. Replacing the Workers with machines of higher specification than those in the experimental cluster would allow even faster de-identification of large RDF data.

This experimental evaluation proposed a platform that implements the Anatomy algorithm of the l-diversity de-identification model on Spark so that large-scale RDF can be handled. The results show that the Spark-based Anatomy algorithm offers a significant advantage on relatively large RDF data sets, with execution time decreasing even as the data size increased.

However, although the Anatomy algorithm was used with the aims of de-identification and of preserving the utility of the final data, it remains vulnerable to several attacks, including inference attacks. To address this, we intend to study a new algorithm that satisfies the t-closeness model while, like Anatomy, preserving as much utility as possible. Research on l-diversity algorithms for data sets that must be modified dynamically is also worthwhile.

As described above, although the present invention has been described with reference to limited embodiments and drawings, it is not limited to them. Those of ordinary skill in the art to which the present invention pertains may of course make various modifications and variations within the technical idea of the present invention and the scope of equivalents of the claims set forth below.

100: distributed processing de-identification device for large-scale graph data
110: processor
120: storage unit
130: memory
140: communication unit
310: de-identification application

Claims

A distributed processing de-identification method for large-scale graph data, the method comprising:
(a) converting RDF data into an attribute table of an RDB;
(b) classifying the attribute table converted in step (a) into an identifier (ID), a quasi-identifier (QI), and a sensitive attribute (SA), and then removing the attribute table of the identifier (ID);
(c) classifying the quasi-identifier attribute data into a sensitive attribute bucket (SABucket) using, as a key, the attribute having the l-diversity value of the sensitive attribute (SA) separated in step (b), and then designating the data as groups; and
(d) combining the groups designated in step (c) into a quasi-identifier (QI) table and a sensitive attribute table, respectively, and then combining them.
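As a rough, single-machine illustration of steps (a), (b), and (d), the sketch below pivots triples into one row per subject, discards the subject ID, and emits two tables linked only by a group number. The function names, the sample triples, and the choice of `undergraduateDegreeFrom` as the quasi-identifier and `researchInterest` as the sensitive attribute are assumptions made for illustration; the grouping of step (c) is taken as given, and nothing here uses Spark.

```python
from collections import Counter, defaultdict


def triples_to_table(triples):
    """Step (a): pivot (subject, predicate, object) triples into rows."""
    table = defaultdict(dict)
    for subject, predicate, obj in triples:
        table[subject][predicate] = obj
    return dict(table)


def drop_identifier(table, qi_attrs, sa_attr):
    """Step (b): keep QI and SA values only; the subject ID is discarded."""
    records = []
    for row in table.values():
        qi = tuple(row.get(attr) for attr in qi_attrs)
        records.append((qi, row.get(sa_attr)))
    return records


def publish(groups):
    """Step (d): build a QI table and an SA table linked by group id."""
    qi_table, sa_table = [], []
    for gid, group in enumerate(groups, start=1):
        for qi, _ in group:
            qi_table.append(qi + (gid,))
        # Per group, publish each sensitive value with its count only.
        for value, count in Counter(sa for _, sa in group).items():
            sa_table.append((gid, value, count))
    return qi_table, sa_table


# Illustrative data; subjects and values are made up.
triples = [
    ("prof1", "undergraduateDegreeFrom", "UnivA"),
    ("prof1", "researchInterest", "Graphs"),
    ("prof2", "undergraduateDegreeFrom", "UnivB"),
    ("prof2", "researchInterest", "Privacy"),
]
table = triples_to_table(triples)
records = drop_identifier(table, ["undergraduateDegreeFrom"], "researchInterest")
qi_table, sa_table = publish([records])  # one group, for illustration only
```

Because the two output tables share only the group number, a reader of the published data can see which sensitive values occur in a group but not which quasi-identifier row each value belongs to.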

The method according to claim 1,
further comprising, before step (a),
partitioning the RDF data into at least one partition.
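The claim does not state how the partitions are formed. One plausible scheme, shown here purely as an assumption, is to hash each triple's subject so that all triples describing the same subject land in the same partition, which keeps the later pivot into attribute rows local to a partition. The function name `partition_by_subject` is hypothetical.

```python
import zlib


def partition_by_subject(triples, num_partitions):
    """Assign each (subject, predicate, object) triple to a partition.

    A CRC32 of the subject is used instead of Python's built-in hash()
    so that the assignment is stable across processes.
    """
    partitions = [[] for _ in range(num_partitions)]
    for triple in triples:
        index = zlib.crc32(triple[0].encode("utf-8")) % num_partitions
        partitions[index].append(triple)
    return partitions


# Illustrative data; subjects and values are made up.
triples = [
    ("prof1", "name", "A"),
    ("prof2", "name", "B"),
    ("prof1", "researchInterest", "Graphs"),
]
partitions = partition_by_subject(triples, num_partitions=2)
```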

The method according to claim 1,
wherein the designation as groups in step (c) comprises:
(c1) sequentially moving quasi-identifier attribute data, one item at a time, from the sensitive attribute bucket (SABucket) to a group;
(c2) determining whether the quasi-identifier attribute data moved in step (c1) amounts to at least the l-diversity value;
(c3) generating a next group when, as a result of the determination, the quasi-identifier attribute data amounts to at least the l-diversity value; and
(c4) when quasi-identifier attribute data amounting to less than the l-diversity value remains in the sensitive attribute bucket after steps (c1) to (c2) are repeated, adding the remaining quasi-identifier attribute data to any group.
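Steps (c1) to (c4) can be sketched on a single machine as below. Two details are assumptions borrowed from the commonly described form of the Anatomy algorithm rather than from the claim: each group draws one record from each of the l fullest buckets, and a leftover record is placed, where possible, in a group that does not yet contain its sensitive value. The function name `form_groups` and the record layout `(qi, sa)` are hypothetical.

```python
from collections import defaultdict


def form_groups(records, l):
    """Group (qi, sa) records so each group has at least l distinct SA values."""
    buckets = defaultdict(list)  # SABucket: one bucket per sensitive value
    for record in records:
        buckets[record[1]].append(record)
    if len(buckets) < l:
        raise ValueError("fewer than l distinct sensitive values")

    groups = []
    while sum(1 for b in buckets.values() if b) >= l:
        # (c1) move one record from each of the l fullest buckets.
        non_empty = [key for key in buckets if buckets[key]]
        fullest = sorted(non_empty, key=lambda k: len(buckets[k]), reverse=True)[:l]
        group = [buckets[key].pop() for key in fullest]
        # (c2)/(c3) the group now holds l records, so close it and move on.
        groups.append(group)

    # (c4) fewer than l non-empty buckets remain: attach the leftovers.
    for bucket in buckets.values():
        for record in bucket:
            target = next(
                (g for g in groups if all(r[1] != record[1] for r in g)),
                groups[0],  # fall back to any group, as the claim allows
            )
            target.append(record)
    return groups


# Illustrative data; values are made up.
records = [
    (("UnivA",), "Graphs"), (("UnivB",), "Graphs"),
    (("UnivC",), "Privacy"), (("UnivD",), "Databases"),
    (("UnivE",), "Privacy"),
]
groups = form_groups(records, l=2)
```

Drawing from the fullest buckets first keeps the bucket sizes balanced, which reduces the number of leftover records that step (c4) has to place.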

The method according to claim 1,
wherein the quasi-identifier (QI) is an attribute that can identify a specific individual when combined with two or more other attribute values.

The method according to claim 1,
wherein the sensitive attribute (SA) is a specific attribute value that can be a target of attack or identification.

A distributed processing de-identification device for large-scale graph data, the device comprising:
at least one processor; and
at least one memory for storing computer-executable instructions,
wherein the computer-executable instructions stored in the at least one memory, when executed by the at least one processor, cause the following steps to be performed:
(a) converting RDF data into an attribute table of an RDB;
(b) classifying the attribute table converted in step (a) into an identifier (ID), a quasi-identifier (QI), and a sensitive attribute (SA), and then removing the attribute table of the identifier (ID);
(c) classifying the quasi-identifier attribute data into a sensitive attribute bucket (SABucket) using, as a key, the attribute having the l-diversity value of the sensitive attribute (SA) separated in step (b), and then designating the data as groups; and
(d) combining the groups designated in step (c) into a quasi-identifier (QI) table and a sensitive attribute table, respectively, and then combining them.

A computer program stored in a non-transitory storage medium for distributed processing de-identification of large-scale graph data,
the computer program including instructions that cause the following steps to be executed:
(a) converting RDF data into an attribute table of an RDB;
(b) classifying the attribute table converted in step (a) into an identifier (ID), a quasi-identifier (QI), and a sensitive attribute (SA), and then removing the attribute table of the identifier (ID);
(c) classifying the quasi-identifier attribute data into a sensitive attribute bucket (SABucket) using, as a key, the attribute having the l-diversity value of the sensitive attribute (SA) separated in step (b), and then designating the data as groups; and
(d) combining the groups designated in step (c) into a quasi-identifier (QI) table and a sensitive attribute table, respectively, and then combining them.