KR102311710B1

KR102311710B1 - Key Generating Apparatus and Method for Combining de-Identification Data Set

Info

Publication number: KR102311710B1
Application number: KR1020170054396A
Authority: KR
Inventors: 심기창; 김동례
Original assignee: (주)이지서티
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2021-10-12
Also published as: KR20180120444A

Abstract

The present invention relates to an apparatus and method for generating a key for combining de-identified data sets. The apparatus comprises: an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database; a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and to which a key for combining de-identified data sets, capable of identifying a row or column, is added for each row or column; and a data management unit that assigns the same data set ID, which identifies the data set, to both the de-identified data set and the original data set and manages them. The key added to each row of the de-identified data set is a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set that corresponds to that row of the de-identified data set. According to the present invention, de-identified data obtained by processing part of the original data is provided to a user and, when necessary, the original data behind the de-identified data can be easily identified and provided. In particular, the de-identified data provided to the user can be combined with other de-identified data by means of the key for combining de-identified data sets.

Description

Key Generating Apparatus and Method for Combining de-Identification Data Set

The present invention relates to an apparatus and method for generating a key for combining de-identified data sets.

Big data refers to structured or unstructured data sets that are so much larger than conventional data that they are difficult to collect, store, search, analyze, and visualize with earlier methods or tools. Big data processing and analysis technology can be applied in fields such as the public sector, science, medicine, wholesale and retail, manufacturing, and information and communications. In the information and communications field in particular, big data processing and analysis technologies concerning user behavior patterns, histories, and surrounding context are advancing, based on the user data generated in digital space by the development of mobile communications and the rapid growth in personal terminals.

However, such big data may contain personal information such as resident registration numbers, addresses, phone numbers, height, weight, and age, as well as other information that must not be distributed in its original form.

Accordingly, as in Korean Patent No. 1282705, approaches have been proposed that adopt a masking technique in which part of the personal information contained in an electronic document is replaced with another character string, so that converted data, in which part of the original data has been deleted or replaced, is separately generated and distributed.

However, when converted data is distributed in this way and a data user needs to see some content of the original data in its undeleted, unreplaced form, there has been no way to single out, from among the countless records in the converted big data, the data whose original content needs to be checked, and to request that check.

Therefore, the technical problem to be solved by the present invention is to provide an apparatus and method for generating a key for combining de-identified data sets, which provide a user with de-identified data obtained by processing part of the original data while allowing the original data behind the de-identified data to be easily identified and provided when necessary.

To solve the above technical problem, an apparatus for generating a key for combining de-identified data sets according to the present invention includes: an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database; a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and to which a key for combining de-identified data sets, capable of identifying a row or column, is added for each row or column; and a data management unit that assigns the same data set ID, which identifies the data set, to both the de-identified data set and the original data set and manages them. The key added to each row of the de-identified data set may be determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set.

The two or more columns may include at least one identifier attribute and at least one quasi-identifier attribute.

The data management unit may provide the de-identified data set to a given user and, upon receiving from that user an original data request containing a combination of a data set ID and a key for combining de-identified data sets, may provide the user with the original data of the row or column that corresponds to the key in the original data set corresponding to the data set ID.

The de-identified data set generation unit may process at least part of the original data included in the original data set so that it cannot be identified.

The data set ID may be determined as a value obtained by applying a non-invertible function to the file of the original data set.

To solve the above technical problem, a method for generating a key for combining de-identified data sets according to the present invention includes: generating an original data set consisting of rows and columns of original data; generating a de-identified data set in which at least part of the original data included in the original data set is processed and to which a key for combining de-identified data sets, capable of identifying a row or column, is added for each row or column; and assigning the same data set ID, which identifies the data set, to both the de-identified data set and the original data set and managing them. The key added to each row of the de-identified data set may be determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set.

The method may further include: providing the de-identified data set to a given user; receiving from that user an original data request containing a combination of a data set ID and a key for combining de-identified data sets; and providing the user with the original data of the row or column that corresponds to the key in the original data set corresponding to the data set ID.

According to the present invention, de-identified data obtained by processing part of the original data is provided to a user and, when necessary, the original data behind the de-identified data can be easily identified and provided.

In particular, the present invention provides a user with de-identified data processed from original data, and the provided data can be combined with other de-identified data by means of the key for combining de-identified data sets. In addition, the data set ID of the de-identified data enables integrity verification and tracking, and thereby follow-up management of the de-identified data.
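
The combining of de-identified data through the key can be pictured as an equality join on the key column. The sketch below is only an illustration under assumptions: the field name `key`, the two example data sets, and the function `join_on_key` are hypothetical and do not come from the patent.

```python
def join_on_key(left, right, key_field="key"):
    """Inner-join two lists of row dicts on a shared combining-key field."""
    # Index the right-hand rows by key for constant-time lookup.
    right_by_key = {row[key_field]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key_field])
        if match is not None:
            # Merge the two rows; the key field is identical in both.
            joined.append({**row, **match})
    return joined


# Hypothetical de-identified data sets from two different providers.
clinic_rows = [
    {"key": "k1", "age_band": "30-39"},
    {"key": "k2", "age_band": "40-49"},
]
insurer_rows = [
    {"key": "k1", "claims": 2},
    {"key": "k3", "claims": 1},
]

combined = join_on_key(clinic_rows, insurer_rows)
```

Only rows whose key appears in both data sets are kept, so the join works only if both providers derived the key from the same columns with the same function.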

FIG. 1 is a block diagram showing the configuration of an apparatus for generating a key for combining de-identified data sets according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the data management device of FIG. 1 in more detail.
FIG. 3 is a diagram illustrating an original data set and a de-identified data set according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a key for combining de-identified data sets according to another embodiment of the present invention.
FIG. 5 is a flowchart provided to explain a method for generating a key for combining de-identified data sets according to an embodiment of the present invention.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them.

FIG. 1 is a block diagram showing the configuration of an apparatus for generating a key for combining de-identified data sets according to an embodiment of the present invention.

Referring to FIG. 1, in the apparatus for generating a key for combining de-identified data sets according to the present invention, a plurality of user terminals (100a, 100b, …, 100n) and a data management device (200) are connected through a communication network (300) and can exchange various requests, information, and data.

The data management device (200) may generate a de-identified data set by processing at least part of the original data included in an original data set consisting of rows and columns, and provide it to the user.

The de-identified data set may be assigned a data set ID (DATA SET ID) that distinguishes the data set. In addition, a key for combining de-identified data sets, capable of identifying each row, may be added to each row of the de-identified data set. This key may be one that is not contained in the original data set. Depending on the embodiment, instead of adding the key to each row, a key may be added to each column so that each column of the de-identified data set can be identified.

The following description takes as an example the case where the key for combining de-identified data sets is added to each row of the de-identified data set.

When the data management device (200) receives a data set ID (DATA SET ID) and a key for combining de-identified data sets from a user, it may check whether that user is authorized to view the original data, then look up the original data corresponding to the data set ID and the key in the original data set and provide it.

The user terminals (100a, 100b, …, 100n) may receive data from the data management device (200) at the user's request, and may be implemented as terminals that have memory means and a microprocessor providing computing capability, such as desktop computers, notebook computers, workstations, palmtop computers, ultra mobile personal computers (UMPCs), tablet PCs, personal digital assistants (PDAs), web pads, mobile phones, and smartphones.

The communication network (300) may be any of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, and the like, and any communication method may be used, whether wired or wireless.

FIG. 2 is a block diagram showing the configuration of the data management device of FIG. 1 in more detail.

Referring to FIG. 2, the data management device (200) may include a database unit (210), an original data set generation unit (220), a data set storage unit (230), a data management unit (240), and a de-identified data set generation unit (250).

The database unit (210) builds and stores the data managed by the data management device (200) as a database, and may provide functions for collecting, storing, and managing the large data collections commonly called big data.

The original data set generation unit (220) generates an original data set consisting of rows and columns of original data extracted from the database unit (210). It may store the original data set in the data set storage unit (230).

The data set storage unit (230) stores the original data set generated by the original data set generation unit (220) and the de-identified data set generated by the de-identified data set generation unit (250), and may provide them at the request of the data management unit (240).

At a user's request, the data management unit (240) causes the original data set generation unit (220) to generate an original data set, and provides the de-identified data set into which the de-identified data set generation unit (250) has converted it.

When providing the de-identified data set to the user, the data management unit (240) may provide the data set ID along with it. The data set ID may be provided as the file name of the de-identified data set file, or it may be provided to the user separately from the file name. For example, if the de-identified data set is delivered by e-mail, the data set ID may be written in the body of the message and the de-identified data set sent as an attachment. Various methods may be used to provide the de-identified data set and the data set ID to the user.

Meanwhile, when the data management unit (240) receives an original data request containing a combination of the data set ID of a de-identified data set and a key for combining de-identified data sets, it may look up the corresponding original data in the original data set and provide it.

The de-identified data set generation unit (250) generates a de-identified data set by processing at least part of the original data included in the original data set generated by the original data set generation unit (220).

FIG. 3 is a diagram illustrating an original data set and a de-identified data set according to an embodiment of the present invention.

Referring to FIG. 3, the original data set generation unit (220) may extract original data from the database unit (210) to generate an original data set (10). As illustrated in FIG. 3, the original data set (10) may contain personal information such as a person's name (Name), address (Address), age (Age), height (Height), weight (Weight), and blood type (Blood).

Meanwhile, the data management unit (240) may apply a non-invertible function to the file of the original data set (10) extracted from the database unit (210) and manage the resulting value as the data set ID of that original data set (10). For example, a one-way hash algorithm may be applied as the non-invertible function to extract a hash value, which is then managed as the data set ID of the original data set (10). The same data set ID may also be assigned to the de-identified data set corresponding to the original data set. The description below covers the case where a hash function is used as the non-invertible function, but it should be understood that other irreversible functions may be applied instead.
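
As a minimal sketch of deriving a data set ID from the original data set file, the example below hashes the file bytes and records the same ID for both members of the pair. The patent names only "a one-way hash algorithm"; the choice of SHA-256, the CSV serialization, and the names `data_set_id` and `registry` are assumptions made for illustration.

```python
import hashlib


def data_set_id(file_bytes):
    """Return a hex digest used as the data set ID (SHA-256 is an assumption)."""
    return hashlib.sha256(file_bytes).hexdigest()


# Hypothetical CSV serialization of an original data set.
original_csv = b"Name,Address,Age\nHong Gildong,Seoul Gangnam-gu,34\n"
ds_id = data_set_id(original_csv)

# The same ID is recorded for the original and, once it exists,
# for the de-identified data set derived from it.
registry = {ds_id: {"original": original_csv, "deidentified": None}}
```

Since the ID depends on every byte of the file, recomputing the hash later and comparing it with the stored ID gives a simple check that the original file is unchanged.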

De-identification measures are required in order to prevent infringement of personal information in the course of using big data. Accordingly, the de-identified data set generation unit (250) may generate a de-identified data set (20) by processing at least part of the original data included in the original data set so that it cannot be identified, as illustrated in FIG. 3.

FIG. 3 illustrates a de-identified data set (20) in which information such as a person's name (Name), address (Address), age (Age), height (Height), weight (Weight), and blood type (Blood) has been made unidentifiable: part of the name is masked with "**", the blood type is masked entirely with "*", and the detailed address is deleted and generalized to a city-level address. Likewise, age, height, weight, and the like may be grouped into fixed ranges so that the personal information cannot be identified.
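
The masking and grouping operations described for FIG. 3 could look roughly like the following sketch. The exact masking rule for names, the 10-unit range width, the assumption that the address string starts with the city, and the function names `band` and `deidentify_row` are all illustrative choices, not details taken from the patent.

```python
def band(value, width=10):
    """Generalize a number to a range label such as '30-39'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def deidentify_row(row):
    """Return a de-identified copy of one row; the original is left untouched."""
    return {
        "Name": row["Name"][:1] + "**",              # mask all but the first character
        "Address": row["Address"].split(" ", 1)[0],  # keep only the city-level part
        "Age": band(row["Age"]),
        "Height": band(row["Height"]),
        "Weight": band(row["Weight"]),
        "Blood": "*",                                # mask the whole value
    }


# Hypothetical original row with the columns named in FIG. 3.
original = {
    "Name": "Hong Gildong",
    "Address": "Seoul Gangnam-gu Teheran-ro 1",
    "Age": 34, "Height": 172, "Weight": 68, "Blood": "A",
}
masked = deidentify_row(original)
```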

In particular, according to the present invention, when the original data set (10) is processed into the de-identified data set (20), a key (22) for combining de-identified data sets, which was not present in the original data set (10), may be added to each row. The de-identified data set (20) may also be assigned the same data set ID (21) as the original data set (10).

The key (22) for combining de-identified data sets may be assigned to each row by the method described below. For each row of the original data set (10), the de-identified data set generation unit (250) receives a selection of one or more of the columns belonging to that row, and may set the value obtained by applying a non-invertible function to the values of the selected columns as the key (22) for that row. As described above for the data set ID, a hash algorithm may be used as the non-invertible function. For example, when 'Age' is selected as the column for generating the key (22), the hash value obtained by applying the hash algorithm to the value in the 'Age' column of the original data set (10) may be set, for each row, as the key (22) for that row. Depending on the embodiment, multiple columns may be selected for generating the key (22), and the columns selected may differ from row to row according to a given criterion.
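
A minimal sketch of this per-row key derivation is shown below, assuming SHA-256 as the hash algorithm and a `|` separator when several columns are selected. Both choices, along with the function name `combining_key` and the sample rows, are assumptions for illustration only.

```python
import hashlib


def combining_key(row, columns):
    """Hash the selected column values of one original row (SHA-256 assumed)."""
    # A separator keeps ("1", "23") and ("12", "3") from hashing alike.
    material = "|".join(str(row[name]) for name in columns)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Hypothetical original rows.
original_rows = [
    {"Name": "Hong Gildong", "Age": 34},
    {"Name": "Kim Cheolsu", "Age": 41},
]

# Key derived from the 'Age' column only, as in the example in the text.
keys = [combining_key(row, ["Age"]) for row in original_rows]
```

Because equal inputs hash to equal outputs, two rows with the same age would receive the same key under this choice, so a column or combination of columns with distinct values per row is needed if the key must identify a single row.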

FIG. 4 is a diagram illustrating a key for combining de-identified data sets according to another embodiment of the present invention.

Referring to FIG. 4, the key (22) for combining de-identified data sets may be assigned by another method, described below. For each row of the original data set (10), the de-identified data set generation unit (250) receives a selection of two or more of the columns belonging to that row, and may set the value obtained by applying a non-invertible function to partial values of the two or more selected columns as the key (22) for that row. Here, the two or more columns may include at least one identifier attribute and at least one quasi-identifier attribute; for example, the identifier attribute may include a resident registration number, and the quasi-identifier attribute may include a phone number, age, or postal code. A hash algorithm or the like may be used as the non-invertible function. For example, when the identifier attribute 'resident registration number' and the quasi-identifier attribute 'phone number' are selected as the columns for generating the key (22), the hash value obtained by applying the hash algorithm to the partial value (22a) consisting of the first six digits of the resident registration number and the partial value (22b) consisting of the last four digits of the phone number may be set as the key (22) for that row.
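
The FIG. 4 example can be sketched as follows: take the first six digits of the resident registration number (22a) and the last four digits of the phone number (22b), concatenate them, and hash the result. SHA-256, the plain concatenation, the stripping of non-digit characters, and the name `partial_key` are assumptions; the sample numbers are made up.

```python
import hashlib
import string


def partial_key(resident_number, phone_number):
    """Derive a combining key from partial values of two columns."""
    # Drop hyphens and other formatting so only ASCII digits remain.
    rrn_digits = "".join(ch for ch in resident_number if ch in string.digits)
    phone_digits = "".join(ch for ch in phone_number if ch in string.digits)
    part_a = rrn_digits[:6]     # partial value 22a: first six digits
    part_b = phone_digits[-4:]  # partial value 22b: last four digits
    return hashlib.sha256((part_a + part_b).encode("ascii")).hexdigest()


# Made-up example values, formatted two different ways.
key_1 = partial_key("900101-1234567", "010-1234-5678")
key_2 = partial_key("9001011234567", "01012345678")
```

Both calls hash the same ten digits, so two data holders that format these columns differently would still arrive at the same key for the same person.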

The data management unit (240) may manage the data set ID assigned to the de-identified data set in association with the original data set.

Meanwhile, when the data management unit (240) receives from a user terminal (100a, 100b, …, 100n) an original data request containing a de-identified data set ID and a key for combining de-identified data sets, it may use the ID to determine which de-identified data set is meant and use the key to determine which row of that data set the request concerns. It may then look up the original data of that row in the original data set corresponding to the de-identified data set and provide it to the user terminal (100a, 100b, …, 100n).
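
The request handling described here amounts to a two-level lookup: the data set ID selects the original data set and the key selects the row. The sketch below adds the authorization check mentioned earlier for the data management device (200). The class `DataManager`, its method names, the in-memory dictionaries, and the permission model are hypothetical stand-ins, not the patent's implementation.

```python
class DataManager:
    """Toy stand-in for the data management unit (240); all names are hypothetical."""

    def __init__(self):
        self._originals = {}       # data set ID -> {combining key -> original row}
        self._permissions = set()  # (user, data set ID) pairs allowed to see originals

    def register(self, data_set_id, rows_by_key):
        self._originals[data_set_id] = dict(rows_by_key)

    def grant(self, user, data_set_id):
        self._permissions.add((user, data_set_id))

    def request_original(self, user, data_set_id, key):
        # Check the requester's authority before touching the original data.
        if (user, data_set_id) not in self._permissions:
            raise PermissionError("user may not view original data")
        # The ID selects the data set; the key selects the row within it.
        return self._originals[data_set_id][key]


manager = DataManager()
manager.register("ds-001", {"key-a": {"Name": "Hong Gildong", "Age": 34}})
manager.grant("analyst", "ds-001")
row = manager.request_original("analyst", "ds-001", "key-a")
```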

FIG. 5 is a flowchart provided to explain a method for generating a key for combining de-identified data sets according to an embodiment of the present invention.

Referring to FIG. 5, the original data set generation unit (220) may first generate an original data set consisting of rows and columns of original data extracted from the database unit (210) (S510).

Next, the de-identified data set generation unit (250) may process at least part of the original data included in the original data set generated in step S510 so that it cannot be identified, and may add to each row a key for combining de-identified data sets that identifies the row, thereby generating a de-identified data set (S520). In step S520, the key may instead be added to each column, so that columns can be identified, when generating the de-identified data set.

The data management unit (240) may assign the same data set ID, which identifies the data set, to the original data set and its corresponding de-identified data set and manage them (S530). Specifically, the data management unit (240) may assign to the original data set and the de-identified data set a data set ID that distinguishes that pair from other pairs of original and de-identified data sets. The data set ID may be the hash value obtained by applying a hash algorithm to the original data set.

The data management unit (240) may then provide the de-identified data set to a given user (S540). In step S540, the data management unit (240) may provide the data set ID along with the de-identified data set. The data set ID may be provided as the file name of the de-identified data set file, or it may be provided to the user separately from the file name. For example, if the de-identified data set is delivered by e-mail, the data set ID may be written in the body of the message and the de-identified data set sent as an attachment.

Thereafter, when the data management unit (240) receives from the user an original data request containing a combination of the data set ID of the de-identified data set and a key for combining de-identified data sets (S550), it may look up the corresponding original data in the original data set and provide it (S560).

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program for executing the method for generating a key for combining de-identified data sets described above. The medium may contain program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; floptical disks and magneto-optical media; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory. Alternatively, the medium may be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although preferred embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

(Deleted)

An apparatus for generating a key for combining de-identified data sets, comprising:
an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database;
a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
a data management unit that assigns the same data set ID for identifying a data set to the de-identified data set and the original data set and manages them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the two or more columns include at least one identifier attribute and at least one quasi-identifier attribute.

An apparatus for generating a key for combining de-identified data sets, comprising:
an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database;
a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
a data management unit that assigns the same data set ID for identifying a data set to the de-identified data set and the original data set and manages them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the data management unit provides the de-identified data set to a predetermined user and, upon receiving from the predetermined user an original data request including a combination of a data set ID and a de-identified data set combining key, provides to the predetermined user the original data of the row or column corresponding to the de-identified data set combining key in the original data set corresponding to the data set ID.

An apparatus for generating a key for combining de-identified data sets, comprising:
an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database;
a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
a data management unit that assigns the same data set ID for identifying a data set to the de-identified data set and the original data set and manages them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the de-identified data set generation unit processes at least part of the original data included in the original data set so that it cannot be identified.

An apparatus for generating a key for combining de-identified data sets, comprising:
an original data set generation unit that generates an original data set consisting of rows and columns of original data extracted from a database;
a de-identified data set generation unit that generates a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
a data management unit that assigns the same data set ID for identifying a data set to the de-identified data set and the original data set and manages them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the data set ID is determined as a value obtained by applying a non-invertible function to the file of the original data set.

(Deleted)

A method for generating a key for combining de-identified data sets, comprising:
generating an original data set consisting of rows and columns of original data;
generating a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
assigning the same data set ID for identifying a data set to the de-identified data set and the original data set and managing them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the two or more columns include at least one identifier attribute and at least one quasi-identifier attribute.

A method for generating a key for combining de-identified data sets, comprising:
generating an original data set consisting of rows and columns of original data;
generating a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
assigning the same data set ID for identifying a data set to the de-identified data set and the original data set and managing them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set,
the method further comprising:
providing the de-identified data set to a predetermined user;
receiving from the predetermined user an original data request including a combination of a data set ID and a de-identified data set combining key; and
providing to the predetermined user the original data of the row or column corresponding to the de-identified data set combining key in the original data set corresponding to the data set ID.

A method for generating a key for combining de-identified data sets, comprising:
generating an original data set consisting of rows and columns of original data;
generating a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
assigning the same data set ID for identifying a data set to the de-identified data set and the original data set and managing them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the de-identified data set is generated by processing at least part of the original data included in the original data set so that it cannot be identified.

A method for generating a key for combining de-identified data sets, comprising:
generating an original data set consisting of rows and columns of original data;
generating a de-identified data set in which at least part of the original data included in the original data set is processed and a de-identified data set combining key capable of identifying a row or column is added to each row or column; and
assigning the same data set ID for identifying a data set to the de-identified data set and the original data set and managing them,
wherein the de-identified data set combining key added to each row of the de-identified data set is determined as a value obtained by applying a non-invertible function to partial values of two or more columns selected from the row of the original data set corresponding to that row of the de-identified data set, and
wherein the data set ID is determined as a value obtained by applying a non-invertible function to the file of the original data set.