KR20210110765A

KR20210110765A - Method for providing ai-based big data de-identification solution

Info

Publication number: KR20210110765A
Application number: KR1020200025678A
Authority: KR
Inventors: 김광민
Original assignee: 이지스케이 주식회사
Priority date: 2020-03-01
Filing date: 2020-03-01
Publication date: 2021-09-09

Abstract

Disclosed is a method for providing an artificial intelligence-based big data de-identification solution, the method comprising: collecting big data; identifying personal information included in the big data; and de-identifying the identified personal information. The disclosed method can thereby safely process unstructured and semi-structured personal information.

Description

{METHOD FOR PROVIDING AI-BASED BIG DATA DE-IDENTIFICATION SOLUTION}

The present invention relates to a method of providing an artificial intelligence-based big data de-identification solution.

With the passage of the three data laws (the "Data 3 Act") on January 9, 2020, de-identification of personal information has become an essential measure for the use of big data, and any personal information that is provided must first be de-identified.

Given the recent trend toward a data economy, the use of big data is unavoidable, and the resulting problems of personal information infringement continue to grow. A personal information de-identification solution is therefore necessary. Because the manual processing offered by existing solutions has limitations, an improved, intelligent processing pipeline is needed.

The security market that includes personal information de-identification solutions is currently at an early stage. Because such a solution can be applied to every field that uses data, its market value is expected to keep rising.

A personal information de-identification solution is an essential requirement for the development of the big data industry. Personal information de-identification refers to processing such as pseudonymization, aggregation, data deletion, data categorization, and data masking that is applied when personal information is used.

The personal information de-identification solution aims to de-identify such personal information safely and quickly, in accordance with the personal information de-identification guidelines, through an automated process equipped with artificial intelligence.

The problem to be solved by the present invention is to provide an artificial intelligence-based big data de-identification solution.

The invention aims to present a way to use personal information safely through de-identification, in a situation where the revision of the Data 3 Act raises concern that big data use will expose personal information.

Specifically, when data in an unstructured format is collected, morphemes are separated through pre-processing, and personal information is identified in real time from the processed data. The identified personal information is then quickly and safely de-identified by applying a configured de-identification model, and is stored and used.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

Personal information that is easily missed in the structured, unstructured, and semi-structured data targeted by big data collection is accurately identified and de-identified through AI machine learning based on natural language processing. To improve the speed of personal information detection and de-identification, the solution uses in-memory Spark distributed processing, CDC technology, and the KorBERT module, a core natural language processing component, and can thereby offer better performance than existing competing products.

In addition, a personal information de-identification adequacy evaluation module is developed that evaluates adequacy by simulating personal information inference attacks such as linkage attacks, homogeneity attacks, background knowledge attacks, skewness attacks, and similarity attacks. Integration with other solutions through a RESTful API allows stable interworking without changing the existing environment.

Other specific details of the invention are included in the detailed description and drawings.

According to the disclosed embodiment, not only structured personal information but also unstructured and semi-structured personal information can be identified and safely processed.

In addition, because personal information can be de-identified quickly and safely by an automated system, the solution not only protects personal information but can also be expected to yield a high economic benefit.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIGS. 1 to 3 are diagrams illustrating a big data de-identification solution according to the disclosed embodiment.
FIG. 4 is a diagram illustrating the configuration of a de-identification model according to an embodiment.
FIG. 5 is a diagram illustrating an adequacy determination model according to an embodiment.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the present invention pertains. The present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. As used herein, the singular also includes the plural unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than the stated components. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the recited elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are not limited by these terms. These terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which this invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined.

As used herein, the term "unit" or "module" refers to software or a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, a "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or configured to run on one or more processors. Thus, by way of example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

In this specification, a computer means any kind of hardware device that includes at least one processor and, depending on the embodiment, may also be understood to encompass the software running on that hardware device. For example, a computer may be understood to include, without limitation, smartphones, tablet PCs, desktops, notebooks, and the user clients and applications running on each device.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the entity performing each step is not limited thereto, and depending on the embodiment at least some of the steps may be performed on different devices.

FIGS. 1 to 3 are diagrams illustrating a big data de-identification solution according to the disclosed embodiment.

The method of providing a big data de-identification solution according to the disclosed embodiment may include the various technical elements described below. The technical elements may operate individually or in combination, and the specific manner of combination is not limited.

In an embodiment, the computer may perform personal information identification based on natural language processing. For example, the computer identifies personal information using existing detection patterns together with learning based on natural language processing.

Existing products generally find and de-identify data records that are stored in a database or that follow a structured format. In contrast, the solution according to the disclosed embodiment performs primary personal information detection with a regular expression pattern engine, additionally uses NLP (Natural Language Processing) to identify personal information in unstructured and semi-structured data and personal information that is otherwise difficult to search for, and, for images, recognizes the text through OCR (Optical Character Recognition) before analyzing it.

Primary personal information detection uses logic that detects well-known structured items such as resident registration numbers, passport numbers, driver's license numbers, alien registration numbers, phone numbers, card numbers, and account numbers.
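The primary detection stage described above can be illustrated with a small regular expression scanner. This is a minimal sketch, not the patented engine: the `PATTERNS` table, the `detect_structured_pii` function, and the specific number formats are assumptions chosen for illustration, and a production rule set would also need checksum validation and many more format variants.

```python
import re

# Hypothetical patterns for illustration only (assumed formats, no checksum validation).
PATTERNS = {
    "resident_registration_number": re.compile(r"\b\d{6}-[1-8]\d{6}\b"),
    "phone_number": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"),
    "card_number": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}


def detect_structured_pii(text):
    """Return (label, matched_text, start_offset) tuples sorted by offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    sample = "Contact 010-1234-5678, RRN 900101-1234567, card 1234-5678-9012-3456."
    for label, value, offset in detect_structured_pii(sample):
        print(offset, label, value)
```

Keeping the patterns in a dictionary keyed by label makes it straightforward to add the other item types listed above (passport, driver's license, account numbers) without changing the scanning loop.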

Secondary personal information detection uses context analysis through AI-based data mining to detect personal information that occurs in unstructured form, beyond the predefined personal information items.

For example, personal information can be identified by analyzing the classification and frequency of phrases produced by natural language processing, using techniques such as a term-document matrix (TDM) and TF-IDF. Personal information items that are difficult to find with fixed patterns include, without limitation, beliefs, religion, hobbies, habits, sex life, location, daily movement routes, personal relationships, and consumption patterns. Even personal information that is not specified by law must be de-identified if it can identify an individual.
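As a rough illustration of the term-document matrix and TF-IDF analysis mentioned above, the following sketch computes both in plain Python. The function names, the already-tokenized input, and the weighting formula tf * log(N / df) are assumptions; the source does not specify a tokenizer or a particular TF-IDF variant.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document matrix as {term: [count in doc 0, doc 1, ...]}."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    return {term: [c[term] for c in counts] for term in vocab}


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tdm = term_document_matrix(docs)
    n_docs = len(docs)
    weights = [dict() for _ in docs]
    for term, row in tdm.items():
        doc_freq = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / doc_freq)
        for i, count in enumerate(row):
            if count:
                # Term frequency is normalized by document length.
                weights[i][term] = (count / len(docs[i])) * idf
    return weights


if __name__ == "__main__":
    # Each document is a list of tokens, as produced by a morpheme analyzer.
    docs = [
        ["visit", "church", "sunday"],
        ["visit", "bank", "monday"],
        ["visit", "gym", "sunday"],
    ]
    for doc_weights in tf_idf(docs):
        print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

Terms that appear in every document receive a weight of zero, while rare terms such as a place of worship stand out, which is the property that makes the weights usable as a signal for context-dependent personal information.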

In an embodiment, the computer performs personal information detection using AI learning techniques.

For example, the computer may analyze personal information patterns using data mining techniques. Specifically, in unstructured data, personal information that is difficult to find with a specific pattern can be distinguished through data mining, and phrases parsed by natural language processing can be processed into a meaningful form so that patterns are extracted in line with the flow of the context. In addition, a GAN (Generative Adversarial Network), a representative unsupervised learning method, may be used as the analysis algorithm, but the analysis algorithm is not limited thereto.

In an embodiment, the computer may perform real-time personal information detection and event processing using CDC (Change Data Capture) technology.

For example, the computer may capture collected events and match the changed content against the analysis engine in real time to decide whether it needs processing. In a large-scale big data environment, real-time processing is an important factor in performance, and the solution according to the disclosed embodiment can be designed for higher throughput than previously released personal information de-identification solutions. In addition, the event processing status can be displayed in real time through a dashboard or a visualization tool.

With CDC technology, personal information is detected with regular expressions when a large volume of data or new personal information is first collected, and for data that is changed or created afterward only the changed content is captured for detection and analysis.

Repeatedly scanning a large volume of data for personal information is inefficient and costly in performance and time. By applying CDC technology, only the changed event log is checked, so personal information can be identified in real time.
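The idea of scanning only changed events can be sketched as follows. This is a simplified stand-in for CDC, assuming a hypothetical in-memory change log whose events carry `seq`, `op`, `key`, and `value` fields; a real CDC pipeline would typically read the database's transaction log instead. The `PHONE` pattern and `scan_changes` function are likewise illustrative names.

```python
import re

# Illustrative detection rule only; any detector could be plugged in here.
PHONE = re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b")


def scan_changes(change_log, last_seen_seq):
    """Scan only the change events newer than last_seen_seq.

    change_log: iterable of dicts with the hypothetical keys
    'seq', 'op' ('insert' | 'update' | 'delete'), 'key', and 'value'.
    Returns (findings, new_last_seen_seq).
    """
    findings = []
    newest = last_seen_seq
    for event in change_log:
        if event["seq"] <= last_seen_seq:
            continue  # already processed in an earlier pass
        newest = max(newest, event["seq"])
        if event["op"] == "delete":
            continue  # a deletion introduces no new content to inspect
        if PHONE.search(event["value"]):
            findings.append((event["seq"], event["key"]))
    return findings, newest


if __name__ == "__main__":
    log = [
        {"seq": 1, "op": "insert", "key": "a", "value": "call 010-1111-2222"},
        {"seq": 2, "op": "update", "key": "b", "value": "no personal data"},
        {"seq": 3, "op": "update", "key": "a", "value": "new 010-3333-4444"},
        {"seq": 4, "op": "delete", "key": "b", "value": ""},
    ]
    print(scan_changes(log, last_seen_seq=1))
```

Persisting the returned sequence number between passes is what lets each pass skip everything that was already inspected.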

In an embodiment, to differentiate the personal information de-identification model, a differential privacy model is added to the existing k-anonymity, L-diversity, and T-closeness privacy models.

FIG. 4 is a diagram illustrating the configuration of a de-identification model according to an embodiment.

Many companies currently perform de-identification using only the representative privacy models of k-anonymity, L-diversity, and T-closeness. The solution according to the disclosed embodiment adds a differential privacy technique to apply stronger de-identification.

Specifically, safe de-identification is difficult with the k-anonymity, L-diversity, and T-closeness privacy models alone, and an additional method is required. Differential privacy is therefore applied: noise is added to each response value so that the distributions of the response values differ by no more than a set bound, which protects the data.
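One common way to add noise to a response value is the Laplace mechanism, sketched below for a counting query. The source does not say which mechanism the solution uses, so this is an assumed example; `laplace_noise`, `private_count`, and the `epsilon` parameter are illustrative names. The noise scale 1/epsilon follows from the fact that a count changes by at most 1 when one record is added or removed.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy predicate.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1 / epsilon is added to the true count.
    """
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67]
    noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0)
    print(f"noisy count of people aged 40 or over: {noisy:.2f}")
```

A smaller `epsilon` gives stronger protection at the cost of noisier answers, so the value is a policy decision rather than a constant.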

For example, existing privacy models may be vulnerable to linkage attacks, which combine and analyze different data sets (for example, linking public medical data with an electoral roll to extract personal information).

k-anonymity can be defeated by homogeneity attacks and background knowledge attacks, and L-diversity is used to address this. Under L-diversity, the records that are de-identified together into an equivalence class (a set of records grouped under the same values by k-anonymity) must contain at least L distinct sensitive values.
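The two conditions above can be checked mechanically. The following is a minimal sketch assuming records are dictionaries and that the caller names the quasi-identifier and sensitive columns; the function and parameter names (`is_k_anonymous`, `is_l_diverse`, `min_distinct`) are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_identifiers)].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class holds at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, min_distinct):
    """True if every class has at least min_distinct sensitive values."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= min_distinct
        for group in groups.values()
    )


if __name__ == "__main__":
    rows = [
        {"age": "30-39", "zip": "130**", "disease": "flu"},
        {"age": "30-39", "zip": "130**", "disease": "flu"},
        {"age": "40-49", "zip": "148**", "disease": "cancer"},
        {"age": "40-49", "zip": "148**", "disease": "flu"},
    ]
    print(is_k_anonymous(rows, ["age", "zip"], k=2))
    print(is_l_diverse(rows, ["age", "zip"], "disease", min_distinct=2))
```

In the sample data the first class is 2-anonymous but contains only one disease, which is exactly the homogeneity weakness that the diversity check exposes.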

However, L-diversity is itself vulnerable to skewness attacks and similarity attacks, and T-closeness is applied to defend against these attacks on L-diversity. T-closeness requires that the distribution of a given piece of sensitive information within each equivalence class differ from its distribution in the entire data set by no more than t.
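A check of this distribution condition might look like the sketch below. It measures the gap between distributions with the variational (total variation) distance, which is an assumption made for simplicity; T-closeness is commonly defined with the Earth Mover's Distance, and the two coincide only when all pairs of categorical values are treated as equally far apart. All function names are illustrative.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Return the relative frequency of each value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any class distribution and the overall one."""
    overall = distribution([record[sensitive] for record in records])
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    return max(
        variational_distance(distribution(values), overall)
        for values in groups.values()
    )


def is_t_close(records, quasi_identifiers, sensitive, t):
    """True if no equivalence class deviates from the overall data by more than t."""
    return max_class_distance(records, quasi_identifiers, sensitive) <= t


if __name__ == "__main__":
    rows = [
        {"age": "30-39", "disease": "flu"},
        {"age": "30-39", "disease": "flu"},
        {"age": "40-49", "disease": "cancer"},
        {"age": "40-49", "disease": "flu"},
    ]
    print(max_class_distance(rows, ["age"], "disease"))
    print(is_t_close(rows, ["age"], "disease", t=0.3))
```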

Furthermore, the solution according to the disclosed embodiment applies a differential privacy model, which may use techniques such as those in the table below.

| Type | Technique | Description |
|---|---|---|
| Noise addition | PCA Bayesian estimation | 1. Extract original data X; 2. Apply PCA; 3. Transform to Xp; 4. Add Laplace noise; 5. Generate perturbed data Y |
| Noise addition | Noise averaging | Measures the distance to the mean of a group of entries rather than computing the distance for each entry in the perturbed data |
| Noise addition | Region-based perturbation | Divides time-series data into matching and non-matching regions and adds more noise to the matching regions |
| Compression-based perturbation | DFT (Discrete Fourier Transform) | Performs data mining using only a few Fourier coefficients that reflect the characteristics of the original data, rather than the entire data |
| Compression-based perturbation | DWT (Discrete Wavelet Transform) | Perturbs only the significant coefficients whose amplitude exceeds a set threshold |
| Geometric perturbation | Rotation perturbation | Perturbs data through rotation, translation, and similar transformations |
| Geometric perturbation | Multiple rotation | Normalize the original data -> generate n random seeds -> build orthogonal matrices -> rotate the original data with the orthogonal matrices -> generate perturbed data Y |
| Geometric perturbation | Condensation perturbation | Divides the original data into k condensation groups, selects the central object of each group, and regenerates the remaining k-1 objects so that their distribution is similar to the original data |
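To make the rotation perturbation row concrete, the following sketch rotates two-dimensional records by a secret angle and shows that pairwise distances are preserved, which is why distance-based mining still works on the perturbed data. The two-dimensional setting and the function names are assumptions for illustration, and a rotation on its own is a data perturbation technique rather than a formal differential privacy guarantee.

```python
import math


def rotate(points, angle_rad):
    """Rotate 2-D points about the origin by angle_rad radians."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    """Euclidean distance between every unordered pair of points."""
    return [
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    ]


if __name__ == "__main__":
    original = [(1.0, 2.0), (4.0, 6.0), (0.0, -3.0)]
    perturbed = rotate(original, angle_rad=1.0)  # the angle acts as the secret
    print(perturbed)
    print(pairwise_distances(original))
    print(pairwise_distances(perturbed))
```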

In an embodiment, the computer ensures that the usage history of personal information is traceable. For example, to track how privacy-protected personal information has been used, the computer keeps an audit trail of what was provided whenever de-identified personal information is used, and logs the related details so that various statistics can be produced. In addition, visualization tools can be used to provide an intuitive GUI that supports multidimensional analysis.

In addition, according to the disclosed embodiment, a function for reviewing whether de-identification is adequate may be provided.

FIG. 5 is a diagram illustrating an adequacy determination model according to an embodiment.

For example, after personal information has been de-identified, sample data can be passed to the adequacy engine. If the de-identification of the submitted data is adequate, the data can be used immediately; if it is not, de-identification is performed again. By building this adequacy verification logic into the system, adequacy can be verified automatically, and the system can also determine whether the data is vulnerable to attacks on de-identified data such as linkage attacks, homogeneity attacks, background knowledge attacks, skewness attacks, and similarity attacks.
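The verify-and-redo loop described above can be sketched with a single attribute. In this assumed example, ages are generalized into bands, the adequacy check is a simple k-anonymity test on the bands, and a wider band is tried whenever the check fails. The band widths, the function names, and the choice of k-anonymity as the adequacy criterion are all illustrative; the source's adequacy engine also simulates attacks, which is not modeled here.

```python
from collections import Counter


def generalize_age(age, width):
    """Map an age to a band label such as '20-29' for width 10."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify_until_adequate(ages, k, widths=(5, 10, 20, 50)):
    """Widen the age bands until every band holds at least k records.

    Returns (width_used, banded_values); raises ValueError when none of
    the configured widths passes the adequacy check.
    """
    for width in widths:
        bands = [generalize_age(age, width) for age in ages]
        if min(Counter(bands).values()) >= k:
            return width, bands  # adequate: ready for use
        # Not adequate: fall through and de-identify again more coarsely.
    raise ValueError("no configured width satisfies the adequacy check")


if __name__ == "__main__":
    ages = [21, 23, 27, 34, 36, 38]
    print(deidentify_until_adequate(ages, k=2))
```

Raising an error when no setting passes, instead of returning the last attempt, keeps inadequately de-identified data from being released by accident.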

In addition, an export function according to the disclosed embodiment may be provided. The export function applies de-identification such as aggregation, averaging, categorization, anonymization, deletion, or masking to the identified personal information and then delivers the result to the target destination. Delivery can use whichever method suits the environment, such as API integration, FTP, SFTP, or database migration.
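A few of the de-identification operations applied before export can be sketched as small functions. These are assumed, simplified versions: `mask` keeps a fixed-length prefix, `pseudonymize` uses a keyed HMAC-SHA256 digest as one possible pseudonymization scheme (the source does not name one), and `aggregate` replaces each value with the group average.

```python
import hashlib
import hmac
from statistics import mean


def mask(value, visible=3, fill="*"):
    """Keep the first `visible` characters and mask the rest."""
    return value[:visible] + fill * max(len(value) - visible, 0)


def pseudonymize(value, secret_key):
    """Replace a value with a keyed, repeatable pseudonym (assumed scheme)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def aggregate(values):
    """Replace every value with the group average."""
    average = mean(values)
    return [average] * len(values)


if __name__ == "__main__":
    key = b"example-key-for-illustration-only"
    print(mask("010-1234-5678"))
    print(pseudonymize("Hong Gildong", key))
    print(aggregate([10, 20, 30]))
```

Because the pseudonym depends on the secret key, the same person maps to the same token across exports while the mapping cannot be recomputed without the key.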

In addition, a complete deletion function may be provided according to the disclosed embodiment. Complete deletion is an optional function that controls whether logs are deleted after de-identified personal information has been provided, and the administrator can choose whether to enable it. For example, complete deletion may be required when personal information is disposed of at the end of its life cycle.

In addition, a management screen according to the disclosed embodiment may be developed and provided.

For example, the basic design intent of the administrator screen is an HTML5-based UI that is concise, intuitive, and easy to use, with an emphasis on UI consistency and ease of maintenance. The features of HTML5 include compatibility (backward compatibility with existing HTML and compatibility with RESTful APIs), practical design (security and useful new functions), optimization for mobile environments (no unnecessary agent installation, because ActiveX and plug-ins are not required), and support for multidimensional graphs and visualization (well suited to presenting analysis results). In addition, a policy setting function may be provided through the administrator screen.

For example, the administrator screen may provide functions such as those in the table below.

| Category | Detailed function | Features |
|---|---|---|
| Management functions | Administrator authentication | ID/password, certificate, biometric authentication, etc. |
| Management functions | Administrator authorization | Privileges are graded by level: super administrator, senior, intermediate, junior, etc. |
| Management functions | Settings backup | Backs up settings such as configured policies and IP information |
| Management functions | Account and password rules | Reflects the rules required by certification guidelines |
| Management functions | Access control | Only authorized accounts, IP addresses, MAC addresses, and devices can access |
| Policy settings | De-identification scheduling | Selects the job type (apply immediately or run at a specific time) |
| Policy settings | De-identification method selection | Selects the method: aggregation, categorization, anonymization, pseudonymization, masking, deletion, etc. |
| Policy settings | De-identification adequacy verification | Verifies the de-identified data; re-runs de-identification depending on the result of the safety verification |
| Policy settings | Data delivery schedule | Delivers de-identified data; immediate delivery or scheduled delivery |
| Policy settings | Immediate de-identification | Immediately de-identifies only content that has not yet been de-identified |
| Policy settings | Complete deletion | Personal information whose life cycle or purpose of use has expired must be completely deleted; uses degaussing or a secure wiping tool |
| Other | Audit logging | Administrator activity log, system log, de-identification processing log, data provision log, etc. |
| Other | Integration settings | RESTful API integration with other solutions |
| Other | Statistics and reporting | Statistics and reporting functions for various analyses |

In addition, a monitoring function according to the disclosed embodiment may be provided. Various forms of monitoring may be provided, such as server status monitoring, policy violation monitoring, de-identification processing monitoring, and data provision status monitoring, and the monitoring function may also be built with various openly distributed tools.

In addition, a RESTful API integration function may be provided. It uses the HTTP protocol so that the existing infrastructure can be used as is, and it clearly separates the roles of server and client. Applying HTTP caching improves overall response time, performance, and server resource utilization. The design extends easily to multiple layers, and structural flexibility can be secured by placing security, load balancing, encryption, user authentication, and similar functions in front of it.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium in order to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements. Similarly, embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, including various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

1. A method of providing an artificial intelligence-based big data de-identification solution, the method comprising:
collecting big data;
identifying personal information included in the big data; and
de-identifying the identified personal information.

2. The method of claim 1, wherein identifying the personal information comprises using natural language processing to identify not only structured personal information but also unstructured and semi-structured personal information.

3. The method of claim 1, wherein de-identifying the personal information comprises de-identifying the personal information using a k-anonymity model, an L-diversity model, a T-closeness model, and a differential privacy model.