KR102192235B1 - Device for providing digital document de-identification service based on visual studio tools for office

Info

Publication number: KR102192235B1
Application number: KR1020200055713A
Authority: KR
Inventors: 임성진
Original assignee: 지엔소프트(주)
Priority date: 2020-05-11
Filing date: 2020-05-11
Publication date: 2020-12-17

Abstract

Provided is a VSTO-based electronic document de-identification service providing device comprising: a search unit that searches for at least one type of personal information when a de-identification menu is selected in an electronic document including at least one of text, an image, and a hyperlink; a display unit that tallies the found personal information by type and displays the result; a format selection unit that receives a selection of one of at least one de-identification format for de-identifying the found personal information; and a processing unit that de-identifies the at least one type of personal information in the selected format.

Description

VSTO-based electronic document de-identification service providing device {DEVICE FOR PROVIDING DIGITAL DOCUMENT DE-IDENTIFICATION SERVICE BASED ON VISUAL STUDIO TOOLS FOR OFFICE}

The present invention relates to a device for providing a VSTO-based electronic document de-identification service, and provides a platform that can identify, extract, and de-identify personal information in an electronic document using a function built into the program used to write the document.

Around the world, the use of data is running into personal information issues. The European General Data Protection Regulation (GDPR) requires strong personal information measures, such as a broader scope of protection and heavier penalties for violations, notice and prior consent when personal information is collected and processed, and a guaranteed right to be forgotten. It also distinguishes anonymisation from pseudonymisation, requiring that information be deleted or converted into an unrecognizable form so that a specific person cannot be identified from information about an individual. Korea has established standards for non-real-name processing; however, because deleting or de-identifying too much information reduces its value as big data, the recent passage of the three data bills (the "Data 3 Act") has made it possible to use pseudonymized information, that is, personal information that has been de-identified and pseudonymized, or to provide it to a third party, without the prior consent of the data subject.

Methodologies for de-identification have been disclosed in this context. For example, the prior art Korean Patent Registration No. 10-2067926 (published on January 17, 2020) discloses a configuration that receives an electronic document containing text information from a user; identifies de-identification target information in the received document separately with a pattern matching program, a dictionary matching program, a machine learning program, and a deep learning program; calculates an accuracy value for the identified information using preset weights; and, if the calculated accuracy value is at or above a preset reference value, converts the identified information into a replacement string and compiles it, together with the sentence of the received document and the de-identification target word contained in that sentence, into a single list for the target electronic document.

However, with the above configuration, de-identification is not performed within the office program used to write the electronic document, so a separate program must be installed, and the de-identification work cannot be done in the same program right after the document is written; instead, a separate program has to be launched. Moreover, even where multiple computers are connected through a single intranet, as in an office, the separate program must be installed on each computer one by one, which incurs installation cost and time, and maintenance likewise has to be repeated on every computer individually. Accordingly, there is a need for a method that runs within the office program itself without installing or launching a separate program, and that allows the clients in a network connected through a single intranet to be maintained at once.

An embodiment of the present invention can provide a method of providing a VSTO-based electronic document de-identification service in which a de-identification program is developed on VSTO (Visual Studio Tools for Office) so that it runs within the office program itself and performs its function inside the electronic document without a separate program being installed or launched; maintenance is enabled for a plurality of clients connected through a single intranet; and the pattern of each type of personal information is analyzed and the result displayed, with the range of identifiable personal information being set and a de-identification algorithm applied according to the form of the analyzed personal information. This prevents de-identification from failing or being skipped because the format in which personal information is written varies, and thereby minimizes the possibility that personal information is leaked through negligence rather than intent. However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, an embodiment of the present invention includes: a search unit that searches for at least one type of personal information when a de-identification menu is selected in an electronic document including at least one of text, an image, and a hyperlink; a display unit that tallies the number of cases of the found personal information by type and displays the result; a format selection unit that receives a selection of one de-identification format from among at least one de-identification format for de-identifying the found personal information; and a processing unit that de-identifies the at least one type of personal information in the selected format.
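The four units described above can be pictured as a small pipeline. The following is only a minimal sketch in Python rather than an actual VSTO add-in; the two regular expressions, the three format names (`mask`, `delete`, `tag`), and all function names are assumptions made for illustration and do not come from the specification.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns; a real detector would need far more care.
PATTERNS = {
    "resident_registration_number": re.compile(r"\b\d{6}-\d{7}\b"),
    "phone_number": re.compile(r"\b01\d-\d{3,4}-\d{4}\b"),
}

# Hypothetical de-identification formats the user can choose from.
FORMATS = {
    "mask": lambda s: re.sub(r"\d", "*", s),
    "delete": lambda s: "",
    "tag": lambda s: "[REDACTED]",
}

def search(text):
    """Search unit: return (kind, start, end) for every match, in text order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

def count_by_type(hits):
    """Display unit: number of cases per type of personal information."""
    return Counter(kind for kind, _, _ in hits)

def process(text, hits, format_name):
    """Processing unit: rewrite each hit using the selected format."""
    transform = FORMATS[format_name]
    out, cursor = [], 0
    for _, start, end in hits:
        out.append(text[cursor:start])
        out.append(transform(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Here the format selection unit corresponds to choosing a key of `FORMATS`; keeping the formats in a table means a new format can be added without touching the search or processing code.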

According to any one of the above means for solving the problem, a de-identification program is developed on VSTO (Visual Studio Tools for Office) so that it runs within the office program itself and performs its function inside the electronic document without a separate program being installed or launched; maintenance is enabled for a plurality of clients connected through a single intranet; and the pattern of each type of personal information is analyzed and the result displayed, with the range of identifiable personal information being set and a de-identification algorithm applied according to the form of the analyzed personal information. This prevents de-identification from failing or being skipped because the format in which personal information is written varies, and thereby minimizes the possibility that personal information is leaked through negligence rather than intent.

FIG. 1 is a diagram illustrating a system for providing a VSTO-based electronic document de-identification service according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the de-identification service providing device included in the system of FIG. 1.
FIGS. 3 and 4 are diagrams illustrating an example implementation of the VSTO-based electronic document de-identification service according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of providing a VSTO-based electronic document de-identification service according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be implemented in various different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded.

The terms of degree "about", "substantially", and the like, as used throughout the specification, mean at or near a stated numerical value when manufacturing and material tolerances inherent to the stated meaning are presented, and are used to prevent an unscrupulous infringer from unfairly exploiting a disclosure in which exact or absolute figures are stated to aid understanding of the present invention. The phrase "step of (doing) ~" or "step of ~", as used throughout the specification, does not mean "step for ~".

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware. A "unit" is not limited to software or hardware; it may be configured to reside in an addressable storage medium or to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within components and "units" may be combined into a smaller number of components and "units" or further divided into additional components and "units". In addition, components and "units" may be implemented to execute on one or more CPUs in a device or a secure multimedia card.

In this specification, some of the operations or functions described as being performed by a terminal, apparatus, or device may instead be performed by a server connected to that terminal, apparatus, or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal, apparatus, or device connected to that server.

In this specification, some of the operations or functions described as mapping or matching with a terminal may be interpreted as mapping or matching the terminal's unique number or an individual's identification information, which serves as the identifying data of the terminal.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a system for providing a VSTO-based electronic document de-identification service according to an embodiment of the present invention. Referring to FIG. 1, the VSTO-based electronic document de-identification service providing system 1 may include at least one de-identification service providing device 100 and a de-identification service providing server 300. However, the system 1 of FIG. 1 is only one embodiment of the present invention, and the invention is not to be construed as limited by FIG. 1.

The components of FIG. 1 are generally connected through a network 200. For example, as shown in FIG. 1, the at least one de-identification service providing device 100 may be connected to the de-identification service providing server 300 through the network 200, and the de-identification service providing server 300 may be connected to the at least one de-identification service providing device 100 through the network 200.

Here, the network refers to a connection structure that allows information to be exchanged between nodes such as a plurality of terminals and servers. Examples of such a network include a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), 5GPP (5th Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, the Internet, LAN (Local Area Network), Wireless LAN (Wireless Local Area Network), WAN (Wide Area Network), PAN (Personal Area Network), RF (Radio Frequency), Bluetooth networks, NFC (Near-Field Communication) networks, satellite broadcasting networks, analog broadcasting networks, and DMB (Digital Multimedia Broadcasting) networks.

In the following, the term "at least one" is defined to cover both the singular and the plural, and it is self-evident that, even where the term "at least one" is absent, each component may exist in the singular or the plural and may mean either. Whether each component is provided in the singular or the plural may vary depending on the embodiment.

The at least one de-identification service providing device 100 may be a device that uses a web page, app page, program, or application related to the VSTO-based electronic document de-identification service to identify, detect, and de-identify, from within an electronic document, personal information contained in the document's text, images, and hyperlinks. When a menu or icon included in the office program is selected, the de-identification service providing device 100 may extract at least one type of personal information contained in the electronic document, tally and output the number of cases by type, receive a selection of a format for de-identifying the personal information, provide the de-identified text, images, and hyperlinks as a preview, and, when a de-identification confirmation menu or icon is selected, de-identify the personal information in the selected format.
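The preview-then-confirm behavior described above can be sketched as follows. This is an illustrative Python model only: the `Document` class stands in for an open office document, and `mask_digits` is an assumed de-identification format; neither name appears in the specification.

```python
from dataclasses import dataclass

@dataclass
class Document:
    """Hypothetical stand-in for an open office document."""
    body: str

def mask_digits(text):
    """Hypothetical de-identification format: replace every digit with '*'."""
    return "".join("*" if ch.isdigit() else ch for ch in text)

def preview(doc, deidentify):
    """Return the would-be result without touching the document."""
    return deidentify(doc.body)

def apply_if_confirmed(doc, deidentify, confirmed):
    """Commit the change only when the user has confirmed the preview."""
    if confirmed:
        doc.body = deidentify(doc.body)
    return doc
```

Keeping `preview` free of side effects means the document stays unchanged until the confirmation menu or icon is selected.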

Here, the at least one de-identification service providing device 100 may be implemented as a computer that can access a remote server or terminal through a network. The computer may include, for example, a navigation device, or a notebook, desktop, or laptop equipped with a web browser. The at least one de-identification service providing device 100 may also be implemented as a terminal that can access a remote server or terminal through a network. For example, it may be a wireless communication device that ensures portability and mobility, including all kinds of handheld wireless communication devices such as navigation devices, PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), and WiBro (Wireless Broadband Internet) terminals, smartphones, smart pads, and tablet PCs.

The de-identification service providing server 300 may be a server that provides a web page, app page, program, or application for the VSTO-based electronic document de-identification service. In addition, when the de-identification service providing devices 100 are connected through a closed network such as an intranet and an update is needed, the de-identification service providing server 300 may be a server that transmits an update signal to the at least one de-identification service providing device 100 included in that intranet so that it is updated.
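One way to picture the update signal is a server that compares each client's installed version against the latest one and notifies only those that are behind. The sketch below is an assumption-laden illustration: the specification does not describe a version scheme or message format, so the dotted version strings, the `Client` class, and the `("update", version)` message are all hypothetical.

```python
from dataclasses import dataclass, field

def parse_version(v):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

@dataclass
class Client:
    """Hypothetical de-identification device on the intranet."""
    name: str
    version: str
    inbox: list = field(default_factory=list)

def broadcast_update(clients, latest):
    """Send an update signal to every client older than `latest`."""
    notified = []
    for client in clients:
        if parse_version(client.version) < parse_version(latest):
            client.inbox.append(("update", latest))
            notified.append(client.name)
    return notified
```

Comparing parsed tuples rather than raw strings avoids treating "1.9.3" as newer than "1.10.0".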

Here, the de-identification service providing server 300 may be implemented as a computer that can access a remote server or terminal through a network. The computer may include, for example, a navigation device, or a notebook, desktop, or laptop equipped with a web browser.

FIG. 2 is a block diagram illustrating the de-identification service providing device included in the system of FIG. 1, and FIGS. 3 and 4 are diagrams illustrating an example implementation of the VSTO-based electronic document de-identification service according to an embodiment of the present invention.

Referring to FIG. 2, the de-identification service providing device 100 may include a search unit 110, a display unit 120, a format selection unit 130, a processing unit 140, a preview providing unit 150, an image unit 160, and an overlay unit 170.

When the de-identification service providing server 300 according to an embodiment of the present invention, or another server (not shown) operating in conjunction with it, transmits a VSTO-based electronic document de-identification service application, program, app page, web page, or the like to the at least one de-identification service providing device 100, the device may install or open it. The service program may also be run on the at least one de-identification service providing device 100 using a script executed in a web browser. Here, a web browser is a program that makes the World Wide Web (WWW) service available by receiving and displaying hypertext written in HTML (Hyper Text Mark-up Language), and includes, for example, Netscape, Explorer, and Chrome. An application means an application program on a terminal and includes, for example, an app running on a mobile terminal (smartphone).

Referring to FIG. 2, the search unit 110 may search for at least one type of personal information when a de-identification menu is selected in an electronic document including at least one of text, an image, and a hyperlink. Techniques such as natural language processing (NLP), named entity recognition, artificial intelligence, and data mining may be used for this search, but because they are similar to what is described in the prior art or are publicly known, they are not described in detail.

The at least one type of personal information includes, but is not limited to, a resident registration number, telephone number, name, address, credit card number, passport number, driver's license number, and vehicle number, and may vary depending on the embodiment. Whether an item counts as personal information may also vary by embodiment; in one embodiment of the present invention, processing is described by way of example for the personal information items specified as sensitive information, unique identification information, and the like under the law in force at the time of filing, namely Articles 23 and 24 of the Personal Information Protection Act and Article 19 of its Enforcement Decree. This is not limiting, however, and it is self-evident that it may be changed depending on the embodiment.
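Because the same type of personal information can be written in several layouts, a detector that expects one fixed layout may skip valid instances. The following Python sketch shows one way to tolerate separator variations for two of the listed types. The layouts are assumptions made for illustration (six digits plus seven digits for a resident registration number, and an 01X mobile prefix), and check-digit or area-code rules are deliberately omitted.

```python
import re

# Assumed layouts: 6 digits, optional separator, 7 digits (resident registration
# number) and 01X, optional separator, 3-4 digits, optional separator, 4 digits
# (mobile number). Real-world validation rules are omitted.
SEP = r"[\s.\-]?"
RRN = re.compile(rf"(?<!\d)\d{{6}}{SEP}\d{{7}}(?!\d)")
MOBILE = re.compile(rf"(?<!\d)01\d{SEP}\d{{3,4}}{SEP}\d{{4}}(?!\d)")

def find_all(pattern, text):
    """Return every substring of `text` matched by `pattern`."""
    return [m.group(0) for m in pattern.finditer(text)]
```

The lookbehind and lookahead guards keep the patterns from matching a slice of a longer run of digits, which would otherwise produce false positives inside unrelated numbers.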

The National Science and Technology Knowledge Information Service (NTIS) of the Korea Institute of Science and Technology Information conducts privacy impact assessments and manages personal information by assigning grades according to the degree of impact. Table 1 below shows the NTIS personal information impact grades; Grade 1 covers very important personal information items that identify an individual by themselves or when combined.

Table 1

Grade | Classification | Personal information items | Description
Grade 1 | Unique identification information | Resident registration number, passport number, driver's license number, foreigner registration number (*Article 24 of the Personal Information Protection Act and Article 19 of its Enforcement Decree) | Personal information that identifies an individual by itself, is highly sensitive, or whose processing is strictly restricted by relevant laws and regulations
Grade 1 | Sensitive information | Information that may significantly infringe on privacy, such as ideology and beliefs, joining or leaving a trade union or political party, political views, medical history, physical or mental disability, sexual orientation, genetic test information, and criminal record information (*Article 23 of the Personal Information Protection Act and Article 18 of its Enforcement Decree) |
Grade 1 | Authentication information | Passwords, biometric information (fingerprint, iris, vein, etc.) (*Article 7 of the Notice on Standards for Measures to Ensure the Safety of Personal Information) |
Grade 1 | Credit information, financial information | Credit information, credit card number, account number, etc. (*Articles 2 and 19 of the Act on the Use and Protection of Credit Information and Articles 2, 16, and 21 of its Enforcement Decree, etc.) |
Grade 1 | Location information | (*Articles 2 and 16 of the Act on the Protection and Use of Location Information, etc.) |
Grade 2 | Personal identification information | Name, address, telephone number, mobile phone number, email, date of birth, gender, etc. | Personal information that clearly identifies an individual when combined
Grade 2 | Personal related information | Education, occupation, height, weight, marital status, family situation, hobbies, etc. |
Grade 3 | Automatically generated information | IP information, MAC information, site visit records, cookies, etc. | Indirect personal information that provides additional information when combined with personal identification information
Grade 3 | Processed information | Statistical information, subscriber propensity, etc. |
Grade 3 | Limited identification information | Member information, employee number, internal-use personal identification information, etc. |

The description given in the first row of each grade applies to every classification within that grade.

Most unique identification information is classified as Level 1, and other personal information is also subject to protection as Level 2 or Level 3. De-identifying personal information, for example by masking, is important in itself, but it is equally important to select and define in advance which items of personal information are to be masked. A data profiling method may be used to select and define the items to be protected, but the invention is not limited thereto. Data profiling is generally used to surface data quality problems and find improvements through statistical analysis of source data and metadata. In an embodiment of the present invention, it is used to analyze the metadata of structured and unstructured data and to identify the characteristics of the data. Accordingly, in an embodiment of the present invention, information about the data may be extracted through various analyses in order to identify the structure, content, and quality of the data by using data profiling.

Data profiling is similar to data mining or knowledge discovery, which automatically extracts hidden knowledge from the vast amount of information in a database. That is, statistical analysis methods are used to discover error phenomena in the data, and the discrepancies found between management documents and systems are reported so that high-quality data can be maintained. Data profiling consists of a discovery procedure and a verification procedure. The discovery procedure finds inaccurate data phenomena that may be errors, and each finding is then reviewed with the staff in charge to determine whether it is in fact an error.

Here, column analysis measures the data distribution of a specific column, such as Null, Space, valid values, Min, Max, and Rank, and date type analysis examines the distribution of character-type dates that deviate from the standard format. Pattern analysis determines whether values composed of letters, digits, and the like follow a consistent pattern. A regular expression is a special text pattern, usable in applications and programming languages, that expresses strings following a specific rule most effectively. It can be used to check whether given content matches a text pattern, to find text matching a pattern within a larger text, to replace matching text with other text, to rearrange parts of the matching text, or to split text into smaller segments.
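The five uses of regular expressions listed above can be sketched with Python's standard `re` module. The phone-number pattern and the helper names are illustrative assumptions and are not taken from the patent.

```python
import re

# Hypothetical pattern for a hyphenated number such as 010-1234-5678.
PHONE = re.compile(r"(\d{3})-(\d{4})-(\d{4})")


def matches_exactly(value):
    """Check whether the whole value conforms to the pattern."""
    return PHONE.fullmatch(value) is not None


def find_all(text):
    """Find every substring of the text that matches the pattern."""
    return [m.group(0) for m in PHONE.finditer(text)]


def replace_all(text, token="[PHONE]"):
    """Replace each match with another text."""
    return PHONE.sub(token, text)


def rearrange(text):
    """Rearrange the parts of each match using the captured groups."""
    return PHONE.sub(r"(\1) \2 \3", text)


def split_on_matches(text):
    """Split the text into smaller segments around each match."""
    return re.split(r"\d{3}-\d{4}-\d{4}", text)
```

The split helper uses a pattern without capturing groups because `re.split` would otherwise include the captured groups in its result.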

Meanwhile, for each type of personal information data corresponding to metadata, which is structured data, the security target items to be masked must first be selected and defined per data type. To this end, in an embodiment of the present invention, the personal information to be masked is selected as the resident registration number, phone number, email, and address, and the corresponding metadata is masked. However, the invention is not limited thereto, and the selection may vary by embodiment. For example, when resident registration numbers are analyzed by data profiling, a wide variety of notation formats may appear.

Such formats include cases where a letter appears in the middle of the resident registration number, where only part of the rear portion is written, and where the front and rear portions are joined by various separators (hyphen, space, no separator, and so on). Therefore, if resident registration numbers are extracted by guesswork alone, without data profiling, some may be missed. For example, a number containing a letter (ZZZZZZ-CZZZZZZ), a number with only part of the rear portion written (ZZZZZZ-Z), or a number joined by an unusual separator is difficult to extract by guesswork. When the extraction algorithm is designed from the data profiling results, resident registration numbers in all of these notation formats can be extracted and masked, so that the probability of an unmasked number leaking to the outside can be reduced to zero. Likewise, data profiling is applied to phone numbers, emails, and addresses so that every kind of personal identification information can be extracted.
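A minimal sketch of an extraction pattern that covers the variants described above (a letter at the start of the rear portion, a partially written rear portion, and hyphen, space, or no separator). The exact pattern, the choice to keep the six front digits, and the function names are assumptions for illustration; a pattern derived from real profiling results would likely differ.

```python
import re

RRN_VARIANTS = re.compile(
    r"(?<!\d)(\d{6})"
    r"(?:[- ]?[0-9A-Za-z]\d{6}"  # full rear part; separator optional; first character may be a letter
    r"|[- ]\d{1,6})"             # partial rear part; separator required to limit false positives
    r"(?!\d)"
)


def find_rrn(text):
    """Return every resident-registration-number-like string in the text."""
    return [m.group(0) for m in RRN_VARIANTS.finditer(text)]


def mask_rrn(text):
    """Mask everything after the six front digits, keeping any separator."""
    def _mask(match):
        front = match.group(1)
        rest = match.group(0)[6:]
        return front + re.sub(r"[0-9A-Za-z]", "*", rest)

    return RRN_VARIANTS.sub(_mask, text)
```

The lookarounds `(?<!\d)` and `(?!\d)` keep the pattern from matching inside longer digit runs, so a hyphenated phone number such as 010-1234-5678 is left untouched.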

The display unit 120 may tally the number of occurrences of each type of personal information found and display the result. For example, if the type of personal information is an address, it may show how many addresses exist in one electronic document (file). It may also show which types of personal information are present, for example how many addresses and how many resident registration numbers exist. It may further indicate where in the document each item is located.
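The per-type tally and location report described above might look like the following sketch. The three patterns and the `summarize` function are hypothetical placeholders, not the patent's search rules.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns for three types of personal information.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"(?<!\d)01\d-\d{3,4}-\d{4}(?!\d)"),
    "rrn": re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)"),
}


def summarize(text):
    """Return (count per type, character spans per type) for one document."""
    counts = Counter()
    locations = {}
    for kind, pattern in PATTERNS.items():
        spans = [m.span() for m in pattern.finditer(text)]
        if spans:
            counts[kind] = len(spans)
            locations[kind] = spans
    return counts, locations
```

Returning character spans alongside the counts lets a user interface highlight each hit, which corresponds to indicating where the information is located.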

The format selection unit 130 may receive a selection of one of at least one de-identification format for de-identifying the personal information found. The de-identification formats may include, but are not limited to, deletion, partial deletion, replacement after blanking, and noise addition. Aggregation may also be available. Aggregation shows the total of the data so that individual values are not visible. For example, a set of student personal information (Kim Pengsu 180 cm, Park Dongbang 170 cm, Lee Shinki 160 cm, Choi Yunho 150 cm) can be presented as a total height of 660 cm or an average height of 165 cm. An important caveat is that disclosing the attribute information of a group made up of individuals sharing a specific attribute has the same effect as disclosing the information of the individuals in that group. Such disclosure cannot be regarded as de-identification, so the attribute information of the group may also be de-identified.
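The aggregation example above can be reproduced in a few lines. The `aggregate` helper and the romanized names are illustrative only.

```python
def aggregate(values):
    """Replace individual values with their count, total, and mean."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    total = sum(values)
    return {"count": len(values), "total": total, "mean": total / len(values)}


# Heights in centimetres from the example in the text.
heights_cm = {
    "Kim Pengsu": 180,
    "Park Dongbang": 170,
    "Lee Shinki": 160,
    "Choi Yunho": 150,
}
summary = aggregate(heights_cm.values())
```

Only `summary` would be released; the per-student dictionary stays private.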

Data reduction deletes, according to the purpose of sharing or opening the data, values in the data set that are unnecessary or that are significant for identifying an individual. For example, the personal information [Kim Pengsu, age 10, living in Seoul, graduate of Antarctic University] can be reduced to (age 10, living in Seoul). Likewise, the personal information (resident registration number 100808-1234567) can be reduced to (born in the 2010s, male). In addition, date information related to an individual (date of qualification, date of passing an examination, and so on) may be reduced to the year.

Categorization (Data Suppression) hides exact values by converting data values into category values. For example, the personal information [Kim Pengsu, age 10] can be hidden as (surname Kim, age 1-10). Some surnames are unusual. The rare surname "Pyeong", for instance, is held by only about 500 people nationwide, and when a rare surname is combined with Level 2 personal information such as age and region, the individual becomes considerably easier to single out. Accordingly, rare surnames may be stored in a database, and a surname that is rarer than others may be replaced with a different surname. Data masking hides key personal identifiers that are likely to contribute to identifying an individual when combined with public information, so that the individual cannot be identified. For example, the personal information [Kim Pengsu, age 10, living in Seoul, attending Antarctic University] can be protected as (Kim **, age 10, living in Seoul, attending ** University). The important point here is that the rule used to substitute other values must not be exposed, since an exposed rule would make it easy to work backward and identify the individual.
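The categorization, rare-surname substitution, and masking steps described above can be sketched as follows. The ten-year bucket boundaries follow the "age 1-10" example; the rare-surname set, the replacement surname, and the assumption of a one-character surname are hypothetical.

```python
RARE_SURNAMES = {"평"}  # hypothetical lookup table of rare surnames


def categorize_age(age, width=10):
    """Map an exact age (1 or older) to a range such as '1-10' or '11-20'."""
    if age < 1:
        raise ValueError("age must be 1 or greater in this sketch")
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def generalize_surname(full_name, replacement="김"):
    """Swap a rare one-character surname for a common one (assumed policy)."""
    if full_name[:1] in RARE_SURNAMES:
        return replacement + full_name[1:]
    return full_name


def mask_given_name(full_name, surname_len=1):
    """Keep the surname and replace each given-name character with '*'."""
    return full_name[:surname_len] + "*" * (len(full_name) - surname_len)
```

For instance, `mask_given_name("김펭수")` yields `"김**"`, which matches the masked form in the example.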

The processing unit 140 may de-identify at least one type of personal information in the selected format. The processing unit 140 may also perform pseudonymization. In this case, the processing unit 140 may be configured so that each value produced by pseudonymizing an item of personal information is unique (distinguishability), so that the original information cannot be inferred from the pseudonymized value, and so that high availability is achieved because the original does not exist in the system. The at least one item of information corresponding to previously stored personal identification information may be information to which data masking, an anonymization technique, has been applied.

For the anonymization technique (data masking) to also satisfy the conditions for pseudonymization, each value produced by masking must be unique (distinguishability), and the original information must not be inferable from the masked value. In addition, dynamic data masking that cannot be reversed even inside the pages, programs, and applications that access the personal information must be applied, and high service availability must be provided.

Pseudonymization may be performed by at least one of heuristic pseudonymization, encryption, and swapping, but is not limited thereto. Pseudonymization may also be performed alone or in combination with any one of, or a combination of, aggregation, data reduction, data suppression (categorization), and data masking. After pseudonymization is performed, a privacy model that quantitatively evaluates the risk of privacy exposure may be run on the pseudonymized information.

Based on the output of the privacy model, it is checked whether the pseudonymized information has been de-identified to at least a preset probability level. If it has not, the information is sent to an administrator terminal with a request for feedback. If it has, the information may be classified as input data for building big data. The privacy model may be any one of, or a combination of, k-anonymity, l-diversity, and t-closeness, but is not limited thereto.
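As one possible reading of this check, the sketch below measures k-anonymity over a set of quasi-identifier columns and routes the data set accordingly. Expressing the "preset probability level" as a required k (re-identification probability of at most 1/k) is an assumption, and the function names and return labels are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def route(records, quasi_identifiers, k_required):
    """Classify a pseudonymized data set by whether it reaches the required k."""
    if k_anonymity(records, quasi_identifiers) >= k_required:
        return "big_data_input"
    return "admin_feedback"
```

A data set whose smallest group has a single record has k = 1, so any requirement of k of 2 or more sends it back for administrator feedback.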

To prevent de-identified data from being re-identified, a differential privacy model may additionally be used. This model was proposed by C. Dwork to address the weaknesses of k-anonymity and l-diversity, and it limits identifiability through probabilistic perturbation of the records themselves rather than through simple changes to numbers. The basic goal of the differential privacy model is to build a system in which (1) the result obtained by applying a differentially private algorithm to a data set that does not contain information about a particular person cannot be distinguished from (2) the result obtained from a data set that does contain that person's information. To achieve this goal, a precisely calculated amount of noise is added to the statistical records to remove individual identifiability. To protect sensitive information, the differential privacy model systematically injects random values, and these random values act as noise. With this noise inserted, essentially the same output is produced regardless of whether a data set contains information about a particular person.

Data anonymization techniques have been shown to be vulnerable to the minimality attack when the attacker has prior knowledge, whereas differential privacy can protect each individual's information probabilistically regardless of the attacker's prior knowledge. Let D1 and D2 be two neighboring databases that differ in only one record, and let S be the result of a query taking them as input. An arbitrary function A is defined as providing ε-differential privacy when it satisfies Equation 1 below.
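Equation 1 is not reproduced in this text. The standard statement of ε-differential privacy, which is consistent with the definition above and is given here as an assumption about what the equation contains, is the following, where S ranges over sets of possible outputs of A:

```latex
\Pr\left[A(D_1) \in S\right] \;\le\; e^{\varepsilon} \cdot \Pr\left[A(D_2) \in S\right]
```

A smaller ε forces the two output distributions closer together and therefore gives stronger protection.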

Among the techniques that satisfy this definition of differential privacy, the Laplace mechanism, which is widely used for numeric data, generates random numbers from a Laplace distribution and adds them to the existing data. The Laplace distribution is given by Equation 2 below.
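Equation 2 is likewise not reproduced in this text. The standard probability density function of the Laplace distribution with location μ and scale b, assumed here to be what the equation shows, is:

```latex
\mathrm{Lap}(x \mid \mu, b) \;=\; \frac{1}{2b}\,\exp\!\left(-\frac{\lvert x-\mu\rvert}{b}\right)
```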

The Laplace mechanism uses a Laplace distribution with μ=0 and b=Δf/ε, where Δf is the maximum difference between the query results on the two neighboring databases D1 and D2.
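A minimal sketch of the Laplace mechanism with μ=0 and b=Δf/ε, using only the Python standard library. The function names are hypothetical. The noise is drawn as the difference of two independent exponential variables, which follows a Laplace distribution with the same scale.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale)."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace noise with scale b = sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    if rng is None:
        rng = random.Random()
    return true_value + laplace_noise(sensitivity / epsilon, rng)
```

For a counting query, adding or removing one record changes the result by at most 1, so `sensitivity=1` is a common choice under that neighboring-database convention.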

Next, the differentially private k-means algorithm (hereinafter, EUGkM) is described. A K-means algorithm that satisfies differential privacy can be built on a histogram: the data is divided into a number of grid cells, and the number of data points in each cell is recorded as a histogram. The Laplace mechanism is then applied to the point count of each cell, and the K-means algorithm is run on the center points of the cells. When the center of each cluster is computed, each cell is weighted in proportion to the number of points it holds. The optimal number of grid cells m is calculated from the privacy level ε of the EUGkM algorithm, the total number of data points N, and the data dimension d according to Equation 3 below, where θ is a value specified in advance by the user.
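The histogram-based procedure described above can be sketched in pure Python as follows. This is an illustration under stated assumptions and not the EUGkM reference implementation: data is assumed to lie in [0, 1)^d, the number of cells per dimension is passed in directly rather than derived from Equation 3, negative noisy counts are clamped to zero, and initial centroids are supplied by the caller.

```python
import itertools
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid_histogram(points, cells_per_dim, epsilon, rng, sensitivity=1.0):
    """Count points per grid cell in [0, 1)^d and add Laplace noise to every cell."""
    dim = len(points[0])
    counts = {}
    for p in points:
        cell = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        counts[cell] = counts.get(cell, 0) + 1
    scale = sensitivity / epsilon
    return {
        cell: counts.get(cell, 0) + laplace(scale, rng)
        for cell in itertools.product(range(cells_per_dim), repeat=dim)
    }


def weighted_kmeans(hist, cells_per_dim, centroids, iterations=10):
    """Lloyd's algorithm on cell centres, weighted by clamped noisy counts."""
    items = [
        (tuple((i + 0.5) / cells_per_dim for i in cell), max(weight, 0.0))
        for cell, weight in hist.items()
    ]
    centroids = [tuple(c) for c in centroids]
    dim = len(centroids[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centroids]
        totals = [0.0] * len(centroids)
        for point, weight in items:
            nearest = min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
            )
            totals[nearest] += weight
            for t, value in enumerate(point):
                sums[nearest][t] += weight * value
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(len(centroids))
        ]
    return centroids
```

Noise is added to empty cells as well as occupied ones; otherwise the set of occupied cells would itself reveal information about the data.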

In the EUGkM algorithm, the Laplace mechanism is applied within each grid cell independently, with no interaction between cells. Accordingly, an embodiment of the present invention may use Hadoop MapReduce to distribute and parallelize the process of building the histogram with the Laplace mechanism. The method of preventing back-tracing so that re-identification is impossible consists of two MapReduce phases. In the first phase, a histogram with differential privacy applied is generated in parallel. In the second phase, each cell of the histogram is treated as a data point and K-means clustering is run in parallel.

First, the histogram generation process is described. The mapper divides the space containing all of the data into the number of grid cells obtained from Equation 3. It then determines the cell to which each data point belongs, groups the cells into partitions that each contain several cells, and sends them to the reducer. For a partition that contains no data points, only a single dummy cell is transmitted. The DPHist-MR.reduce function applies the Laplace mechanism to the data count of every cell within the data region of each partition, computes the center point of each cell, and emits the result. Cells whose point count is 0 are not received from the mapper but are generated inside the reduce function, which improves the benefit of distributed parallel processing. In the second phase, K-means clustering treats the center point of each cell as a single data item and performs clustering. In K-means clustering with MapReduce, the map step may be implemented to find the nearest cluster center for each point, and the reduce step to compute a new center for each cluster.
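The first phase can be mimicked in a single process to show the division of work between map and reduce. This sketch does not use Hadoop; the partitioning rule (one partition per index along the first dimension), the function names, and the use of an empty list in place of the dummy cell are all assumptions made for illustration.

```python
import itertools
import random
from collections import defaultdict


def map_phase(points, cells_per_dim):
    """Mapper: emit (partition, cell) for each point in [0, 1)^d."""
    for p in points:
        cell = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        yield cell[0], cell  # assumed rule: partition by first-dimension index


def shuffle(pairs, cells_per_dim):
    """Group mapper output by partition; empty partitions still reach a reducer."""
    grouped = {part: [] for part in range(cells_per_dim)}
    for part, cell in pairs:
        grouped[part].append(cell)
    return grouped


def reduce_phase(part, cells, cells_per_dim, dim, epsilon, rng):
    """Reducer: noisy count and centre for every cell of the partition, including empty ones."""
    counts = defaultdict(int)
    for cell in cells:
        counts[cell] += 1
    scale = 1.0 / epsilon  # assumes count sensitivity of 1
    output = []
    for rest in itertools.product(range(cells_per_dim), repeat=dim - 1):
        cell = (part,) + rest
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        centre = tuple((i + 0.5) / cells_per_dim for i in cell)
        output.append((centre, counts[cell] + noise))
    return output


def dp_histogram(points, cells_per_dim, epsilon, seed=0):
    """Run map, shuffle, and reduce sequentially and collect all noisy cells."""
    dim = len(points[0])
    rng = random.Random(seed)
    grouped = shuffle(map_phase(points, cells_per_dim), cells_per_dim)
    result = []
    for part, cells in grouped.items():
        result.extend(reduce_phase(part, cells, cells_per_dim, dim, epsilon, rng))
    return result
```

Because each reducer enumerates its own empty cells, zero-count cells never cross the network, which corresponds to the point made above about generating them inside the reduce function.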

Building a histogram that satisfies differential privacy and then running K-means clustering on it keeps the amount of inserted noise relatively small, but it takes a long time to process on a single machine as the histogram grows. The approach above resolves this problem. Even when the histogram is enlarged, that is, even when a large amount of noise is inserted to make the data protection resistant to attack, the noise is inserted in parallel, so privacy protection for a large volume of data can be completed quickly. Ultimately, the histogram can be built so that differential privacy is guaranteed when the K-means clustering algorithm is run.

The preview providing unit 150 may receive a selection of which types of personal information found by the search unit are to be de-identified, and may provide a preview in which the selected types have been de-identified. If replacement after deletion, for example masking, is selected as the de-identification format, the electronic document as it would appear after masking can be output, and the preview is provided so that the user can confirm whether to make the change. If the user confirms or approves, the masked result is displayed and saved in that state. Here, the electronic document is a document created with an office program, and the de-identification function is developed on VSTO (Visual Studio Tools for Office) so that it can run inside the office program. The office program refers to Microsoft Office.

There may be a plurality of de-identification service providing devices 100. When the plurality of devices form a closed network over an intranet and an update occurs, the update may be distributed to all of the de-identification service providing devices 100 in the intranet and applied in a batch. This is possible because one intranet is managed as a single user, and it may be implemented through IaaS (Infrastructure as a Service).

When the electronic document contains an image, the image unit 160 may de-identify any text in the image that corresponds to personal information and may apply mosaic processing when a photograph in the image is a portrait. When the image contains a portrait, in particular a face, the image unit 160 may either replace the face with a virtual face image generated by a GAN or apply mosaic processing. The image unit 160 obtains highly accurate coordinates of the face region by using MTCNN (Multi-Task Cascaded Convolutional Neural Network). MTCNN locates faces by applying CNNs to a single image in three stages. The first-stage CNN estimates the approximate position of the face, and progressively finer CNNs then verify the detected position.
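Mosaic processing of a detected face region can be sketched as block averaging. In this illustration the image is a plain list of rows of grayscale integers and the bounding box is assumed to come from a face detector such as MTCNN; the `mosaic` function and its parameters are hypothetical.

```python
def mosaic(image, box, block=8):
    """Return a copy of the image with the box region pixelated.

    image: list of rows of grayscale integers.
    box: (top, left, bottom, right) with bottom and right exclusive.
    block: side length of each mosaic tile in pixels.
    """
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for y0 in range(top, bottom, block):
        for x0 in range(left, right, block):
            ys = range(y0, min(y0 + block, bottom))
            xs = range(x0, min(x0 + block, right))
            # Every pixel in the tile takes the tile's average intensity.
            average = sum(image[y][x] for y in ys for x in xs) // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = average
    return out
```

A larger `block` removes more detail. Pixels outside the box and the input image itself are left unchanged.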

Next, the image unit 160 performs face recognition, which distinguishes whether a detected face belongs to a person or an animal. If only a similar arrangement of eyes, nose, and mouth were checked, the face of a chimpanzee, for example, could also end up being mosaicked. Various known algorithms can be used for face recognition, so the present invention is not limited to any one of them. The image unit 160 then proceeds to generate a fake face. A mosaic hides a face simply, but the overall outline of the face and the positions of the eyes, nose, and mouth can still be roughly perceived, so the person or an acquaintance can often tell who it is. For example, in an electronic document summarizing interviews with victims of sex crimes, the victims' identities must be thoroughly concealed, and a mosaic alone cannot rule out the risk of exposure.

Accordingly, the image unit 160 may generate a new virtual face in which the facial attributes of the real person are altered, using the StarGAN (Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation) model. To train the network, a data set of celebrity faces from countries around the world may be used, and the data set may consist of photographs. In each photograph, attributes related to the face, such as hair color, face shape, gender, and presence of glasses, may be annotated with binary values (-1, 1). The model may also be trained on only those attributes that make the original face unrecognizable, namely hair color, beard, and glasses. A set of images sharing the same attribute value is defined as one domain.

DIAT (Deep Identity-Aware Transfer of Facial Attributes) and CycleGAN (Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks), which use a cross-domain model, require a separate generator to be trained independently for each target domain in order to translate an image from one domain into several others. In contrast, StarGAN, which is used in an embodiment of the present invention, operates with a single generator that connects multiple domains, so one generator can translate an image into several domains. Because altering two or more attributes at once protects portrait rights more reliably, the StarGAN model is used.

StarGAN is a network in which two modules with opposing objectives compete with each other to generate images, and it consists of a generator and a discriminator. The generator produces fake images, and the discriminator distinguishes generated images from real ones. The generator aims to produce fake images realistic enough that the discriminator cannot tell them apart, while the discriminator aims to tell generated images from real ones. A generator trained in this way can produce fake images so realistic that the discriminator cannot determine whether they are real or generated. The network may be trained with the Wasserstein GAN loss. Transforming a face with StarGAN trained in this way protects portrait rights and removes the residual chance of recognition that remains with mosaic processing.

When the electronic document contains a hyperlink and the page output by accessing the hyperlink contains at least one text corresponding to personal information, the overlay unit (170) may extract at least one coordinate and region where that text is located and overlay a black box that covers the extracted coordinate and region. When the hyperlink is not a link that moves within the electronic document but an external link, such as a website, and that external link contains personal information, the device (100) of the present invention, being a third party, has no way to change the linked content. Therefore, the location of the personal information on the linked page is extracted and, to mask that location and region, a transparent layer matching the size of the website (width, height, resolution) is placed over the website, with black boxes included in the layer to cover the location and region. Of course, the personal information may still be exposed if the hyperlink address is entered into another browser, but because the obligations under the GDPR have been fulfilled, liability for exposure of personal information through intent or negligence can be avoided.
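The transparent layer with black boxes can be sketched as a function that turns extracted regions into an overlay. The example below is only an illustration under stated assumptions: the `Region` class, the function name, and the choice of SVG as the layer format are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Pixel rectangle where personal information was found (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def build_overlay_svg(page_width, page_height, regions):
    """Return an SVG layer sized to the page with one black box per region.

    An SVG without a background fill is transparent, so everything
    outside the black boxes remains visible.
    """
    rects = "".join(
        f'<rect x="{r.x}" y="{r.y}" width="{r.width}" height="{r.height}" fill="black"/>'
        for r in regions
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{page_width}" height="{page_height}" '
        f'style="position:absolute;top:0;left:0;pointer-events:none">'
        f"{rects}</svg>"
    )


# Two hypothetical regions on a 1280x720 page.
overlay = build_overlay_svg(1280, 720, [Region(100, 200, 180, 24), Region(100, 260, 220, 24)])
print(overlay)
```

How the regions are located on the rendered page (for example, by text layout analysis or OCR) is outside this sketch.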

Hereinafter, the operation according to the configuration of the de-identification service providing server of FIG. 2 described above is explained in detail with reference to FIGS. 3 and 4 as examples. However, this embodiment is only one of various embodiments of the present invention, and the invention is clearly not limited to it.

Referring to FIG. 3, (a) the de-identification service providing server (300) may build a database that links at least one type of personal information, the personal information format for each type, and the format used to de-identify it. (b) So that the de-identification service providing server (100) can run this within an electronic document, the function is developed on MS Office and included as one of the MS Office menus. When the user writes a document and presses the de-identification button, the user is asked to select a de-identification format, and (c) text, images, and hyperlinks are each de-identified through their respective processes. (d) shows the de-identification method used when the hyperlink is an external link. Here, besides overlaying a black box, de-identified data (pseudonymized or anonymized) produced by the de-identification process described above may also be overlaid.
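The database of step (a), which links each type of personal information to its format and to the available de-identification formats, can be sketched as a small registry. The regular expressions, the registry layout, and the format names below are assumptions made for illustration; the patent does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Pattern


def mask_digits(value: str, keep: int) -> str:
    """Keep the first `keep` characters and separators, replace later digits with '*'."""
    return "".join(
        ch if index < keep or not ch.isdigit() else "*"
        for index, ch in enumerate(value)
    )


@dataclass(frozen=True)
class InfoType:
    """One registry row: how a type looks and how it may be de-identified."""
    pattern: Pattern[str]
    formats: Dict[str, Callable[[str], str]]


# Hypothetical registry with two personal information types.
REGISTRY: Dict[str, InfoType] = {
    "phone_number": InfoType(
        pattern=re.compile(r"\b01\d-\d{3,4}-\d{4}\b"),
        formats={"delete": lambda v: "", "mask": lambda v: mask_digits(v, 3)},
    ),
    "resident_registration_number": InfoType(
        pattern=re.compile(r"\b\d{6}-\d{7}\b"),
        formats={"delete": lambda v: "", "mask": lambda v: mask_digits(v, 6)},
    ),
}


def apply_format(info_type: str, format_name: str, value: str) -> str:
    """Look up the de-identification format for a type and apply it to one value."""
    return REGISTRY[info_type].formats[format_name](value)


print(apply_format("phone_number", "mask", "010-1234-5678"))
print(apply_format("resident_registration_number", "mask", "900101-1234567"))
```

Keeping the pattern and the formats in one row means a new personal information type can be added without changing the code that applies a format.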

In addition, as shown in FIG. 4(a), the number of items to be de-identified and their types are displayed; as shown in (b), a preview of the result for each de-identification format is provided; and (c) at least one client connected through an intranet does not update individually. Instead, once the client is connected to the de-identification service providing server (300) and its organization or company has been verified, all computers on the intranet can be updated in a batch, which simplifies maintenance.

Matters not described for the method of providing the VSTO-based electronic document de-identification service of FIGS. 2 to 4 are the same as, or can be readily inferred from, the description of the method given above with reference to FIG. 1, and their description is therefore omitted.

FIG. 5 is a diagram illustrating a process in which data is transmitted and received between the components included in the VSTO-based electronic document de-identification service providing system of FIG. 1 according to an embodiment of the present invention. An example of this process is described below with reference to FIG. 5, but the present application is not to be construed as limited to this embodiment, and it is apparent to those skilled in the art that the process shown in FIG. 5 may be changed according to the various embodiments described above.

Referring to FIG. 5, when a de-identification menu is selected in an electronic document including at least one of text, an image, and a hyperlink, the de-identification service providing device searches for at least one type of personal information (S5100).

The de-identification service providing device then counts the personal information found by type and displays the result (S5200), and receives a selection of one of at least one de-identification format for de-identifying the personal information found (S5300).

The de-identification service providing device then de-identifies the at least one type of personal information in the selected format (S5400).

The order of the steps described above (S5100 to S5400) is only an example, and the invention is not limited to it. That is, the order of the steps (S5100 to S5400) may be changed, and some of the steps may be executed simultaneously or omitted.
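Steps S5100 to S5400 can be sketched as four small functions. This is a minimal illustration that handles text only; the patterns, function names, and the `[REDACTED]` formatter are assumptions and are not the implementation of the device.

```python
import re
from collections import Counter

# Hypothetical patterns for two personal information types.
PATTERNS = {
    "phone_number": re.compile(r"\b01\d-\d{3,4}-\d{4}\b"),
    "resident_registration_number": re.compile(r"\b\d{6}-\d{7}\b"),
}


def search(text):
    """S5100: find personal information and return (type, value) pairs."""
    return [
        (kind, match.group())
        for kind, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def count_by_type(findings):
    """S5200: count the findings per personal information type."""
    return Counter(kind for kind, _ in findings)


def select_format(name):
    """S5300: return the formatter for the chosen de-identification format."""
    formats = {"delete": lambda value: "", "redact": lambda value: "[REDACTED]"}
    return formats[name]


def process(text, findings, formatter):
    """S5400: replace every finding with its de-identified form."""
    for _, value in findings:
        text = text.replace(value, formatter(value))
    return text


document = "Kim: 010-1234-5678, Lee: 010-9876-5432, ID 900101-1234567"
found = search(document)
print(count_by_type(found))
print(process(document, found, select_format("redact")))
```

In the device, the counts from S5200 would be shown to the user before a format is chosen; here the choice is passed in directly as a string.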

Matters not described for the method of providing the VSTO-based electronic document de-identification service of FIG. 5 are the same as, or can be readily inferred from, the description of the method given above with reference to FIGS. 1 to 4, and their description is therefore omitted.

The method of providing a VSTO-based electronic document de-identification service according to the embodiment described with reference to FIG. 5 may also be implemented in the form of a recording medium containing computer-executable instructions, such as an application or program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The method of providing a VSTO-based electronic document de-identification service according to the embodiment of the present invention described above may be executed by an application installed on a terminal by default (which may include a program included in a platform or operating system preinstalled on the terminal), or by an application (that is, a program) that a user installs directly on the master terminal through an application providing server such as an application store server or a web server related to the application or the service. In this sense, the method according to the embodiment described above may be implemented as an application (that is, a program) installed on the terminal by default or installed directly by the user, and may be recorded on a computer-readable recording medium such as that of the terminal.

The above description of the present invention is given for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

In the VSTO (Visual Studio Tools for Office)-based electronic document de-identification service providing device,
A search unit that searches for at least one type of personal information when a de-identification menu is selected in an electronic document including at least one of text, an image, and a hyperlink;
A display unit that counts the searched at least one type of personal information by type and displays the result;
A format selection unit that receives a selection of one of at least one de-identification format for de-identifying the searched at least one type of personal information;
A processing unit that de-identifies the at least one type of personal information in the selected format;
An image unit that, when an image is included in the electronic document, de-identifies text included in the image if that text corresponds to personal information, and performs mosaic processing if a photo included in the image is a portrait photo; and
An overlay unit that, when a hyperlink is included in the electronic document and at least one text corresponding to personal information exists in the page output by accessing the hyperlink, extracts at least one coordinate and region in which the at least one text exists and overlays a black box to cover the extracted at least one coordinate and region;
Wherein the electronic document is a document created with an office program,
The de-identification is developed based on VSTO and runs within the office program,
There are a plurality of the de-identification service providing devices,
When the plurality of de-identification service providing devices form a closed network through an intranet and an update occurs, the update is distributed to the plurality of de-identification service providing devices included in the intranet so that they are updated in a batch,
The search unit
Uses a data profiling technique to select and define in advance which items of personal information are to be de-identified,
The processing unit
Performs pseudonymization of the at least one type of personal information, in which the values produced by the pseudonymization are set to be unique (distinguishability) and set so that the original of at least one piece of information cannot be inferred from them, and in which the pseudonymization is performed by any one of, or a combination of at least one of, aggregation, data reduction, data suppression, and data masking; after the pseudonymization, runs a privacy model that quantitatively evaluates the risk of privacy exposure for at least one piece of information; checks, from the result of running the privacy model, whether the at least one pseudonymized piece of information has been de-identified at or above a preset probability level; if it has not, transmits it to a manager terminal to request feedback; and if it has, classifies it as input data for building big data,
The image unit,
When a face photo is included in the image, converts it into a virtual face image generated using a GAN (Generative Adversarial Network) or applies mosaic processing, and obtains highly accurate face-region coordinates using an MTCNN (Multi-Task Cascaded Convolutional Neural Network),
VSTO-based electronic document de-identification service providing device.
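The check in claim 1 that pseudonymized information meets a preset probability level can be illustrated with a simple privacy model. The sketch below is an assumption, not the claimed model: it uses aggregation of ages into ten-year bands and a k-anonymity style risk (one divided by the smallest group size), and all function names, field names, and thresholds are hypothetical.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Aggregation: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(tuple(record[q] for q in quasi_identifiers) for record in records)
    return 1 / min(groups.values())


def review(records, quasi_identifiers, max_risk):
    """Route the data set: usable as big data input, or sent back for manager feedback."""
    risk = max_reidentification_risk(records, quasi_identifiers)
    return "big_data_input" if risk <= max_risk else "manager_feedback"


# Hypothetical records after pseudonymization by aggregation.
raw = [(31, "Seoul"), (34, "Seoul"), (38, "Seoul"), (42, "Busan"), (47, "Busan")]
records = [{"age_band": generalize_age(age), "city": city} for age, city in raw]

print(max_reidentification_risk(records, ["age_band", "city"]))
print(review(records, ["age_band", "city"], max_risk=0.5))
print(review(records, ["age_band", "city"], max_risk=0.25))
```

With these records the smallest group has two members, so the worst-case risk is 0.5; the data set passes a threshold of 0.5 and is sent for manager feedback at a stricter threshold of 0.25.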

delete

The device of claim 1,
The at least one type of personal information includes a resident registration number, a phone number, a name, an address, a credit card number, a passport number, a driver registration card number, and a vehicle number,
The at least one de-identification format includes deletion, partial deletion, post-space replacement, and noise addition,
The resident registration number is analyzed by the data profiling technique so that it can be extracted even when a character is inserted in the middle of the resident registration number, when only part of the back portion of the resident registration number is written, or when the front and back portions of the resident registration number are joined by a separator,
VSTO-based electronic document de-identification service providing device.
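The variant forms of the resident registration number named in this claim can be illustrated with a tolerant pattern. The regular expression below is an assumption for illustration only: it relies on the front portion being a six-digit birth date (YYMMDD), allows up to three non-digit characters between the two portions, and accepts a back portion of one to seven digits. A real profiler would need further checks to limit false positives.

```python
import re

# Front part: YYMMDD with a plausible month and day.
# Middle: zero to three non-digit characters (hyphen, space, slash, and so on).
# Back part: one to seven digits, so partially written numbers still match.
RRN_PATTERN = re.compile(
    r"(?<!\d)(\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01]))"
    r"\D{0,3}"
    r"(\d{1,7})(?!\d)"
)


def find_rrn(text):
    """Return (front, back) pairs for every resident registration number candidate."""
    return [(m.group(1), m.group(2)) for m in RRN_PATTERN.finditer(text)]


samples = [
    "900101-1234567",    # standard form with a hyphen separator
    "900101 - 1234567",  # extra characters in the middle
    "900101-1",          # only part of the back portion written
    "900101/1234567",    # different separator
]
for sample in samples:
    print(sample, find_rrn(sample))
```

Capturing the front and back portions separately lets a later step mask only the back portion, which corresponds to the partial deletion format.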

delete

The device of claim 1,
The de-identification service providing device,
Further comprising: a preview providing unit that receives a selection of a type to be de-identified from among at least one type of personal information searched by the search unit and provides a preview in which the selected type of personal information is de-identified;
VSTO-based electronic document de-identification service providing device.

delete