KR102533008B1

KR102533008B1 - Method for detecting private information and measuring data exposure possibility from unstructured data

Info

Publication number: KR102533008B1
Application number: KR1020220189627A
Authority: KR
Inventors: 김태종
Original assignee: 월드버텍 주식회사
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-05-17
Also published as: KR102533008B9

Abstract

The present invention relates to a method for detecting personal information in unstructured data and measuring its exposure risk. The method comprises: building, by an artificial intelligence modeler, a personal information detection model by providing training data to a preset personal information detection algorithm; receiving, by the personal information detection model, unstructured data and detecting personal information in it; clustering, by a risk analysis unit, the detected personal information by subject, based on the detected personal information and the corresponding subject information; and quantifying the personal information exposure risk for each cluster. By using named entity recognition, the method can detect personal information written in altered forms that deviate from the usual format, and by combining individual items of personal information that each have a low likelihood of identifying a person, it can calculate the exposure risk for each subject of the personal information.

Description

Method for detecting personal information from unstructured data and measuring exposure risk

The present invention relates to a method for detecting personal information in unstructured data and measuring its exposure risk. More specifically, it relates to a method that uses a personal information detection model generated through machine learning to cluster personal information by subject, based on the personal information and the corresponding subject information, and to quantify the personal information exposure risk for each cluster.

In today's highly developed society, personal information is becoming increasingly important, and hacking techniques for stealing it are growing more sophisticated by the day. With structured data, the presence of personal information is relatively explicit, so preventive measures against hacking are comparatively easy to apply. With unstructured data, however, it is difficult even to determine whether personal information is present. Here, unstructured data is data that has no specific schema or that may come in various formats, which makes it hard to search for specific items. It may refer to text data including messages in mobile messengers such as KakaoTalk, LINE, and NateOn, and posts and comments on social networking services such as Instagram, Facebook, and Naver Cafe. In particular, as users have become more sensitive to privacy infringement, there has been a sharp increase in cases where personal information entered in mobile messengers and in social media posts and comments is altered in various ways, for example by replacing digits with Korean characters, English letters, or special characters, as in "공1공-1234-오6칠8" or "o1o-①234-5육78". Existing personal information detection technologies rely mainly on rule-based detection, so it is becoming difficult to detect personal information in mobile messengers and in social media posts and comments, where text is frequently altered and abbreviated.
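The limitation of rule-based detection described above can be illustrated with a minimal sketch. The regular expression and the function name `find_phone_numbers` are assumptions made for illustration only and are not taken from this specification; the sketch simply shows that a digit-only pattern matches the standard form but neither of the altered examples.

```python
import re

# Hypothetical rule (not from the specification): a mobile number written
# with ASCII digits only, e.g. 010-1234-5678.
PHONE_RULE = re.compile(r"01[016789]-\d{3,4}-\d{4}", re.ASCII)


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring that matches the rule-based pattern."""
    return PHONE_RULE.findall(text)


samples = ["010-1234-5678", "공1공-1234-오6칠8", "o1o-①234-5육78"]
for sample in samples:
    # The two altered forms produce an empty list.
    print(sample, "->", find_phone_numbers(sample))
```

A learned named entity recognizer, as proposed in this specification, is intended to cover such variants without enumerating every substitution as a rule.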

Registered Patent No. 10-1727139 (registration date: April 10, 2017)
Registered Patent No. 10-2196508 (registration date: December 22, 2020)

The present invention was devised to address the problems described above. An object of the present invention is to provide a method for detecting personal information in unstructured data and measuring its exposure risk by using named entity recognition.

Another object, according to an embodiment of the present invention, is to provide improved personal information detection performance by also taking speaker and dialogue act information into account.

Another object, according to an embodiment of the present invention, is to provide a method for measuring the degree of personal information exposure risk in consideration of the subject to which each identified item of personal information refers.

To achieve the above objects, a method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention comprises: building, by an artificial intelligence modeler, a personal information detection model by providing training data to a preset personal information detection algorithm; receiving, by the personal information detection model, unstructured data and detecting personal information; clustering, by a risk analysis unit, the detected personal information by subject, based on the detected personal information and the corresponding subject information; and quantifying, by the risk analysis unit, the personal information exposure risk for each cluster.

According to an embodiment, building the personal information detection model by the artificial intelligence modeler comprises: a training data construction step of collecting text data that contains personal information and preprocessing it; a step of initially setting up the personal information detection model by configuring the preset personal information detection algorithm; a step of providing the training data to the personal information detection model so that it performs training; a model testing step of testing the trained personal information detection model; a model tuning step of analyzing the test results from the model testing step and tuning at least one of a plurality of parameters of the personal information detection model; and a model validation step of providing validation data to the tuned personal information detection model and evaluating the accuracy of the detected personal information.

According to an embodiment, in the training data construction step, personal information entity tagging is performed on the collected text data, and the text data containing personal information is text data that contains personal information either in its ordinary form or in an altered form.

According to an embodiment, when personal information entity tagging is performed on the collected text data in the training data construction step, the subject information of each item of personal information is labeled together with it.

According to an embodiment, in the step of receiving unstructured data and detecting personal information by the personal information detection model, dialogue act information is analyzed from the unstructured data to determine the likelihood that personal information is included.

According to an embodiment, clustering the detected personal information by subject, by the risk analysis unit, based on the detected personal information and the corresponding subject information comprises: classifying the detected personal information based on personal identification risk; and defining attribute relationships between personal information classes.

According to an embodiment, the attribute relationships between personal information classes include at least one of a causal relationship, an inclusion relationship, and a direct or indirect relationship.

According to an embodiment, the personal information exposure risk is calculated as a score or a grade.
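The clustering and quantification steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the entity type names, the weights in `WEIGHTS`, the additive scoring rule, and the grade thresholds are all hypothetical, since this part of the specification does not give concrete values or a formula. The sketch shows how items that are individually weak identifiers can add up to a higher risk once they are grouped under the same subject.

```python
from collections import defaultdict

# Hypothetical weights per entity type (assumed, not from the specification).
WEIGHTS = {"NAME": 2, "PHONE": 3, "ADDRESS": 3, "AGE": 1}


def cluster_by_subject(detections):
    """Group (subject, entity_type, value) tuples by the subject they refer to."""
    clusters = defaultdict(list)
    for subject, entity_type, value in detections:
        clusters[subject].append((entity_type, value))
    return dict(clusters)


def risk_score(items):
    """Sum the assumed weights of the distinct entity types in one cluster."""
    distinct_types = {entity_type for entity_type, _ in items}
    return sum(WEIGHTS.get(entity_type, 1) for entity_type in distinct_types)


def risk_grade(score):
    """Map a score to a coarse grade; the thresholds are illustrative."""
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


detections = [
    ("speaker_A", "NAME", "Kim"),
    ("speaker_A", "PHONE", "010-1234-5678"),
    ("speaker_A", "AGE", "31"),
    ("speaker_B", "AGE", "27"),
]
clusters = cluster_by_subject(detections)
for subject, items in clusters.items():
    score = risk_score(items)
    print(subject, score, risk_grade(score))
```

Reporting both `risk_score` and `risk_grade` mirrors the statement that the exposure risk may be expressed as a score or a grade.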

The method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention can use named entity recognition to detect personal information written in altered forms that deviate from the usual format.

In addition, an embodiment of the present invention can calculate the personal information exposure risk for each subject by combining individual items of personal information that each have a low likelihood of identifying a person.

FIG. 1 is a flowchart schematically illustrating a method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention.
FIG. 2 is a flowchart showing in more detail the step of building a personal information detection model according to an embodiment of the present invention.
FIG. 3 is a flowchart showing in more detail the step of clustering personal information by subject according to an embodiment of the present invention.
FIG. 4 is a diagram schematically illustrating a system that performs the method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the components of a server according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of personal information entity tagging according to an embodiment of the present invention.
FIG. 7 is a diagram showing the items and definitions of dialogue act information according to an embodiment of the present invention.
FIG. 8 is a diagram visually showing an example of clustering personal information and calculating the personal information exposure risk according to an embodiment of the present invention.

The present invention as described above will now be described in detail with reference to the accompanying drawings and embodiments.

It should be noted that the technical terms used in the present invention are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in the present invention, the technical terms used herein should be interpreted as having the meaning commonly understood by a person of ordinary skill in the art to which the present invention belongs, and should not be interpreted in an excessively broad or an excessively narrow sense. When a technical term used in the present invention is an incorrect term that does not accurately express the spirit of the invention, it should be replaced with, and understood as, a technical term that a person skilled in the art can correctly understand. General terms used in the present invention should be interpreted as defined in dictionaries or according to the surrounding context, and should not be interpreted in an excessively narrow sense.

Singular expressions used in the present invention include plural expressions unless the context clearly indicates otherwise. In the present invention, terms such as "consists of" or "comprises" should not be construed as necessarily including all of the components or steps described; some of the components or steps may be absent, and additional components or steps may be included.

Terms including ordinal numbers, such as first and second, may be used in the present invention to describe components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, a first component may be called a second component, and similarly a second component may be called a first component, without departing from the scope of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted.

In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the invention. It should also be noted that the accompanying drawings are provided only to make the spirit of the present invention easier to understand, and the spirit of the invention should not be construed as being limited by them.

In the following description, the "system that performs the method for detecting personal information in unstructured data and measuring exposure risk" of the present invention may be abbreviated as the "system" for convenience.

Hereinafter, a method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention is described in detail with reference to the drawings.

FIG. 1 is a flowchart schematically illustrating a method for detecting personal information in unstructured data and measuring exposure risk according to an embodiment of the present invention. FIG. 2 is a flowchart showing in more detail the personal information detection model building step (S110). FIG. 3 is a flowchart showing in more detail the step of clustering personal information by subject (S130). FIG. 4 is a diagram schematically illustrating a system that performs the method. FIG. 5 is a block diagram showing the components of the server 100. FIG. 6 is a diagram showing an example of personal information entity tagging. FIG. 7 is a diagram showing the items and definitions of dialogue act information. FIG. 8 is a diagram visually showing an example of clustering personal information and calculating the personal information exposure risk. Each figure relates to an embodiment of the present invention.

Referring first to FIG. 4, a system according to an embodiment of the present invention may be implemented as a user terminal 200 and a server 100.

The user terminal 200 may provide unstructured data, receive the personal information exposure risk from the server 100, and display it to the user. Here, unstructured data means conversational text data. Examples of conversational text data include the text of messenger or SMS messages and posts or comments on social media or internet cafes.

In mobile messengers, social media, and internet cafe posts or comments, personal information often appears in altered forms in which digits are replaced with Korean characters, English letters, special characters, and the like, as in "공1공-1234-오6칠8" or "o1o-①234-5육78". The artificial intelligence modeler 110 may therefore include such altered data in the training data it provides to the personal information detection model 120.

According to an embodiment, the unstructured data may include photographs or images. Photographs or images attached to messenger or SMS messages, or to social media or internet cafe posts or comments, may be stored and processed with OCR (Optical Character Recognition), so that personal information is detected based on the text recognized in identification cards such as a resident registration card or in documents related to personal details such as transcripts and graduation certificates.

The user terminal 200 encompasses wireless communication devices that a user can carry, such as a smartphone, a smart pad, a tablet PC, and terminals for PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), and WiBro (Wireless Broadband Internet), as well as computing devices such as desktop PCs and laptops.

The server 100 communicates with the user terminal 200 to exchange information and data, and is responsible for detecting personal information, measuring exposure risk, and providing the exposure risk. In terms of hardware, the server 100 has the same configuration as a typical web server. In terms of software, it may include program modules that are implemented in various languages such as C, C++, Java, Visual Basic, and Visual C and that perform various functions. The server 100 may also be implemented using the web server programs that are available for general server hardware under various operating systems such as DOS, Windows, Linux, UNIX, Macintosh, Android, and iOS.

An example of the network communication connecting the user terminal 200 and the server 100 according to this embodiment is a mobile communication network built according to technical standards or communication methods for mobile communication (for example, GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), EV-DO (Evolution-Data Optimized), WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), 5G, and the like), but the network is not particularly limited to these. An example of a wired communication network is a closed network such as a LAN (Local Area Network) or WAN (Wide Area Network), although an open network such as the Internet is preferable. The Internet refers to the worldwide open computer network structure that provides the TCP/IP protocol and the various services on its upper layers, namely HTTP (HyperText Transfer Protocol), Telnet, FTP (File Transfer Protocol), DNS (Domain Name System), SMTP (Simple Mail Transfer Protocol), SNMP (Simple Network Management Protocol), NFS (Network File System), and NIS (Network Information Service).

With the structure described above, the system according to an embodiment of the present invention receives unstructured data from the user terminal 200, detects personal information, measures the exposure risk, and provides the quantified exposure risk to the user terminal 200.

Referring to FIG. 5, the server 100 according to an embodiment of the present invention may include an artificial intelligence modeler 110, a personal information detection model 120, and a risk analysis unit 130.

The artificial intelligence modeler 110 builds the personal information detection model 120 by providing training data to a preset personal information detection algorithm.

The process by which the artificial intelligence modeler 110 builds the personal information detection model 120 is described in more detail below. The artificial intelligence modeler 110 first collects text data that contains personal information and preprocesses it. To this end, the artificial intelligence modeler 110 may collect the text data through text mining.

Once the text data has been collected, the artificial intelligence modeler 110 performs data preprocessing such as data normalization, data separation, morphological analysis (tokenization), named entity recognition, lemmatization, stopword removal, and word frequency analysis.

Data normalization unifies words that are written differently. For example, when the same expression appears with a typo, both spellings are treated as the same word: '반값습니다' (a misspelling) and '반갑습니다' ("nice to meet you") can be treated as the same word through data normalization.
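As a minimal sketch of the normalization idea, the following uses a lookup table that maps known variant spellings to a canonical form. The table `CANONICAL_FORMS` and the function `normalize_tokens` are hypothetical names introduced here; the specification does not state how normalization is actually implemented.

```python
# Hypothetical variant table (assumed for illustration): misspelling -> canonical form.
CANONICAL_FORMS = {
    "반값습니다": "반갑습니다",
}


def normalize_tokens(tokens):
    """Replace each known variant with its canonical form; keep other tokens as-is."""
    return [CANONICAL_FORMS.get(token, token) for token in tokens]


print(normalize_tokens(["반값습니다", "여러분"]))
```

A fixed table only covers variants that were anticipated in advance, so a real system would likely need a broader approach; the table is used here only to make the step concrete.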

Data separation may be performed when the data needs to be split according to its characteristics.

Morphological analysis (tokenization) converts text into the smallest units of language that carry meaning, and is also referred to as part-of-speech tagging. Tokenization is usually performed on a word basis, but may also be performed on the basis of characters, phrases, sentences, paragraphs, and so on.

Named entity recognition refers to recognizing a specific word as a named entity. The artificial intelligence modeler 110 performs personal information entity tagging on the collected text data. Personal information entity tagging tags the data, among the collected text data, that can represent personal information. As shown in FIG. 6, the personal information entities may include name, nickname/pet name, date of birth, age, gender, height, weight, blood type, religion, nationality, club/society, address, building name, medical insurance number, resident registration number, foreigner registration number, passport number, driver's license number, mobile phone number, landline/fax number, card number, account number, email address, vehicle number, workplace name, department name, position/rank, school name, school year, major, ID, URL, IP information, military service number, and military unit. Each of these has an entity tag made up of an abbreviation in English letters, underscores, and the like.
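Entity tagging of training text can be sketched as token-level labeling. The example below uses the BIO convention (B- for the first token of an entity, I- for the following tokens, O for everything else), which is a common labeling scheme and an assumption here rather than something stated in the specification. The tag name `TEL_MOBILE` is likewise hypothetical, since the actual tag names of FIG. 6 are not reproduced in the text.

```python
def to_bio(tokens, spans):
    """Label tokens with BIO tags.

    spans is a list of (start, end, tag) token index ranges, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, tag in spans:
        labels[start] = f"B-{tag}"
        for index in range(start + 1, end):
            labels[index] = f"I-{tag}"
    return list(zip(tokens, labels))


# An altered phone number spread over five tokens, tagged as one entity.
tokens = ["내", "번호", "는", "공1공", "-", "1234", "-", "오6칠8"]
spans = [(3, 8, "TEL_MOBILE")]  # hypothetical tag name
for token, label in to_bio(tokens, spans):
    print(token, label)
```

Because the altered form is labeled in the same way as a standard number would be, a model trained on such data can learn both variants under one entity type.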

The artificial intelligence modeler 110 may perform named entity recognition using a built-in Transformer-based named entity recognizer.

Lemmatization (restoration of the base form) converts the inflected surface form of an adjective or verb into its base form, or reduces inflected forms to a common stem. For example, '쉬고 싶다' and '쉬고 싶은' can be converted to '쉬다' ("to rest"), and '보다', '보니', and '보고' can be converted to the stem '보-' ("see").

Stopword removal is the process of removing particles such as 은, 는, 이, and 가, as well as suffixes.
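A minimal sketch of stopword removal is shown below. It assumes that the text has already been split into morphemes by an earlier step, and the set `STOPWORDS` contains only the four particles named above; both the function name and the input format are assumptions for illustration.

```python
# The four particles mentioned in the text; a real list would be longer.
STOPWORDS = {"은", "는", "이", "가"}


def remove_stopwords(morphemes):
    """Drop morphemes that appear in the stopword set, preserving order."""
    return [morpheme for morpheme in morphemes if morpheme not in STOPWORDS]


print(remove_stopwords(["날씨", "가", "좋다"]))
```

Filtering after morphological analysis matters because a particle such as 가 is attached to the preceding noun in raw text and cannot be matched as a separate word before the text is split.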

Word frequency analysis is used to decide whether stopwords and high-frequency words should be removed, or to check that the required words have been extracted correctly.

Next, the personal information detection model 120 is initially set up by configuring the preset personal information detection algorithm.

Here, the preset personal information detection algorithm is an artificial neural network learning model that can be applied to the personal information detection model 120. It may be any one of a recurrent neural network (RNN) model, a convolutional neural network (CNN), and BERT (Bidirectional Encoder Representations from Transformers), or a combination of one or more of these.

이후, 인공지능 모델러(110)는 학습데이터를 개인정보 탐지 모델(120)에 제공하여 학습을 수행하도록 한다.Then, the artificial intelligence modeler 110 provides learning data to the personal information detection model 120 to perform learning.

Next, the artificial intelligence modeler (110) tests the personal information detection model (120) that has performed learning.

If the test result falls short of a preset accuracy, the artificial intelligence modeler (110) may tune at least one of a plurality of parameters of the personal information detection model (120).

The personal information detection model (120) then performs learning again using the learning data.

After that, the artificial intelligence modeler (110) may test again the personal information detection model (120) in which at least one of the plurality of parameters has been tuned.

If the test result then exceeds the preset accuracy, the artificial intelligence modeler (110) provides verification data to the personal information detection model (120) to verify the accuracy of the detected personal information.

If the verification result falls short of the preset accuracy, the artificial intelligence modeler (110) may analyze the verification result and again tune at least one of the plurality of parameters of the personal information detection model (120).

If the verification result exceeds the preset accuracy, the personal information detection model (120) may be judged ready for actual personal information detection.

The personal information detection model (120) receives unstructured data and detects personal information. In doing so, the personal information detection model (120) may use named entity recognition.

The risk analysis unit (130) may cluster the detected personal information for each subject of the personal information, based on the detected personal information and the corresponding target information, and may quantify the personal information exposure risk for each cluster.

Referring to FIG. 1, the method for detecting personal information from unstructured data and measuring exposure risk according to an embodiment of the present invention may include: a step (S110) of building, by the artificial intelligence modeler (110), the personal information detection model (120) by providing learning data to a preset personal information detection algorithm; a step (S120) of receiving unstructured data and detecting personal information by the personal information detection model (120); a step (S130) of clustering, by the risk analysis unit (130), the detected personal information for each subject of the personal information based on the detected personal information and the corresponding target information; and a step (S140) of quantifying, by the risk analysis unit (130), the personal information exposure risk for each cluster.

In the step (S110) of building the personal information detection model (120) by providing learning data to the preset personal information detection algorithm, the artificial intelligence modeler (110) first collects text data containing personal information and preprocesses it. To this end, the artificial intelligence modeler (110) may collect the text data through text mining. Once the text data has been collected, the artificial intelligence modeler (110) performs data preprocessing such as data normalization, data splitting, morphological analysis (tokenization), named entity recognition, lemmatization, stopword removal, and word frequency analysis. Data preprocessing is described later with reference to FIG. 2.

In the step (S120) of receiving unstructured data and detecting personal information, the personal information detection model (120) may detect personal information in the input unstructured data through named entity recognition.

The step (S130) of clustering, by the risk analysis unit (130), the detected personal information for each subject of the personal information based on the detected personal information and the corresponding target information, and the step (S140) of quantifying, by the risk analysis unit (130), the personal information exposure risk for each cluster are described in detail later with reference to FIGS. 3 and 8.

Referring to FIG. 2, the step (S110) of building the personal information detection model (120) by the artificial intelligence modeler (110) may, in more detail, include: a learning data construction step (S101) of collecting text data containing personal information and performing preprocessing; a step (S102) of initializing the personal information detection model (120) by configuring the preset personal information detection algorithm; a step (S103) of providing the learning data to the personal information detection model (120) so that it performs learning; a model testing step (S104) of testing the personal information detection model (120) that has performed learning; a model tuning step (S105) of analyzing the test result of the model testing step (S104) and tuning at least one of a plurality of parameters of the personal information detection model (120); and a model verification step (S106) of providing verification data to the tuned personal information detection model (120) and evaluating the accuracy of the detected personal information.

In the learning data construction step (S101) of collecting text data containing personal information and performing preprocessing, the artificial intelligence modeler (110) first collects text data containing personal information and preprocesses it. To this end, the artificial intelligence modeler (110) may collect the text data through text mining.

In the step (S102) of initializing the personal information detection model (120) by configuring the preset personal information detection algorithm, the preset personal information detection algorithm is an artificial neural network learning model that can be applied to the personal information detection model (120). It may be any one of a recurrent neural network (RNN) model, a convolutional neural network (CNN), and BERT (Bidirectional Encoder Representations from Transformers), or a combination of one or more of these.

Next, the step (S103) of providing the learning data to the personal information detection model (120) so that it performs learning is carried out.

In the model testing step (S104), the artificial intelligence modeler (110) tests the personal information detection model (120) that has performed learning. If the test result falls short of the preset accuracy, the artificial intelligence modeler (110) may proceed to the model tuning step (S105) of tuning at least one of the plurality of parameters of the personal information detection model (120). The personal information detection model (120) then performs learning again using the learning data.

The process then returns to the model testing step (S104), where the personal information detection model (120) in which at least one of the plurality of parameters has been tuned may be tested again.

If the test result then exceeds the preset accuracy, the process moves on to the next step.

In the model verification step (S106) of providing verification data to the tuned personal information detection model (120) and evaluating the accuracy of the detected personal information, if the verification result falls short of the preset accuracy, the artificial intelligence modeler (110) may analyze the verification result and again tune at least one of the plurality of parameters of the personal information detection model (120).

The personal information detection model (120) then performs learning again using the learning data.
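
The learn, test, tune, and verify cycle of steps S103 to S106 can be sketched as a simple control loop. This is only an illustration under stated assumptions: `train`, `evaluate_test`, `evaluate_verification`, and `tune` are hypothetical callables supplied by the caller, `max_rounds` is an added safeguard not mentioned in the text, and a score equal to the threshold is treated as passing because the text does not address that case.

```python
def build_model(train, evaluate_test, evaluate_verification, tune,
                params, threshold, max_rounds=10):
    """Repeat learning until both test and verification reach the threshold."""
    for _ in range(max_rounds):
        model = train(params)                          # S103: learn from learning data
        if evaluate_test(model) < threshold:           # S104: model test
            params = tune(params)                      # S105: tune, then learn again
            continue
        if evaluate_verification(model) < threshold:   # S106: model verification
            params = tune(params)                      # tune again, then learn again
            continue
        return model, params                           # ready for actual detection
    raise RuntimeError("preset accuracy not reached within max_rounds")


# Toy stand-ins: the "model" is just its parameters, and accuracy grows with epochs.
model, final_params = build_model(
    train=lambda p: dict(p),
    evaluate_test=lambda m: m["epochs"],
    evaluate_verification=lambda m: m["epochs"] - 1,
    tune=lambda p: {"epochs": p["epochs"] + 1},
    params={"epochs": 1},
    threshold=3,
)
print(final_params)
```

In the toy run, the test step fails twice, the verification step fails once, and the loop returns on the fourth round.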

Referring to FIG. 3, the step (S130) of clustering the detected personal information for each subject of the personal information, based on the detected personal information and the corresponding target information, may include a step of classifying the detected personal information based on personal identification risk and a step of defining attribute relationships between personal information classes.

In the step (S131) of classifying the detected personal information based on personal identification risk, a score corresponding to the personal identification risk may be set in advance for each personal information entity.

In the step (S132) of defining attribute relationships between personal information classes, the attribute relationships may include at least one of a causal relationship, an inclusion relationship, and a direct/indirect relationship.

In the step (S140) of quantifying, by the risk analysis unit, the personal information exposure risk for each cluster, a personal information exposure risk score is set in advance for each type of detected personal information, as shown in FIG. 8. As in FIG. 8, the highest score among the personal information detected in each cluster may be set as the personal information exposure risk score of that cluster. However, the method is not limited to this: the score may instead be calculated by applying a weight based on the number of personal information items detected in each cluster and on the likelihood that a person can be identified from the combination of detected items, and the score may also be converted into a grade by score range.

FIG. 6 shows an example of performing personal information entity name tagging on the collected text data in the learning data construction step (S101).

The personal information entity names may include name, nickname/pet name, date of birth, age, gender, height, weight, blood type, religion, nationality, club/society, address, building name, medical insurance number, resident registration number, alien registration number, passport number, driver's license number, mobile phone number, landline/fax number, card number, account number, e-mail address, vehicle number, workplace name, department name, job title/rank, school name, school year, major, ID, URL, IP information, military service number, and military unit. Each entity name has a corresponding tag composed of an English-letter abbreviation, underscores, and the like.

FIG. 7 shows an example of how dialogue act information is defined.

Here, a dialogue act (speech act) is a linguistic act, that is, an act performed through language, and the concept derives from a linguistic theory founded in the 1960s. A dialogue act can be defined as a group interaction process in which a single utterance produced without interruption by one member of a group is received by the other members as a specific function or act. Alternatively, in small-group discourse or conversation, each turn of the conversation can be regarded as a separate linguistic act, that is, a dialogue act.

The dialogue act information may include Question, Inform, Answer, Request, Offer, Other, and None. Question applies when the speaker wants to obtain specific information from the listener. Inform applies when the speaker wants to convey to the listener information that constitutes the semantic content (the information is assumed to be accurate), that is, when the speaker delivers information to the listener. Answer applies when the speaker wants to convey such information (again assumed to be accurate) and the dialogue act of the immediately preceding utterance is Question. Request applies when the speaker wants the listener to perform a requested action, that is, a directive act (command speech act). Offer applies when the speaker is willing to act in the way the listener wishes (the speaker presents options), that is, a commissive act (promise speech act). Other covers dialogue acts that do not belong to any of the above, and None denotes an utterance in which no dialogue act is apparent.

Accordingly, the artificial intelligence modeler (110) may configure the personal information detection model (120) to judge that an utterance is highly likely to contain personal information when its dialogue act is Inform, Answer, or Offer. Likewise, it may configure the personal information detection model (120) to judge that an utterance is unlikely to contain personal information when its dialogue act is Other or None.
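
The dialogue act rule above amounts to a small lookup. The sketch below is illustrative only: the function name and the "high"/"low" labels are assumptions, and because the text does not say how Question and Request are treated, they are returned as "unspecified" rather than guessed.

```python
# Dialogue acts the text associates with a high or low likelihood of personal information.
HIGH_LIKELIHOOD_ACTS = {"Inform", "Answer", "Offer"}
LOW_LIKELIHOOD_ACTS = {"Other", "None"}


def pii_likelihood(dialogue_act):
    """Map a dialogue act label to a coarse likelihood of containing personal information."""
    if dialogue_act in HIGH_LIKELIHOOD_ACTS:
        return "high"
    if dialogue_act in LOW_LIKELIHOOD_ACTS:
        return "low"
    # Question and Request are not covered by the rule in the text.
    return "unspecified"


for act in ["Question", "Inform", "Answer", "Request", "Offer", "Other", "None"]:
    print(act, pii_likelihood(act))
```

Such a signal could be used to prioritize which utterances the detector examines most closely, though the text does not prescribe how the likelihood is consumed.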

FIG. 8 shows an example in which personal information is detected from chat (conversation) data between two people, classified and clustered by subject, and then quantified.

First, the following personal information can be identified from the conversation: pediatrician, dermatologist, Naju, national hospital, 36 years old, dentist, church, and "oppa" (older brother, as addressed by a female speaker). When each of these items is clustered according to the subject it refers to, three clusters can be formed: cluster 1 contains pediatrician; cluster 2 contains dermatologist, Naju, national hospital, and 36 years old; and cluster 3 contains dentist, church, and older brother.
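
As a rough illustration of grouping detected items by subject, the sketch below assumes each detected item already carries a subject label (the hypothetical `person_1`, `person_2`, `person_3`), for example from target information labeled alongside the entity tags. It shows only the final grouping, not how the risk analysis unit determines which subject an item belongs to.

```python
from collections import defaultdict

# Detected items paired with hypothetical subject labels (assumed to be given).
detected = [
    ("pediatrician", "person_1"),
    ("dermatologist", "person_2"),
    ("Naju", "person_2"),
    ("national hospital", "person_2"),
    ("36 years old", "person_2"),
    ("dentist", "person_3"),
    ("church", "person_3"),
    ("older brother", "person_3"),
]


def cluster_by_subject(items):
    """Group detected values under the subject each one refers to."""
    clusters = defaultdict(list)
    for value, subject in items:
        clusters[subject].append(value)
    return dict(clusters)


clusters = cluster_by_subject(detected)
for subject, values in clusters.items():
    print(subject, values)
```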

In cluster 1, pediatrician (occupation) has a personal information exposure risk (personal information re-identification risk) of 0.2.

In cluster 2, dermatologist (occupation), Naju national hospital (workplace), and 36 years old (age) score 0.2, 0.6, and 0.8 respectively, so the cluster's personal information exposure risk (personal information re-identification risk) is 0.8.

In cluster 3, dentist (occupation), church (religion), and older brother (gender) score 0.2, 0.3, and 0.4 respectively, so the cluster's personal information exposure risk (personal information re-identification risk) is 0.4.

As this shows, a personal information exposure risk score is set in advance for each type of detected personal information: 0.2 for occupation, 0.3 for religion, 0.4 for gender, 0.6 for workplace, and 0.8 for age.

FIG. 8 also shows that the highest score among the personal information detected in each cluster is set as that cluster's personal information exposure risk (personal information re-identification risk).
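
The highest-score rule of FIG. 8 can be reproduced with the per-type scores given in the text. The dictionary and function names below are illustrative choices, not taken from the source.

```python
# Preset scores per personal information type, as stated in the description.
TYPE_SCORES = {
    "occupation": 0.2,
    "religion": 0.3,
    "gender": 0.4,
    "workplace": 0.6,
    "age": 0.8,
}

# Types detected in each cluster of the FIG. 8 example.
CLUSTERS = {
    1: ["occupation"],
    2: ["occupation", "workplace", "age"],
    3: ["occupation", "religion", "gender"],
}


def cluster_risk_max(types):
    """Exposure risk of a cluster: the highest preset score among its types."""
    return max(TYPE_SCORES[t] for t in types)


risks = {cid: cluster_risk_max(types) for cid, types in CLUSTERS.items()}
print(risks)  # {1: 0.2, 2: 0.8, 3: 0.4}
```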

However, the method is not limited to this. Weights may be set in consideration of the number of personal information items exposed in each cluster and the likelihood that a person can be identified from the combination of exposed items.

For example, because occupation, workplace, and age are all exposed in cluster 2, the subject can be judged highly likely to be identified as a specific person. The personal information exposure risk can therefore be calculated by summing the individual scores and multiplying the sum by a preset weight, for example 1.5. In this example, the personal information exposure risk of cluster 2 is (0.2 + 0.6 + 0.8) × 1.5 = 2.4.
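
The weighted variant for cluster 2 can be checked with a few lines of arithmetic. The function name is illustrative, and rounding to two decimal places is an added assumption to avoid floating-point noise in the printed result.

```python
def cluster_risk_weighted(scores, weight=1.5):
    """Sum the per-item scores and multiply by a preset weight."""
    return round(sum(scores) * weight, 2)


# Cluster 2: occupation 0.2, workplace 0.6, age 0.8, weight 1.5.
cluster2_risk = cluster_risk_weighted([0.2, 0.6, 0.8], weight=1.5)
print(cluster2_risk)  # 2.4
```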

The quantified personal information exposure risk need not be shown only as a score; a grade may be assigned to each score range so that the risk is displayed as a grade. For example, scores from 0 to 0.3 may be set to exposure risk grade 5, from 0.3 to 0.5 to grade 4, from 0.6 to 1.0 to grade 3, from 1.1 to 1.5 to grade 2, and 1.6 or higher to grade 1.
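
A score-to-grade mapping along these lines might look as follows. The listed ranges overlap at 0.3 and leave small gaps (for example between 0.5 and 0.6), so this sketch makes two assumptions that are not in the text: a score of exactly 0.3 is assigned grade 5, and a score that falls in a gap is assigned the riskier (lower-numbered) grade.

```python
def exposure_grade(score):
    """Convert an exposure risk score into a grade from 5 (lowest risk) to 1 (highest)."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 0.3:
        return 5
    if score <= 0.5:
        return 4
    if score <= 1.0:   # assumption: the 0.5-0.6 gap falls into grade 3
        return 3
    if score <= 1.5:   # assumption: the 1.0-1.1 gap falls into grade 2
        return 2
    return 1           # assumption: the 1.5-1.6 gap falls into grade 1


for s in [0.2, 0.4, 0.8, 1.2, 2.4]:
    print(s, exposure_grade(s))
```

Under this mapping, the max-based scores of the three example clusters (0.2, 0.8, 0.4) correspond to grades 5, 3, and 4, and the weighted score of 2.4 for cluster 2 corresponds to grade 1.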

Preferred embodiments of the present invention have been shown and described above. However, the present invention is not limited to these embodiments, and anyone with ordinary skill in the technical field to which the invention belongs may make various modifications without departing from the gist of the invention as set out in the claims.

100: server
110: artificial intelligence modeler
120: personal information detection model
130: risk analysis unit
200: user terminal

Claims

A method for detecting personal information from unstructured data and measuring exposure risk, the method comprising:
building, by an artificial intelligence modeler, a personal information detection model by providing learning data to a preset personal information detection algorithm;
detecting, by the personal information detection model, personal information from unstructured data received as input;
clustering, by a risk analysis unit, the detected personal information for each subject of the personal information based on the detected personal information and corresponding target information; and
quantifying, by the risk analysis unit, a personal information exposure risk for each cluster,
wherein the building of the personal information detection model by the artificial intelligence modeler comprises:
a learning data construction step of collecting text data containing personal information and performing preprocessing;
a step of initializing the personal information detection model by configuring the preset personal information detection algorithm;
a step of providing the learning data to the personal information detection model so that it performs learning;
a model testing step of testing the personal information detection model that has performed learning;
a model tuning step of tuning at least one of a plurality of parameters of the personal information detection model by analyzing a test result of the model testing step; and
a model verification step of evaluating the accuracy of the detected personal information by providing verification data to the tuned personal information detection model.

(Deleted)

According to claim 1,
wherein the learning data construction step performs personal information entity name tagging on the collected text data, and
the text data containing personal information is text data that contains personal information in its ordinary form or in a modified form.

According to claim 3,
wherein, when the personal information entity name tagging is performed on the collected text data in the learning data construction step, the target information of the personal information is labeled together with it.

According to claim 1,
wherein, in the step of receiving unstructured data and detecting personal information by the personal information detection model, the likelihood that personal information is included is determined by analyzing dialogue act information from the unstructured data.

According to claim 1,
wherein the step of clustering, by the risk analysis unit, the detected personal information for each subject of the personal information based on the detected personal information and corresponding target information comprises:
classifying the detected personal information based on personal identification risk; and
defining attribute relationships between personal information classes.

According to claim 6,
The method of detecting personal information from unstructured data and measuring exposure risk, characterized in that the attribute relationship between the personal information classes includes a causal relationship, an inclusion relationship, and a direct or indirect relationship.

According to claim 1,
The method of detecting personal information from unstructured data and measuring the risk of exposure, characterized in that the personal information exposure risk is calculated as a score or grade.

According to claim 1,
The method of detecting personal information from unstructured data and measuring the risk of exposure, characterized in that the unstructured data is interactive text data.