KR102395550B1 - Method and apparatus for analyzing confidential information

Info

Publication number: KR102395550B1
Application number: KR1020200126846A
Authority: KR
Inventors: 이삼일
Original assignee: 주식회사 에임시스
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-05-09
Also published as: KR20220043445A

Abstract

The present invention relates to a method and apparatus for analyzing confidential information. A method for analyzing confidential information according to an embodiment of the present invention is performed by an electronic device capable of computing and analyzes whether confidential information is stored in a table included in a database. The method includes: estimating, for each column of the table, whether confidential information is stored and the type of the stored confidential information, using a machine learning model trained in advance by a machine learning technique based on metadata of the database; and generating a confidential-information-related indicator for the table based on the estimated information.

Description

Method and apparatus for analyzing confidential information

The present invention relates to a method and apparatus for analyzing confidential information, and more particularly, to a method and apparatus that use a machine learning technique to perform various analyses related to whether confidential information is stored in a database storing various kinds of information.

A database is a collection of data that is organized and managed in an integrated manner so that it can be shared by multiple users. In particular, a database may also store confidential information that must be kept secret, such as personal information. Because such confidential information requires special management, analysis such as detecting confidential information in the database is needed. However, because a database stores a vast amount of data, it is not easy to analyze which of that data corresponds to confidential information.

Conventionally, confidential information has been detected in a database either by an administrator personally locating detection targets and designating confidential information one by one, or by sampling the actual data stored in every column and performing pattern matching. With these conventional methods, however, processing takes a long time because of the large volume of data and the many different databases involved, and it is also difficult to respond to changes in the tables and columns of a database.

In addition, to prioritize security management targets among tables containing confidential information, the prior art simply uses the number of detected confidential data items or a high/medium/low confidential-information risk level as the prioritization criterion. Such criteria, however, lack the discriminating power needed to prioritize security management targets.

KR10-2007-0039478 A

To solve the problems of the prior art described above, an object of the present invention is to provide a method and apparatus that use a machine learning technique to perform analysis related to whether confidential information is stored in a database storing various kinds of information.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

A method according to an embodiment of the present invention for solving the above problems is performed by an electronic device capable of computing and analyzes whether confidential information is stored in a table included in a database. The method includes: estimating, for each column of the table, whether confidential information is stored and the type of the stored confidential information, using a machine learning model trained in advance by a machine learning technique based on metadata of the database; and generating a confidential-information-related indicator for the table based on the estimated information.

The machine learning model may be trained using training data comprising pairs of input data and output data. The input data may include one or more metadata items selected from the metadata, and the output data may include whether confidential information is stored and the type of confidential information.

The metadata item may be at least one selected from a column name, a column type, a column length, and a stored column data length.

The metadata item may include a substitution value of the stored column data, where the substitution value is obtained by replacing the digits in the stored column data with a specific digit or replacing the characters with a specific character.
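The metadata items and the substitution value described above can be illustrated with a minimal Python sketch. The function names, the dictionary keys, and the choice of `9` and `A` as the replacement digit and letter are assumptions made for illustration; the specification does not fix them.

```python
def substitution_value(value: str, digit: str = "9", letter: str = "A") -> str:
    """Replace every digit with `digit` and every letter with `letter`.

    Other characters (hyphens, '@', dots) are kept, so the result
    preserves the shape of the stored value without its content.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(digit)
        elif ch.isalpha():
            out.append(letter)
        else:
            out.append(ch)
    return "".join(out)


def build_features(name: str, col_type: str, col_length: int, sample: str) -> dict:
    """Assemble one model input from the metadata items named in the text.

    The key names are hypothetical; only the items themselves (column name,
    column type, column length, stored data length, substitution value)
    come from the specification.
    """
    return {
        "column_name": name,
        "column_type": col_type,
        "column_length": col_length,
        "stored_length": len(sample),
        "substitution": substitution_value(sample),
    }
```

Under these assumptions, a phone number and an e-mail address reduce to clearly different shapes, which is the kind of signal a model could learn from without reading the underlying values.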

Generating the indicator may include: setting a weight and a grade for each of a plurality of classification items for the table; and calculating the confidential-information-related indicator for the table based on the weight and grade set for each classification item.

The classification items may include at least one selected from the group consisting of: a first classification item whose grade is set according to the risk level of the highest-risk column among the columns estimated to store confidential information (confidential information columns); a second classification item whose grade is set according to the number of confidential information columns; a third classification item whose grade is set according to the number of records containing a confidential information column; a fourth classification item whose grade is set according to the importance of the table; a fifth classification item whose grade is set according to whether the table is disclosed externally or according to the users of the table; and a sixth classification item whose grade is set according to the department that uses the table.

The method may further include providing information on the management priority of each table by using the confidential-information-related indicators generated for a plurality of tables.
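The indicator calculation and the prioritization step can be sketched as follows. The passage above says only that the indicator is calculated from the weight and grade of each classification item; the weighted average used here is an assumed formula, and the class, function, and item names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassificationItem:
    name: str      # e.g. "max_column_risk" (hypothetical label)
    weight: float  # relative weight set by the administrator
    grade: int     # grade set for this table under this item


def risk_indicator(items: list[ClassificationItem]) -> float:
    """Weighted average of grades (an assumed formula, not the patent's)."""
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0
    return sum(item.weight * item.grade for item in items) / total_weight


def rank_tables(tables: dict[str, list[ClassificationItem]]) -> list[tuple[str, float]]:
    """Order tables from highest to lowest indicator for management priority."""
    scores = {name: risk_indicator(items) for name, items in tables.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Because the administrator sets the weights, the same grades can yield different orderings for different organizations, which matches the stated goal of an adjustable indicator.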

An apparatus according to an embodiment of the present invention performs analysis related to whether confidential information is stored in a table included in a database. The apparatus includes: a storage unit that stores a machine learning model trained in advance by a machine learning technique based on metadata of the database; and a control unit that uses the machine learning model to estimate, for each column of the table, whether confidential information is stored and the type of the stored confidential information, and that controls generation of a confidential-information-related indicator for the table based on the estimated information.

The present invention configured as described above has the advantage that analysis related to whether confidential information is stored can be performed, using a machine learning technique, on a database storing various kinds of information.

Because the present invention does not scan all of the actually stored data but instead applies a machine learning technique to metadata to analyze whether confidential information is stored, it has the advantage that the analysis can be performed more quickly and accurately.

In addition, the present invention presents an automated method for determining the type of confidential information stored in structured data, which reduces the burden of security management work. It also presents an algorithm for calculating an administrator-adjustable confidential-information-related indicator for tables containing confidential information, which can serve as a criterion for prioritizing security management targets.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 shows a conceptual diagram of confidential information analysis according to an embodiment of the present invention.
FIG. 2 shows a configuration diagram of a confidential information analysis apparatus 100 according to an embodiment of the present invention.
FIG. 3 shows a configuration diagram of the control unit 150 of the confidential information analysis apparatus 100 according to an embodiment of the present invention.
FIG. 4 shows a flowchart of a confidential information analysis method according to an embodiment of the present invention.

The above objects and means of the present invention, and the resulting effects, will become clearer from the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains will be able to readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural where applicable, unless the phrase specifically states otherwise. In this specification, terms such as "include", "comprise", "be provided with", and "have" do not exclude the presence or addition of one or more components other than those mentioned.

In this specification, terms such as "or" and "at least one" may denote one of the words listed together or a combination of two or more of them. For example, "A or B" and "at least one of A and B" may include only one of A and B, or may include both A and B.

In this specification, information presented after "for example" and similar expressions, such as recited characteristics, variables, or values, may not match exactly, and embodiments of the invention according to its various embodiments should not be limited by effects such as variations arising from tolerances, measurement errors, limits of measurement accuracy, and other commonly known factors.

In this specification, when a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but it should be understood that other components may be present in between. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between.

In this specification, when a component is described as being "on" or "in contact with" another component, it may directly abut or be connected to that component, but it should be understood that another component may be present in between. In contrast, when a component is described as being "directly on" or "in direct contact with" another component, it may be understood that no other component is present in between. Other expressions describing relationships between components, such as "between" and "directly between", may be interpreted in the same way.

In this specification, terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. Nor should these terms be construed as limiting the order of the components; they are used to distinguish one component from another. For example, a "first component" may be called a "second component", and similarly a "second component" may be called a "first component".

Unless otherwise defined, all terms used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are expressly and specifically defined.

The present invention analyzes confidential information scattered across an IT environment, identifies its types, and determines its presence and quantity in order to derive the current status of confidential information, thereby providing a basis for appropriate security management. In particular, for a database system that stores structured data, the invention proposes a way to determine the type of confidential information stored by applying a machine learning technique to the metadata that describes the storage structure. It also presents an algorithm that generates a quantified confidential-information-related indicator (or risk indicator) for the detected confidential information, so that management work can be carried out efficiently and effectively. In other words, the present invention proposes an algorithm and approach that can be used to prioritize security management targets.

"Confidential information" means information that is permitted only to a specific person or organization and to no one else. That is, confidential information may refer to information permitted only to a specific department, position, task, or member within an organization. Here, information may broadly mean an electronic file such as a computer file, or the content it contains. Confidential information may include various types, such as personal information, business information, customer information, and design drawings. The form of the stored content of confidential information can be analyzed to determine whether it follows a consistent pattern, and information about that pattern can be generated. A pattern may be defined with regular expressions, which are widely used in the computer industry, or more simply with wildcards that indicate whether a specific string is included. Leakage of confidential information refers to a breach of confidentiality, for example the abnormal transmission or removal of confidential information.
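The two ways of defining a pattern mentioned above, regular expressions and wildcards, can be sketched with Python's standard `re` and `fnmatch` modules. The phone-number expression and the `*@*` wildcard are illustrative assumptions, not patterns taken from the specification.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical pattern: three digits, three or four digits, four digits.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3,4}-\d{4}")


def matches_regex(value: str) -> bool:
    """True when the whole value has the assumed phone-number shape."""
    return PHONE_PATTERN.fullmatch(value) is not None


def matches_wildcard(value: str, pattern: str = "*@*") -> bool:
    """True when the value matches a shell-style wildcard.

    The default pattern only checks that an '@' is present somewhere,
    which is the simple "contains this string" test the text describes.
    """
    return fnmatchcase(value, pattern)
```

A regular expression constrains the full shape of a value, while the wildcard form is a cheaper containment check; which one is appropriate depends on how strictly a given type of confidential information is formatted.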

The forms in which confidential information is stored can be broadly divided into structured and unstructured data. With structured data, as in a database system, the type, size, and so on of the data to be stored are checked so that only conforming data is stored. The information that defines the structure of structured data is called "metadata". Unstructured data, unlike structured data, stores various kinds of data in the same storage area. Examples of unstructured data storage are ordinary files such as general document files and text files. Structured data may also contain areas that store unstructured data. The present invention, however, mainly addresses how to efficiently identify the types of confidential information stored in structured data.

In a database, a column is the smallest unit of data storage, and a record is a meaningful set of data made up of columns. A table is a store of records, and a database is in turn a store of various tables. If a column, the most basic storage unit, contains confidential information, then the record that contains it necessarily contains confidential information as well. A column generally holds structured data of a single, uniform form, but it may also hold unstructured data. An unstructured column may store several different kinds of data, and its actual contents must be searched to determine whether it contains confidential information.

However, confidential information in the data of unstructured columns is already detected using various conventional techniques. The present invention therefore mainly deals with analyzing the types of confidential information in structured data rather than in unstructured data.

Hereinafter, a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows a conceptual diagram of confidential information analysis according to an embodiment of the present invention.

The confidential information analysis apparatus 100 according to an embodiment of the present invention (hereinafter, the "analysis apparatus") analyzes confidential information in a plurality of tables stored in a database (DB) or database system, and may be an electronic device, system, or computing network capable of computing. There may be more than one such database, each containing a plurality of tables. The database may be stored in a separate database system (DBS), as shown in FIG. 1, or, unlike what is shown in FIG. 1, in the storage unit of the confidential information analysis apparatus 100.

That is, referring to FIG. 1, the analysis apparatus 100 may connect to the database system (DBS), receive the metadata of a table included in the database, and use that metadata to perform various analyses related to whether confidential information is stored in the table. Alternatively, the analysis apparatus 100 may directly access a database stored in its own storage unit and perform the analysis using the metadata of the table.
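Reading a table's metadata without touching its rows can be sketched as below. SQLite and its `PRAGMA table_info` statement stand in for the database system here purely as an assumption for illustration; the specification does not name a particular database product, and the function name is hypothetical.

```python
import sqlite3


def column_metadata(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Return the name and declared type of each column in `table`.

    PRAGMA table_info yields one row per column in the form
    (cid, name, type, notnull, dflt_value, pk); no stored data is read.
    The table name is interpolated directly, so it must come from a
    trusted source such as the database catalog.
    """
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [{"name": row[1], "type": row[2]} for row in rows]
```

For example, calling `column_metadata(conn, "customers")` on a connection opened with `sqlite3.connect(...)` returns one dictionary per column, which could then be turned into model inputs.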

In particular, the analysis apparatus 100 may use a machine learning model to analyze the confidential information contained in the tables of a database. The machine learning model may be trained on the analysis apparatus 100, or trained on another device and then stored in the analysis apparatus 100.

The analysis apparatus 100 may also operate as a server for confidential information analysis. In that case, the analysis apparatus 100 may receive an analysis request for a database from a terminal, analyze the confidential information in that database according to the request, and transmit the result to the requesting terminal or to another terminal. The terminal may be an electronic device that stores the database, or an electronic device that can connect to the database system in which the database is stored.

For example, the electronic device may be a desktop personal computer (PC), a laptop PC, a tablet PC, a netbook computer, a workstation, a personal digital assistant (PDA), a smartphone, a smartpad, or a mobile phone, but is not limited thereto.

FIG. 2 shows a configuration diagram of the confidential information analysis apparatus 100 according to an embodiment of the present invention.

Specifically, as shown in FIG. 2, the analysis apparatus 100 may include an input unit 110, a storage unit 120, a communication unit 130, a display unit 140, a control unit 150, and the like.

The input unit 110 receives various kinds of information. That is, the input unit 110 generates input data in response to user input, and may include at least one input means. In particular, the input unit 110 may receive from an administrator or the like the selection of a database, the selection of a table, the selection of classification items, and the settings for the weight and grade of each selected classification item.

For example, the input unit 110 may include a keyboard, a keypad, a dome switch, a touch panel, a touch key, a mouse, or a menu button, but is not limited thereto.

The storage unit 120 stores various kinds of information. That is, the storage unit 120 may store the information, software, and the like required for the operation of the analysis apparatus 100. In particular, the storage unit 120 may store the selection information entered through the input unit 110, the settings for the weight and grade of each selected classification item, the information used to calculate the confidential-information-related indicator (risk indicator), the machine learning model, and the like.

When the machine learning model is trained on the analysis apparatus 100, the storage unit 120 may store the training data for the model. The storage unit 120 may also store the database itself, or the connection information for the database system in which the database is stored.

For example, depending on its type, the storage unit 120 may be a hard disk type, a magnetic media type, a compact disc read-only memory (CD-ROM), an optical media type, a magneto-optical media type, a multimedia card micro type, a flash memory type, a read-only memory (ROM) type, or a random access memory (RAM) type, but is not limited thereto. Depending on its purpose or location, the storage unit 120 may be a cache, a buffer, a main memory, an auxiliary memory, or a separately provided storage system, but is not limited thereto.

The communication unit 130 communicates with other electronic devices such as servers, and may include wired and wireless communication modules for various communication methods. In particular, the communication unit 130 may receive from another electronic device the selection of a database, the selection of a table, the selection of classification items, the settings for the weight and grade of each selected classification item, the machine learning model, and the like.

For example, the communication unit 130 may perform wireless communication such as 5th generation communication (5G), long term evolution-advanced (LTE-A), long term evolution (LTE), Bluetooth, Bluetooth Low Energy (BLE), or near field communication (NFC), and may perform wired communication such as cable communication, but is not limited thereto.

The display unit 140 is a component that displays data produced by the operation of the analysis apparatus 100. For example, the display unit 140 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display, but is not limited thereto. The display unit 140 may also be combined with the input unit 110 and implemented as a touch screen or the like.

The control unit 150 may control execution of the confidential information analysis method described later. That is, the control unit 150 may use a machine learning model to estimate, for each column of a table, whether confidential information is stored and the type of the stored confidential information. The control unit 150 may also control generation of a confidential-information-related index for the table based on the estimated information.

The control unit 150 may also control training of the machine learning model. Training of the machine learning model is described in more detail in the confidential information analysis method below. The confidential information analysis processing or the machine learning model training processing of the control unit 150 may be performed by software (a program) installed in the storage unit 120.

The control unit 150 may control the input unit 110, the storage unit 120, the communication unit 130, the display unit 140, and the like. For example, the control unit 150 may be a processor, or software such as a process or program executed on that processor, but is not limited thereto.

FIG. 3 is a block diagram of the control unit 150 of the confidential information analysis apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 3, the control unit 150 may include an estimation unit 151, a generation unit 152, and a management unit 153. The control unit 150 may further include a learning unit 154, which is needed when the machine learning model is to be trained. For example, the estimation unit 151, the generation unit 152, the management unit 153, and the learning unit 154 may be hardware components included in a processor, or software such as processes or programs executed on that processor, but are not limited thereto.

Hereinafter, a confidential information analysis method according to an embodiment of the present invention will be described.

FIG. 4 is a flowchart of a confidential information analysis method according to an embodiment of the present invention.

As shown in FIG. 4, the confidential information analysis method according to an embodiment of the present invention (hereinafter referred to as “the present analysis method”) may include S201 to S203. Execution of S201 may be controlled by the estimation unit 151, S202 by the generation unit 152, and S203 by the management unit 153.

In S201, a machine learning model is used to estimate (or recommend), for each column of a table in the database, whether confidential information is stored and the type of the stored confidential information. The machine learning model is a model trained in advance by a machine learning technique on the basis of the database's metadata.

The machine learning model may be a model trained by supervised learning on training data consisting of pairs (data sets) of input data and output data. In the training data, the input data includes one or more metadata items selected from the metadata, and the output data includes whether confidential information is stored and the type of confidential information.

Accordingly, the machine learning model holds a function describing the relationship between the input data (one or more metadata items) and the output data (whether confidential information is stored and the type of confidential information), and represents this function with various parameters.

For example, the machine learning technique applied to the machine learning model described above may include Artificial neural network, Boosting, Bayesian statistics, Decision tree, Gaussian process regression, Nearest neighbor algorithm, Support vector machine, Random forests, Symbolic machine learning, Ensembles of classifiers, or Deep learning, but is not limited thereto. Deep learning techniques may include, for example, Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), and Deep Q-Networks, but are not limited thereto. However, because S201 uses a machine learning technique for classification, it may be preferable to use Decision tree or Random forests in order to classify accurately while reducing the size and computational cost of the machine learning model, although the technique is not limited thereto.

For instance, Random forests is an ensemble learning method used for classification, regression analysis, and the like. It operates by outputting the class (for classification) or the mean prediction (for regression analysis) obtained from a large number of decision trees constructed during training. The type of confidential information can be estimated by feeding the collected basic information into this random forest.
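The voting behaviour described above can be sketched in a few lines of Python. This is only an illustration of the aggregation step: the three "trees" are hand-written stand-ins rather than trees learned from training data, and the field names (`column_name`, `column_length`, `normalized_value`) and class labels are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees. Each maps one column's
# metadata record to a confidential-information type. Real trees would be
# learned from labelled training data, not written by hand.
def tree_by_name(col):
    hit = "rrn" in col["column_name"].lower()
    return "resident_registration_number" if hit else "non_confidential"

def tree_by_pattern(col):
    hit = col["normalized_value"] == "000000-0000000"
    return "resident_registration_number" if hit else "non_confidential"

def tree_by_length(col):
    hit = col["column_length"] >= 13
    return "resident_registration_number" if hit else "non_confidential"

TREES = [tree_by_name, tree_by_pattern, tree_by_length]

def forest_predict(col):
    """Return the class chosen by the majority of the trees."""
    votes = Counter(tree(col) for tree in TREES)
    return votes.most_common(1)[0][0]

column = {
    "column_name": "CUST_RRN",
    "column_length": 14,
    "normalized_value": "000000-0000000",
}
print(forest_predict(column))
```

With an odd number of trees and two classes there is always a strict majority; a trained forest would typically use many more trees.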

The machine learning model is trained by a machine learning technique before S201. The model may be trained in the analysis apparatus 100 before S201, or trained in another apparatus before S201, transmitted to the analysis apparatus 100, and stored in the storage unit 120. When the model is trained in the analysis apparatus 100, the learning unit 154 may control the training.

The metadata included in the input data of the training data may include the following items (hereinafter referred to as “metadata items”).

<Metadata items>

- Database name: the name of the database, also referred to as a catalog

- Table name: the name of the table

- Column name: the name of the column

- Column type: the type of the column; various types exist depending on the database, such as string, number, date, Boolean, and binary

- Column length: the maximum length of the data that can be stored in the column

- Normalized stored column value (substituted value): a normalized data value obtained by sampling one of the data items actually stored in the column (that is, when actual data is stored in columns of the same kind, sampling the data of any one of those columns), replacing each digit in that data item with a specific digit (for example, '0'), and replacing each letter with a specific letter (for example, 'a'). For example, the extracted data '123422' is replaced with '000000', and the extracted data 'example' is replaced with 'aaaaaaa'.

- Stored column data length: the actual data length of the [normalized stored column value]

That is, [database name], [table name], [column name], [column type], [column length], [normalized stored column value], and [stored column data length] are metadata items, and at least one of them may be used in the input data of the training data. For efficient classification, it is preferable that [column name], [column type], [column length], and [stored column data length] be included in the input data of the training data, but the input data is not limited thereto.
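As a minimal sketch of how the [normalized stored column value] and [stored column data length] items could be derived from one sampled value, the following Python replaces every digit with '0' and every letter with 'a'. Leaving other characters (such as '-' or '@') unchanged is an assumption, since the text only specifies the treatment of digits and letters; the function names and dictionary keys are hypothetical.

```python
def normalize_sample(value: str) -> str:
    """Replace each digit with '0' and each letter with 'a'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("0")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)  # assumption: punctuation is kept as-is
    return "".join(out)

def build_metadata_record(database, table, column, column_type, column_length, sample):
    """Assemble the seven metadata items for one column."""
    normalized = normalize_sample(sample)
    return {
        "database_name": database,
        "table_name": table,
        "column_name": column,
        "column_type": column_type,
        "column_length": column_length,          # declared maximum length
        "normalized_value": normalized,          # pattern of the sampled value
        "stored_data_length": len(normalized),   # actual length of the sample
    }

record = build_metadata_record("crm", "customer", "zip_code", "string", 10, "123422")
print(record["normalized_value"], record["stored_data_length"])
print(normalize_sample("kim@example.com"))
```

Because only the character pattern survives, the record carries the shape of the data without carrying the confidential value itself.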

In particular, the [normalized stored column value] is a metadata item normalized by replacing the actual data with representative digits or letters, so it may be added to the input data of the training data to reduce the false positive rate of classification. If the likelihood of misclassification is low even without the [normalized stored column value], it may be excluded from the input data. In other words, rather than relying on the metadata alone, the present invention samples actual data through the [normalized stored column value] item and normalizes it into simple letters and digits, which can further improve the quality of the training and classification results of the machine learning model.

The [confidential information type] included in the output data of the training data indicates the type of information classified as confidential, and may include personal information and the like. For example, the [confidential information type] may include [non-confidential information], [resident registration number], [driver's license number], [passport number], [alien registration number], [corporate registration number], [phone number], or [email]. Here, [non-confidential information] indicates that the table contains no confidential information. Because the [confidential information type] includes the [non-confidential information] item, the machine learning model can express in a single classification result whether confidential information is stored at all. That is, if the table contains no confidential information, the machine learning model may output a result estimating [non-confidential information]. Conversely, if the table contains confidential information, the machine learning model outputs a result estimating that one of the remaining items of the [confidential information type], for example [resident registration number], [driver's license number], [passport number], [alien registration number], [corporate registration number], [phone number], or [email], is contained in the table.

By using the machine learning model according to S201, the type of confidential information can be estimated even for columns that the administrator has not yet reviewed. As a result, the administrator can give priority to the columns estimated to hold confidential information when identifying and managing confidential information, which improves management efficiency.

That is, according to S201, the present invention uses the metadata of the database to analyze whether columns contain confidential information, making it easier to grasp the overall storage status of confidential information. The estimation (analysis) in S201 of whether a column stores confidential information and of which type may be performed when the data is stored or through periodic inspection. The aim is to prevent leakage of confidential information or to protect personal information, in compliance with internal security regulations and the security guidelines of supervisory authorities.

In particular, the present invention does not scan all of the data actually stored; it uses metadata to analyze whether columns contain confidential information, so the analysis can be performed more quickly and accurately. For a column determined to be a confidential information column, there is no need to apply confidentiality patterns to the individual stored data items to decide whether each one is confidential; simply counting the data items gives the number of confidential information items.

For example, a resident registration number column normally stores only resident registration numbers, so there is no need to run pattern matching for the various types of confidential information against each stored value. In the present invention, the resident registration number column itself can be estimated to be a confidential information column, and the number of data items stored in that column (the number of resident registration numbers) becomes the number of confidential information items in the column.

A single database contains a large number of tables, and tables and columns are often added temporarily or arbitrarily by various users and developers. In such an environment, it is not easy for a small number of security administrators to keep their knowledge of where confidential information is stored up to date. Moreover, a single company has many databases, and many tables and data sets are created temporarily, which makes it even harder to track and assess what must be managed for security.

To address this, the present invention provides a method that automatically analyzes the metadata of a database, collects the information that is current at the time of analysis, and determines which columns store confidential information so that the result can be used for security management. That is, applying the machine learning technique according to S201 enables automatic recommendation (analysis) of confidential information storage columns and of the types of confidential information.

The flow for determining whether an individual column of a table contains confidential information may be as follows.

(1) If the storage format of the column is one that the administrator has configured to be ignored, such as date, Boolean, or number, the column is set as not being a confidential information storage column.

(2) If the storage format of the column contains unstructured data, the column's data is saved to a separate file and handled by a method for determining the type of confidential information in unstructured data. Because that method is outside the scope of the present invention, the determination for such a column is deferred.

(3) According to S201, the method proposed in the present invention for estimating the type of confidential information stored in a column is applied. However, if the administrator has set a type of confidential information for the column, or has set the column as non-confidential, that setting is followed.

(4) For the type of confidential information estimated in (3) above, if the maximum storage length of the column is too small to hold that type of confidential information, the column is determined not to be a confidential information storage column; if it is large enough, the column is set as a confidential information storage column.
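Steps (1) to (4) above can be sketched as a single function. This is an illustration under stated assumptions, not the patented implementation: the ignored and unstructured type sets, the minimum lengths per confidential type, and the stub estimator standing in for the S201 model are all hypothetical, and the sketch assumes an administrator setting is returned as-is without the length check of step (4).

```python
# Hypothetical configuration; none of these values come from the source text.
IGNORED_TYPES = {"date", "boolean", "number"}   # formats the administrator ignores
UNSTRUCTURED_TYPES = {"blob", "clob"}           # formats treated as unstructured
MIN_LENGTH = {                                  # shortest column that can hold each type
    "resident_registration_number": 13,
    "phone_number": 9,
    "email": 6,
}

def classify_column(col, estimate, admin_settings=None):
    """Apply steps (1) to (4) to one column and return its label."""
    # (1) Formats configured to be ignored are never confidential columns.
    if col["column_type"] in IGNORED_TYPES:
        return "non_confidential"
    # (2) Unstructured data is out of scope, so the decision is deferred.
    if col["column_type"] in UNSTRUCTURED_TYPES:
        return "deferred"
    # (3) An administrator setting, if present, takes precedence over the model.
    override = (admin_settings or {}).get(col["column_name"])
    if override is not None:
        return override
    kind = estimate(col)
    if kind == "non_confidential":
        return kind
    # (4) Reject the estimate when the column is too short to hold that type.
    if col["column_length"] < MIN_LENGTH.get(kind, 0):
        return "non_confidential"
    return kind

def stub_estimate(col):
    # Stand-in for the trained model of S201.
    return "phone_number" if "phone" in col["column_name"].lower() else "non_confidential"

wide = {"column_name": "PHONE", "column_type": "string", "column_length": 20}
narrow = {"column_name": "PHONE_CD", "column_type": "string", "column_length": 2}
print(classify_column(wide, stub_estimate))
print(classify_column(narrow, stub_estimate))
```

The cheap rule-based checks run first, so the model is consulted only for columns that the rules cannot decide.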

Next, in S202, a confidential-information-related index for the table is generated based on the information estimated in S201 (that is, based on the estimate, for each column of the table, of whether confidential information is stored and of its type). The index may also be generated based on the information set according to (4) above, in addition to the information set in (3) above.

The confidential-information-related index is a score expressing how dangerous the table is (its risk level, that is, the risk if the table were leaked). It takes into account whether the table contains confidential information, the types and number of confidential information items it contains, and the extent of the impact if that information were leaked. A higher risk level indicates a greater need to maintain the confidentiality of the column, record, or table containing the confidential information.

That is, S202 may include a step of selecting a plurality of classification items for the table (hereinafter referred to as the “selection step”), a step of setting a weight and a grade for each classification item selected in the selection step (hereinafter referred to as the “setting step”), and a step of calculating the confidential-information-related index for the table based on the weight and grade of each classification item set in the setting step (hereinafter referred to as the “calculation step”). Through S202, an algorithm that generates the confidential-information-related index (risk index) of a table storing confidential information can be executed.

The selection information of the selection step and the setting information of the setting step may be input by an administrator through the input unit 110 (hereinafter referred to as the “first example”) or received through the communication unit 130 (hereinafter referred to as the “second example”); alternatively, the control unit 150 may select them arbitrarily in order to apply a machine learning technique (hereinafter referred to as the “third example”). Accordingly, any active or passive expression regarding the selection step or the setting step in the following description or in the claims may be understood to include at least one of the operations of the first to third examples. Even when the operation of the third example is performed, it may be preferable for the initial selection information and setting information to be selected and set according to the operation of the first or second example.

Execution of the selection step through the calculation step may be controlled by the generation unit 152 of the control unit 150. The generation unit 152 may control these steps using dedicated software (a program) installed in the analysis apparatus 100.

Hereinafter, the selection step through the calculation step will be described in more detail.

First, when a confidential-information-related index is to be generated for a table, a plurality of classification items may be selected in the selection step.

A classification item is an item used to classify a table in more detail according to a classification scheme for tables.

Once a plurality of classification items has been selected in the selection step, the weight and grade of each classification item are set in the subsequent setting step. The weight refers to the degree of importance (score) of a classification item relative to the other selected classification items. That is, the weight may be assigned according to the importance the item holds among the selected classification items. The grade refers to the degree (score) to which the classification item applies to the table.

The weight and the grade may each be set as a number (score), with a larger weight or grade corresponding to a higher number. For ease of setting, the grades of all selected classification items may be set within the same range, for example 1 to 10. However, if a classification item was selected in the selection step but is unrelated to the table, or its grade is difficult or ambiguous to set, the grade may be left unspecified in the setting step and set to a value outside that range (for example, 0).

After a plurality of classification items has been selected and the weight and grade of each selected item have been set in the selection and setting steps, an individual classification score for each classification item may be calculated in the calculation step using [Equation 1] and [Equation 2] below. The confidential-information-related index may then be derived from the sum of the calculated individual classification scores.

[Equation 1]

In [Equation 1], i is a natural number, G_i is the weight index of the i-th classification item, W_i is the weight of the i-th classification item, W_S is the weight sum (the sum taken over the classification items whose grade is set to a non-zero value), and C is a constant.

[Equation 2]

In [Equation 2], S_i is the individual classification score of the i-th classification item, and R_i is the grade set for the i-th classification item.
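The equation bodies are not reproduced in this text, so the following Python sketch rests on an assumed form that is merely consistent with the symbol definitions: G_i = C × W_i / W_S for the weight index and S_i = G_i × R_i for the individual classification score, with W_S taken as the sum of the weights of the items whose grade is non-zero. The value C = 10 is likewise an illustrative choice; with grades of 1 to 10 it places the index on a 0 to 100 scale.

```python
def confidentiality_index(items, c=10.0):
    """Sum the individual classification scores of one table.

    items: list of (weight W_i, grade R_i) pairs, one per classification item.
    c:     the constant C (10.0 here is an illustrative assumption).
    """
    # Items with grade 0 (unspecified) are left out of the weight sum W_S.
    active = [(w, r) for w, r in items if r != 0]
    w_s = sum(w for w, _ in active)
    if w_s == 0:
        return 0.0
    total = 0.0
    for w, r in active:
        g = c * w / w_s      # assumed form of the weight index G_i
        total += g * r       # assumed form of the individual score S_i
    return total

# Three selected items; the third has an unspecified grade and is ignored.
print(confidentiality_index([(3, 10), (1, 10), (2, 0)]))
```

Under this assumed form, normalizing by W_S means that leaving some items unspecified does not lower the attainable maximum: a table whose graded items are all at grade 10 still scores 100.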

<Example of generating a confidential-information-related index>

For example, to generate a confidential-information-related index concerning organization members, the classification items that can be selected in the selection step may include [column grade], [column count grade], [record count grade], [table grade], [exposure grade], [department grade], or [other grade], but are not limited thereto.

The [column grade] classification item is an item whose grade is set according to the risk level of the highest-risk column among the columns estimated in S201 to store confidential information (hereinafter referred to as “confidential information columns”). That is, when there are several confidential information columns, this item evaluates the risk of the one with the highest risk. The risk level of an individual confidential information column is set in advance by an administrator or the like for each confidential information classification. For example, the risk value (grade) of an individual confidential information column may be set in the range of 0 to 10, with values closer to 10 for higher risk; if unspecified, 0 may be set. After the risk levels of the individual confidential information columns have been set in this way, the highest of them may be set as the grade of the [column grade] item. In other words, [column grade] is an item that reflects the influence of the most dangerous confidential information.
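As a small illustration of the [column grade] rule, the sketch below takes the maximum of the per-type risk values over a table's confidential information columns. The risk values in `RISK_BY_TYPE` and the type labels are hypothetical; in the described method an administrator would set them in advance.

```python
# Hypothetical administrator-defined risk values (0 to 10) per confidential type.
RISK_BY_TYPE = {
    "resident_registration_number": 10,
    "passport_number": 8,
    "phone_number": 5,
    "email": 4,
}

def column_grade(estimated_types):
    """Return the highest risk among the table's confidential information columns.

    estimated_types: the type estimated in S201 for each column of the table.
    Types without a configured risk count as 0 (unspecified).
    """
    risks = [
        RISK_BY_TYPE.get(t, 0)
        for t in estimated_types
        if t != "non_confidential"
    ]
    return max(risks, default=0)

print(column_grade(["non_confidential", "email", "passport_number"]))
```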

The [column count grade] classification item is an item whose grade is set according to the number of confidential information columns. The more confidential information columns there are, the higher the risk, and this item reflects that tendency. For example, the [column count grade] may be set in the range of 0 to 10, with values closer to 10 for a larger number of confidential information columns; if unspecified, 0 may be set. The grade may be set by multiplying the number of confidential information columns by a specific correction value.

The [record count grade] classification item is an item whose grade is set according to the number of records that include a confidential information column. The more such records there are, the higher the risk, and this item reflects that tendency. For example, the [record count grade] may be set in the range of 0 to 10, with values closer to 10 for a larger number of records; if unspecified, 0 may be set. The grade may be set to the value obtained by taking log10 of the number of such records, squaring it, and rounding up any fractional part.
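The log10-square-round-up rule for the [record count grade] can be written directly. Capping the result at 10 is an assumption made here so that the grade stays within the stated 0 to 10 range (the uncapped value already exceeds 10 at about 1,500 records); treating a non-positive record count as 0 is also an assumption.

```python
import math

def record_count_grade(record_count: int) -> int:
    """Grade = ceil(log10(n) ** 2), capped at 10 (the cap is an assumption)."""
    if record_count <= 0:
        return 0  # assumption: no records means the grade is unspecified
    raw = math.ceil(math.log10(record_count) ** 2)
    return min(10, raw)

for n in (100, 500, 5000):
    print(n, record_count_grade(n))
```

Squaring the logarithm makes the grade rise quickly: 100 records already give grade 4, and 500 records give grade 8.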

The [Table Grade] classification item is graded according to the importance of the table, that is, according to the risk of the table. For example, [Table Grade] may be set in the range of 0 to 10, closer to 10 as the importance (risk) of the table increases; if unspecified, 0 may be set. The grade may be set by taking log10 of the number of records, squaring the result, and rounding up to an integer. Because a table in itself contains a large amount of information, [Table Grade] may be a relatively important item among the classification items, and it may therefore be desirable to give it a medium or higher weight relative to the other classification items.

The [Exposure Grade] classification item is graded according to whether the table is disclosed externally or according to who uses the table. When the table has already been disclosed externally, its risk is low; conversely, when the table has not been disclosed externally, its risk tends to be high. Likewise, when the table is used only by specific users within an organization, its risk is high, and when it is used by many users within the organization, its risk tends to be low. [Exposure Grade] reflects these tendencies. For example, [Exposure Grade] may be set in the range of 1 to 10, closer to 10 as the risk associated with the exposure level increases; if unspecified, 0 may be set.

The [Department Grade] classification item is graded according to the department that uses the table within an organization. It evaluates the degree of confidentiality of the information handled by the department that manages and uses the table, or the likelihood of a security (leakage) incident in that department. In other words, it reflects the tendency that the confidentiality of the handled information and the likelihood of security incidents differ from department to department. For example, [Department Grade] may be set in the range of 1 to 10, closer to 10 as the risk associated with the department increases; if unspecified, 0 may be set.

The [Other Contents] classification item may be added in addition to the classification items described above and covers content that may be important for judging the risk of confidential information leaking from the table. For example, when [Other Contents] is selected in the selection step, its grade is set in the range of 1 to 10 in the setting step, closer to 10 as the risk increases. If unspecified, the grade of [Other Contents] may be set to 0.

A grade of 0 for any of the classification items described above may mean that the classification item is not relevant to the table. This makes it possible to pick and use, from among the many classification items, only those that relate to the table.

Meanwhile, in the setting step, guidance on how to perform the setting may be displayed on the display unit 140 when the weights of the classification items are set.

[Table 1] below shows, for a given table, the classification items (G_i) selected in the selection step, namely [Column Grade], [Column Count Grade], [Record Count Grade], [Table Grade], [Exposure Grade], and [Department Grade], together with the weight (W_i) and grade (R_i) set for each classification item in the setting step.

[Table 1]

| No (i) | Classification item | Weight (W_i) | Grade (R_i) | Note |
|---|---|---|---|---|
| 1 | Column Grade | 20 | 5 | R_1: unspecified 0, min 1 to max 10 |
| 2 | Column Count Grade | 15 | 5 | R_2: unspecified 0, min 1 to max 10 |
| 3 | Record Count Grade | 10 | 9 | R_3: unspecified 0, min 1 to max 10 |
| 4 | Table Grade | 5 | 10 | R_4: unspecified 0, min 1 to max 10 |
| 5 | Exposure Grade | 5 | 0 | R_5: unspecified 0, min 1 to max 10 |
| 6 | Department Grade | 5 | 0 | R_6: unspecified 0, min 1 to max 10 |

[Table 2] below shows the individual classification score of each classification item and the confidential information related index (risk indicator), as calculated in the calculation step using the settings of [Table 1].

[Table 2]

| No (i) | Classification item | Weight (W_i) | Grade (R_i) | Weight indicator (G_i) | Individual classification score (S_i) |
|---|---|---|---|---|---|
| 1 | Column Grade | 20 | 5 | 20 / 50 × 10 = 4 | 5 × 4 = 20 |
| 2 | Column Count Grade | 15 | 5 | 15 / 50 × 10 = 3 | 5 × 3 = 15 |
| 3 | Record Count Grade | 10 | 9 | 10 / 50 × 10 = 2 | 9 × 2 = 18 |
| 4 | Table Grade | 5 | 10 | 5 / 50 × 10 = 1 | 10 × 1 = 10 |
| 5 | Exposure Grade | 5 | 0 | - | - |
| 6 | Department Grade | 5 | 0 | - | - |
| Sum | | | | 10 | 63 (risk indicator) |

The contents of [Table 1] and [Table 2] can be summarized as follows.

- Classification items selected in the selection step: the classification items that affect the risk of the confidential information in the table

- Number of classification items selected in the selection step = 6

- Number of classification items whose grade is not left unspecified in the setting step (hereinafter, "used classification items") = 4

- Weight sum (W_S): the sum of the weights of all used classification items, that is, those whose grade is not 0 = 20 + 15 + 10 + 5 = 50

- Setting the weight of a used classification item: assign a weight value by comparison with the other used classification items while taking the weight sum (W_S) into account (the weight values of all used classification items must add up to the weight sum)

- Weight indicator (G_i) = set weight (W_i) / weight sum (W_S) × constant (C) (for example, C may be a value such as 10)

- Individual classification score (S_i) of each used classification item = set grade (R_i) × weight indicator (G_i) = [20, 15, 18, 10]

- Risk indicator = sum of the individual classification scores (S_i) = 20 + 15 + 18 + 10 = 63 (the risk indicator is 0 if unspecified, and otherwise ranges from a minimum of 1 to a maximum of 100)
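
The calculation summarized in the list above can be sketched in a few lines. The function name `risk_indicator` and the list-based input layout are hypothetical; the formulas G_i = W_i / W_S × C and S_i = R_i × G_i, and the sample weights and grades, are taken from [Table 1] and [Table 2].

```python
def risk_indicator(weights, grades, c=10):
    """Return (weight indicators, individual scores, risk indicator).

    Items whose grade is 0 (unspecified) are excluded from the weight sum
    and get None for both their weight indicator and their score.
    """
    used = [i for i, r in enumerate(grades) if r != 0]
    weight_sum = sum(weights[i] for i in used)  # W_S
    indicators = [None] * len(weights)
    scores = [None] * len(weights)
    for i in used:
        # G_i = W_i / W_S * C, evaluated as W_i * C / W_S (same value).
        indicators[i] = weights[i] * c / weight_sum
        # S_i = R_i * G_i
        scores[i] = grades[i] * indicators[i]
    total = sum(scores[i] for i in used)  # 0 when no item is used
    return indicators, scores, total


# Values from [Table 1]: six selected items, two of them unspecified.
weights = [20, 15, 10, 5, 5, 5]
grades = [5, 5, 9, 10, 0, 0]
indicators, scores, total = risk_indicator(weights, grades)
print(indicators, scores, total)
```

With the [Table 1] settings this reproduces the weight indicators 4, 3, 2, 1, the scores 20, 15, 18, 10, and the risk indicator 63 of [Table 2]. Because the weight indicators of the used items always add up to C and each grade is at most 10, the result cannot exceed 10 × C, which is 100 for C = 10.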

Thereafter, in S203, information on the management priority of each table may be provided using the confidential information related indexes (risk indicators) generated for a plurality of tables. A feature of this risk indicator generation algorithm is that administrators can tune it by choosing the items and weights that are meaningful to them.

For example, the confidential information related index calculated in S202 may have a score of 1 to 100, where a larger value means a higher risk for the table. Time-consuming work on managing confidential information and justifying its use can therefore be made more efficient by using this index. That is, depending on the available management capacity, the risk indicator can serve as the criterion for prioritizing table management.

For instance, the confidential information related index calculated in S202 may be used to set priorities among the tables to be managed. That is, information on the management priority of each table can be provided by listing summary information for each table in descending order of the confidential information related index.

In addition, when the confidential information related index calculated in S202 deviates from a reference index, the table with that index may be selected and reported as a candidate for priority management. Here, the reference index may be a specific value of the confidential information related index.
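
As a rough illustration of S203, the tables can be sorted by their index and compared against a reference value. The table names, the reference value of 60, and the reading of "deviates from the reference index" as "exceeds it" are assumptions made for this sketch.

```python
def prioritize(table_indexes, reference=60):
    """Rank tables by risk indicator and pick priority-management candidates.

    table_indexes: mapping of table name -> confidential information related index.
    reference: hypothetical reference index; tables above it become candidates.
    """
    # Highest index first, so the riskiest tables are listed at the top.
    ranked = sorted(table_indexes.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [name for name, score in ranked if score > reference]
    return ranked, candidates


# Hypothetical tables and indexes, for illustration only.
ranked, candidates = prioritize(
    {"orders": 35, "customers": 63, "logs": 0, "employees": 82}
)
```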

The present invention proposes a way to identify types of confidential information by analyzing an organization's scattered confidential information in an IT environment. In particular, for systems such as database systems that store structured data, it proposes determining the type of confidential information by applying a machine learning technique to the metadata that describes the storage structure. It also proposes making the determination of the confidential information type more efficient by extracting one representative item of actual data, normalizing it, and adding it to the input data of the machine learning training data.
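
The normalization of a representative sample can be sketched as below, following the substitution value described in the claims: every digit is replaced with one designated digit and every character with one designated character. The designated tokens `9` and `A`, the function name, and the choice to leave punctuation unchanged are assumptions for illustration.

```python
def substitution_value(sample, digit_token="9", char_token="A"):
    """Normalize one sample value into a format pattern.

    Each digit becomes digit_token and each letter (including Hangul
    syllables) becomes char_token; other symbols are kept as they are,
    which is an assumption of this sketch.
    """
    out = []
    for ch in sample:
        if ch.isdigit():
            out.append(digit_token)
        elif ch.isalpha():
            out.append(char_token)
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions a phone number such as `010-1234-5678` becomes `999-9999-9999`, so the model input carries the format of the stored data rather than the confidential value itself.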

The present invention also proposes an algorithm that generates a quantified risk indicator from the detected types of confidential information, which allows management work to be carried out efficiently and effectively. The present invention may thus belong to the fields of information security, leakage prevention, personal information protection, security management, and security auditing.

The present invention can generate a confidential information related index (risk indicator) based on the number of types of confidential information, the risk of each type of confidential information, the number of confidential information items, whether the information storage server is disclosed externally, the group it belongs to, its management status, and the like, and this index can be used to select management targets effectively. That is, the present invention proposes a method of analyzing a system that stores information in order to detect the types of confidential information and determine their quantity, and an algorithm that generates a risk indicator for a table. The algorithm can be used for preventing leakage of confidential information and protecting personal information, and for complying with internal security regulations and the security guidelines of supervisory authorities.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. The scope of the present invention is therefore not limited to the described embodiments and should be defined by the following claims and their equivalents.

100: confidential information analysis device 110: input unit
120: storage unit 130: communication unit
140: display unit 150: control unit
151: estimation unit 152: generation unit
153: management unit 154: learning unit

Claims

A method for analysis related to whether confidential information is stored in a table included in a database, the method being performed by an electronic device capable of computing and comprising:
estimating, for each column of the table, whether confidential information is stored and the type of the stored confidential information, using a machine learning model pre-trained by a machine learning technique based on metadata of the database; and
generating a confidential information related index for the table based on the estimated information,
wherein the generating of the index includes:
setting weights and grades of a plurality of classification items with respect to the table; and
calculating the confidential information related index for the table based on the set weight and grade of each classification item,
and wherein the classification items include at least one selected from the group comprising:
a first classification item whose grade is set according to the risk of the column having the highest risk among the columns estimated to store confidential information (confidential information columns);
a second classification item whose grade is set according to the number of confidential information columns;
a third classification item whose grade is set according to the number of records including a confidential information column;
a fourth classification item whose grade is set according to the importance of the table;
a fifth classification item whose grade is set according to whether the table is disclosed externally or according to the users of the table; and
a sixth classification item whose grade is set according to the department using the table.

The method of claim 1,
wherein the machine learning model is trained using training data including pairs of input data and output data, the input data including one or more metadata items selected from the metadata, and the output data including whether confidential information is stored and the type of the confidential information.

The method of claim 2,
wherein the metadata items include at least one selected from a column name, a column type, a column length, and a stored column data length.

The method of claim 2,
wherein the metadata items include a substitution value,
and the substitution value is a value obtained, for one sample data item extracted from the data actually stored in the column, by replacing each digit constituting a number with the same designated numeric value when the sample data includes a number, and by replacing each syllable character constituting text with the same designated character value when the sample data includes text.

(Deleted)

The method of claim 1,
further comprising providing information on the management priority of each table by using the confidential information related indexes generated for a plurality of tables.

A device for performing analysis related to whether confidential information is stored in a table included in a database, the device comprising:
a storage unit that stores a machine learning model pre-trained by a machine learning technique based on metadata of the database; and
a control unit that estimates, for each column of the table, whether confidential information is stored and the type of the stored confidential information using the machine learning model, and performs control to generate a confidential information related index for the table based on the estimated information,
wherein, when performing control to generate the index, the control unit sets weights and grades of a plurality of classification items with respect to the table and calculates the confidential information related index for the table based on the set weight and grade of each classification item,
and wherein the classification items include at least one selected from the group comprising:
a first classification item whose grade is set according to the risk of the column having the highest risk among the columns estimated to store confidential information (confidential information columns);
a second classification item whose grade is set according to the number of confidential information columns;
a third classification item whose grade is set according to the number of records including a confidential information column;
a fourth classification item whose grade is set according to the importance of the table;
a fifth classification item whose grade is set according to whether the table is disclosed externally or according to the users of the table; and
a sixth classification item whose grade is set according to the department using the table.