KR102563059B1

KR102563059B1 - System for generating graph-based training data for cyber threat detection and method thereof

Info

Publication number: KR102563059B1
Application number: KR1020210009353A
Authority: KR
Inventors: 이창훈; 공성현
Original assignee: 서울과학기술대학교 산학협력단
Priority date: 2020-11-25
Filing date: 2021-01-22
Publication date: 2023-08-04
Also published as: KR20220072697A

Abstract

A training data generation device according to an embodiment of the present invention includes: a data collection unit that collects indicator of compromise (IoC) data of a system to be protected and security vulnerability data corresponding to the IoC data; a feature information generation unit that generates feature data by detecting metadata included in the IoC data and the security vulnerability data; and a training data generation unit that generates training data based on the cross-reference relationships between the IoC data and the security vulnerability data and on the detected feature data.

Description

Graph-based training data generation device for cyber threat detection

The present invention relates to a graph-based training data generation device for cyber threat detection, which generates a training dataset for training a deep learning model used to detect cyber attacks.

With advances in artificial intelligence, machine learning and deep learning are being applied to a wide range of practical technical problems. In cybersecurity as well, techniques that use large datasets to detect, predict, and respond to cyber attacks effectively are actively emerging, and these techniques fundamentally require high-quality datasets related to cyber attacks.

An indicator of compromise (IoC) is a piece of evidence observed during a cyber attack from which the attacker's behavior can be inferred, such as the attacker's IP address, the hash value of the malware the attacker used, the domain used to distribute the malware, or the port number the attacker used to reach the target. As cyber attacks grow broader in scope and larger in scale, IoC data collected in bulk is being used as a core dataset for AI-based cybersecurity technology.

However, an IoC is a fragmentary record of an attacker's behavior, and it is collected in a form too limited to reconstruct the complex sequence of events that makes up a cyber attack. Moreover, because advanced malware continuously alters the indicators that can be observed while it carries out an attack, a combination of fragmentary IoC information is in practice insufficient for responding effectively to cyber attacks. A dataset is therefore needed that goes beyond the limits of fragmentary IoCs and can represent a cyber attack, which is a procedural and complex event, from multiple perspectives.

An embodiment provides graph-based training data for cyber threat detection.

The problem to be solved by the embodiment is not limited to this, and also includes objects and effects that can be understood from the solutions and embodiments described below.

A training data generation device according to an embodiment of the present invention includes: a data collection unit that collects indicator of compromise (IoC) data of a system to be protected and security vulnerability data corresponding to the IoC data; a feature information generation unit that generates feature data by detecting metadata included in the IoC data and the security vulnerability data; and a training data generation unit that generates training data based on the cross-reference relationships between the IoC data and the security vulnerability data and on the detected feature data.

The training data generation unit may generate, using the detected metadata, a feature matrix for each of the IoC data and the security vulnerability data; select the feature matrices corresponding to two data items of different types from at least one of the IoC data and the security vulnerability data; select the cross-reference relationship matrix corresponding to the two selected data items; and generate, using the selected feature matrices and the cross-reference relationship matrix, graph data representing the association between the two selected data items.

The feature matrix may have rows and columns corresponding to the number of metadata entries and the number of attributes of the metadata.

The training data generation unit may generate a plurality of graph data, each corresponding to two data items of different types from at least one of the IoC data and the security vulnerability data, and may generate the training data by matrix multiplication of the plurality of graph data.

The device may further include a training data management unit that stores and manages the training data.

According to the embodiment, the various associations that make up a complex event can be represented using IoC data and security vulnerability data, which are fragmentary log data, and this can improve the learning performance of an algorithm.

The various beneficial advantages and effects of the present invention are not limited to those described above, and will be more readily understood from the description of specific embodiments of the present invention.

FIG. 1 is a diagram showing a training data generation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a training data generation device according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of metadata according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of metadata according to an embodiment of the present invention.
FIG. 5 is a diagram showing the cross-reference relationships between IoC data and security vulnerability data according to an embodiment of the present invention.
FIG. 6 is a diagram showing a training data generation method according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments, and specific embodiments are illustrated in the drawings and described below. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms including ordinal numbers, such as second and first, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, a second component may be called a first component, and similarly a first component may be called a second component, without departing from the scope of the present invention. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Identical or corresponding components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted.

FIG. 1 is a diagram showing a training data generation system according to an embodiment of the present invention.

Referring to FIG. 1, the training data generation system according to an embodiment of the present invention may include a system to be protected 100, an OSINT server 200, a training data generation device 300, and a machine-learning-based cyber threat detection/response device 400.

The system to be protected 100 may be any system that is to be protected against cyber attacks. For example, the system to be protected 100 may be a computer system used by an individual or a company.

The system to be protected 100 may include a firewall 110 and an antivirus program 120 to protect itself against cyber attacks. The firewall 110 is a system that, given that all of a company's or organization's information is stored on computers, blocks unauthorized access to the information and communication network from outside to inside and from inside to outside in order to secure the information on those computers. The antivirus program 120 is software for detecting, remediating, and defending against malware such as computer viruses.

The system to be protected 100 may observe cyber attacks while blocking, remediating, and defending against them using the firewall 110 and the antivirus program 120. Using the firewall 110 and the antivirus program 120, the system to be protected 100 may generate indicator of compromise (IoC) data, which is observation information about cyber attacks. The system to be protected 100 may transmit the IoC data to the training data generation device 300.

FIG. 1 shows the firewall 110 and the antivirus program 120 as examples of security systems included in the system to be protected 100, but the embodiment is not limited to these. In addition to the firewall 110 and the antivirus program 120, the security systems included in the system to be protected 100 may include various other systems capable of blocking, remediating, and defending against cyber attacks.

The OSINT server 200 may be a server that stores and shares publicly collected information. OSINT (Open Source INTelligence) may refer to information that intelligence agencies and similar bodies have lawfully collected from public sources. The OSINT server 200 may include databases for various types of information. According to an embodiment of the present invention, the OSINT server 200 may include a security vulnerability database 210. The security vulnerability database 210 stores security vulnerability data. A security vulnerability is a weakness that a cyber attacker can exploit to reduce a system's information assurance. The OSINT server 200 may transmit security vulnerability data at the request of the training data generation device 300.

The training data generation device 300 may generate training data based on the IoC data received from the system to be protected 100 and the security vulnerability data received from the OSINT server 200. The training data may be data represented as a graph. The training data generation device 300 is described in detail below with reference to the drawings.

The machine-learning-based cyber threat detection/response device 400 receives the training data from the training data generation device 300 and performs learning using the received training data.

FIG. 2 is a diagram showing a training data generation device according to an embodiment of the present invention.

Referring to FIG. 2, the training data generation device 300 according to an embodiment of the present invention includes a data collection unit 310, a feature information generation unit 320, and a training data generation unit 330, and may further include a training data management unit 340.

The data collection unit 310 may collect indicator of compromise (IoC) data of the system to be protected 100 and security vulnerability data corresponding to the IoC data.

The feature information generation unit 320 may generate feature data by detecting metadata included in the IoC data and the security vulnerability data. For example, when a file containing malware is found in the system to be protected 100, the data collection unit 310 may collect IoC data (for example, the hash value of the malware file). From this, the feature information generation unit 320 may generate feature information such as the type of the malware (for example, ransomware) and the attack technique (for example, object injection). Likewise, for security vulnerability data, the feature information generation unit 320 may detect the included metadata and generate feature information about the type, the attack technique, and so on.

The training data generation unit 330 may generate training data based on the cross-reference relationships between the IoC data and the security vulnerability data and on the detected feature data.

Specifically, the training data generation unit 330 may use the detected metadata to generate a feature matrix for each of the IoC data and the security vulnerability data. The feature matrix may have rows and columns corresponding to the number of metadata entries and the number of attributes of the metadata. The training data generation unit 330 may select the feature matrices corresponding to two data items of different types from at least one of the IoC data and the security vulnerability data. The training data generation unit 330 may select the cross-reference relationship matrix corresponding to the two selected data items. Using the selected feature matrices and the cross-reference relationship matrix, the training data generation unit 330 may generate graph data representing the association between the two selected data items.

The training data generation unit 330 may also generate a plurality of graph data, each corresponding to two data items of different types from at least one of the IoC data and the security vulnerability data. The training data generation unit 330 may generate the training data by matrix multiplication of the plurality of graph data.

The training data management unit 340 may store and manage the training data.

FIG. 3 is a diagram showing an example of metadata according to an embodiment of the present invention.

Referring to FIG. 3, the system to be protected 100 may generate various types of IoC data using the firewall 110 and the antivirus program 120.

First, using the firewall 110, the system to be protected 100 may generate various types of IoC data such as IP data, Protocol data, Port data, Domain data, and Email data. FIG. 3 shows five kinds of IoC data generated using the firewall 110, but the embodiment is not limited to these.

In addition, using the antivirus program 120, the system to be protected 100 may generate various types of IoC data such as Hash data and DLL data. FIG. 3 shows two kinds of IoC data generated using the antivirus program 120, but the embodiment is not limited to these.

The OSINT server 200 may store various types of security vulnerability data collected from public sources, such as CVE data, CWE data, and CAPEC data. CVE (Common Vulnerabilities and Exposures) data may be data on a list of common identifiers for publicly known security vulnerabilities. CWE (Common Weakness Enumeration) data is a list of software weaknesses and may be data that defines source code vulnerabilities. CAPEC (Common Attack Pattern Enumeration and Classification) data may be data that classifies attack patterns against security vulnerabilities. FIG. 3 shows only three kinds of security vulnerability data, but the embodiment is not limited to these.

FIG. 4 is a diagram showing an example of metadata according to an embodiment of the present invention.

Referring to FIG. 4, the IoC data may include, for each type, a plurality of metadata items that represent the characteristics of that data.

For example, IP data may include metadata on the country code to which the IP address is assigned (Country code), metadata on the latitude of that region (Latitude), metadata on the longitude of that region (Longitude), and metadata on the time at which the IP address was assigned (Resolved time).

As another example, Hash data may include metadata on the DLL files called by the malware (Number of related DLLs), metadata on the size of the malware (Byte size), and metadata on the time at which the malware was observed (Number of AV detections).

As seen above, the metadata of IoC data may differ depending on the type of the IoC data.

The feature information generation unit 320 according to an embodiment of the present invention may detect metadata such as the above from the various types of IoC data, and may use the detected metadata to generate feature data for each IoC data item.
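The step from detected metadata to a feature matrix can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that each row is one metadata entry (for example, one observed IP address) and each column is one numeric attribute. The record values, the attribute names, and the helper `build_feature_matrix` are hypothetical, and the categorical Country code attribute is left out because the description does not say how it would be encoded.

```python
from typing import Dict, List

# Hypothetical IP-type IoC records. The attribute names follow the metadata
# listed for FIG. 4; the values are invented for illustration only.
ip_records: List[Dict[str, float]] = [
    {"latitude": 37.57, "longitude": 126.98, "resolved_time": 1609459200.0},
    {"latitude": 35.68, "longitude": 139.69, "resolved_time": 1609545600.0},
]


def build_feature_matrix(records: List[Dict[str, float]],
                         attributes: List[str]) -> List[List[float]]:
    """Return a matrix with one row per record and one column per attribute."""
    return [[float(record[name]) for name in attributes] for record in records]


ip_attributes = ["latitude", "longitude", "resolved_time"]
alpha = build_feature_matrix(ip_records, ip_attributes)

# Two entries and three attributes give a 2 x 3 feature matrix.
print(len(alpha), "x", len(alpha[0]))
```

Under this reading, the number of rows corresponds to the number of metadata entries and the number of columns to the number of attributes.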

Although not shown in FIG. 4, security vulnerability data may likewise include, for each type, a plurality of metadata items that represent the characteristics of that data. The feature information generation unit 320 according to an embodiment of the present invention may detect metadata from the various types of security vulnerability data, and may use the detected metadata to generate feature data for each security vulnerability data item.

FIG. 5 is a diagram showing the cross-reference relationships between IoC data and security vulnerability data according to an embodiment of the present invention.

Referring to FIG. 5, IoC data and security vulnerability data may have cross-reference relationships with one another. Here, a reference relationship may mean an association between data of one type and data of another type.

According to one embodiment, IP data (I) among the IoC data carries the domain name of the site associated with the IP address, and may therefore have an association with Domain data (D). According to another embodiment, Hash data (H) among the IoC data carries the hash value of a piece of malware, and because that malware calls dynamic link library (DLL) files, the Hash data may have an association with DLL data (L). In this way, one type of IoC data (or security vulnerability data) may be associated with another type of IoC data (or security vulnerability data). The training data generation device 300 according to an embodiment of the present invention generates training data using these cross-reference relationships (associations).

The cross-reference relationships described above can be represented by a cross-reference relationship matrix. The cross-reference relationship matrix may be an adjacency matrix (A). The cross-reference relationship between a first type of data (I) and a second type of data (J) among the IoC data (or security vulnerability data) can be expressed as Equation 1 below.

Here, i denotes metadata of the first type of data (I), and j denotes metadata of the second type of data (J).

Accordingly, if the number of metadata entries of the first type of data (I) is n_I and the number of metadata entries of the second type of data (J) is n_J, the cross-reference relationship matrix (A_{I,J}) for the first type of data (I) and the second type of data (J) can be expressed as an n_I×n_J matrix.
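As an illustration of such a matrix, the sketch below builds an n_I×n_J adjacency matrix from a list of reference pairs. Equation 1 is not reproduced in this text, so the binary convention (1 where entry i references entry j, 0 otherwise) is an assumption, and the function name `build_cross_reference_matrix` and the sample pairs are hypothetical.

```python
from typing import List, Tuple


def build_cross_reference_matrix(n_i: int, n_j: int,
                                 references: List[Tuple[int, int]]) -> List[List[int]]:
    """Return an n_i x n_j matrix with 1 at every (i, j) listed in references."""
    matrix = [[0] * n_j for _ in range(n_i)]
    for i, j in references:
        matrix[i][j] = 1  # assumed convention: 1 marks a cross-reference
    return matrix


# Hypothetical example: 2 IP entries (I) and 3 Domain entries (D).
# IP 0 resolves to domains 0 and 2; IP 1 resolves to domain 1.
ip_domain_refs = [(0, 0), (0, 2), (1, 1)]
A_ID = build_cross_reference_matrix(2, 3, ip_domain_refs)

for row in A_ID:
    print(row)
```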

Using the feature matrices generated from the feature data and the cross-reference relationship matrix above, the training data generation unit 330 may generate graph data representing the association between two types of data. The graph data can be expressed as Equation 2 below.

Here, α denotes the feature matrix of the first type of data (I), and β denotes the feature matrix of the second type of data (J).

A feature matrix may have a size determined by the number of metadata entries and the number of attributes of the metadata. For example, if the number of metadata entries of the first type of data (I) is n_I and the number of metadata attributes of the first type of data (I) is n_α, the feature matrix (α) of the first type of data (I) can be expressed as an n_I×n_α matrix. Likewise, if the number of metadata entries of the second type of data (J) is n_J and the number of metadata attributes of the second type of data (J) is n_β, the feature matrix (β) of the second type of data (J) can be expressed as an n_J×n_β matrix.

Accordingly, the graph data representing the association between the first type of data (I) and the second type of data (J) can be expressed as an n_α×n_β matrix.
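One way to obtain an n_α×n_β result from an n_I×n_α feature matrix, an n_I×n_J cross-reference matrix, and an n_J×n_β feature matrix is the product of the transposed first feature matrix, the cross-reference matrix, and the second feature matrix. Equation 2 is not reproduced in this text, so this form is an assumption chosen only because its dimensions match the description. The helper names and the small input values below are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(column) for column in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product with a dimension check."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


def graph_data(alpha: Matrix, adjacency: Matrix, beta: Matrix) -> Matrix:
    """Assumed form: transpose(alpha) * adjacency * beta."""
    return matmul(matmul(transpose(alpha), adjacency), beta)


alpha = [[1.0, 2.0], [3.0, 4.0]]      # n_I = 2 entries, n_alpha = 2 attributes
adjacency = [[1, 0, 1], [0, 1, 0]]    # n_I = 2 by n_J = 3
beta = [[1.0], [2.0], [3.0]]          # n_J = 3 entries, n_beta = 1 attribute

G = graph_data(alpha, adjacency, beta)
print(G)  # a 2 x 1 matrix, i.e. n_alpha x n_beta
```

The result no longer has one row per observed entry. Its rows and columns index attributes of the two data types, which is what the stated n_α×n_β size implies.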

Meanwhile, there may be a plurality of graph data. For example, when the collected data are Port data (O), IP data (I) and Hash data (H), the graph data may include first graph data between the Port data (O) and the IP data (I), and second graph data between the IP data (I) and the Hash data (H). Since there is no cross-reference relationship between the Port data (O) and the Hash data (H), graph data between the Port data (O) and the Hash data (H) may not be generated. That is, graph data is generated on the basis of cross-reference relationships.
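The selection of type pairs in this example can be sketched as follows. The sketch assumes that a pair yields graph data only when a cross-reference relationship matrix exists for it and contains at least one nonzero entry. The dictionary layout, the function name and the matrix values are hypothetical.

```python
from itertools import combinations


def pairs_with_graph(types, cross_refs):
    """Return the type pairs that have a non-empty cross-reference matrix."""
    selected = []
    for a, b in combinations(types, 2):
        matrix = cross_refs.get((a, b))
        if matrix and any(any(row) for row in matrix):
            selected.append((a, b))
    return selected


types = ["Port", "IP", "Hash"]
cross_refs = {
    ("Port", "IP"): [[1, 0], [0, 1]],
    ("IP", "Hash"): [[1, 1], [0, 1]],
    # No ("Port", "Hash") entry: these two types do not reference each other.
}

print(pairs_with_graph(types, cross_refs))  # [('Port', 'IP'), ('IP', 'Hash')]
```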

The learning data generation unit 330 may generate learning data (D_{I,J,K,…,Y,Z}) through matrix multiplication of the plurality of graph data. This can be expressed as in Equation 3 below.

Since the learning data is a matrix product of graph data, the learning data can itself be represented in the form of graph data.

Accordingly, the learning data (D_{I,J,K,…,Y,Z}) can be expressed as an n_α×n_ω matrix. Among the associations between the various types of indicator of compromise (IoC) data and security vulnerability data, this is a matrix relating the metadata features of the data type placed first to the metadata features of the data type placed last. This result has the advantage that the various associations making up a complex incident can be expressed using IoC data and security vulnerability data, which are fragmentary log data. In other words, rather than yielding fragmentary IoC data and security vulnerability data, the result is expressed in terms of the metadata of the IoC data and the security vulnerability data, which yields a comprehensive and compact data set for a series of complex events.
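The chained product can be sketched as below, assuming that the learning data is the left-to-right matrix product of consecutive graph data, so that the inner sizes cancel and only the first row count and the last column count remain. The function names and the numeric values are hypothetical.

```python
from functools import reduce


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def learning_data(graphs):
    """Multiply the graph data in order, from the first pair to the last pair."""
    return reduce(matmul, graphs)


# Hypothetical graph data: 3 x 2 (first pair) followed by 2 x 4 (second pair).
g_first = [[1, 2],
           [0, 1],
           [1, 0]]
g_second = [[1, 0, 1, 0],
            [0, 1, 0, 2]]

d = learning_data([g_first, g_second])
print(d)  # [[1, 2, 1, 4], [0, 1, 0, 2], [1, 0, 1, 0]]
```

The result is a 3×4 matrix: the row count comes from the first graph data and the column count from the last, in line with the n_α×n_ω size stated above.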

Meanwhile, with this method of computing graph data, if metadata does not exist for one of the IoC data types making up the overall incident, such exception data can be handled by omitting the metadata matrix multiplication for that IoC data type in Equation 3 above.

FIG. 6 is a diagram illustrating a learning data generation method according to an embodiment of the present invention.

Referring to FIG. 6, the data collection unit 310 may receive indicator of compromise (IoC) data of the system to be protected 100 (S610).

The data collection unit 310 may then receive security vulnerability data corresponding to the IoC data (S620).

Next, the feature information generation unit 320 may generate feature data by detecting the metadata included in the IoC data and the security vulnerability data (S630).

Specifically, the learning data generation unit 330 may use the detected metadata to generate a feature matrix corresponding to each of the IoC data and the security vulnerability data (S640). Here, the feature matrix may have rows and columns corresponding to the number of metadata entries and the number of attributes of the metadata.

The learning data generation unit 330 may select feature matrices corresponding to two data of different types from at least one of the IoC data and the security vulnerability data (S650).

The learning data generation unit 330 may select the cross-reference relationship matrix corresponding to the two selected data (S660).

The learning data generation unit 330 may use the selected feature matrices and the cross-reference relationship matrix to generate graph data representing the association between the two selected data (S670).

When a plurality of graph data, each corresponding to two data of different types from at least one of the IoC data and the security vulnerability data, has been generated, the learning data generation unit 330 may generate the learning data through matrix multiplication of the plurality of graph data (S680).

The learning data management unit 340 may then store and manage the learning data (S690).
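Steps S650 to S680 can be summarized in a short sketch. It assumes that the feature matrices of step S640 and the cross-reference relationship matrices are already available, that the data types are processed in a fixed order, and that each graph data is the product of the transposed first feature matrix, the cross-reference relationship matrix and the second feature matrix. The function names, the dictionary layout and all values are hypothetical.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def generate_learning_data(order, features, cross_refs):
    """Build graph data for consecutive type pairs and multiply them in order."""
    graphs = []
    for left, right in zip(order, order[1:]):           # S650: pick two types
        a = cross_refs[(left, right)]                    # S660: their cross-reference matrix
        g = matmul(matmul(transpose(features[left]), a), features[right])  # S670
        graphs.append(g)
    result = graphs[0]
    for g in graphs[1:]:                                 # S680: chained matrix product
        result = matmul(result, g)
    return result


# Hypothetical feature matrices (rows: metadata entries, columns: attributes).
features = {
    "Port": [[1, 0], [0, 1]],
    "IP": [[1, 1, 0], [0, 1, 1]],
    "Hash": [[1], [2], [0]],
}
# Hypothetical cross-reference relationship matrices.
cross_refs = {
    ("Port", "IP"): [[1, 0], [0, 1]],
    ("IP", "Hash"): [[1, 0, 0], [0, 1, 1]],
}

d = generate_learning_data(["Port", "IP", "Hash"], features, cross_refs)
print(d)  # [[4], [5]]
```

The result has one row per Port attribute and one column per Hash attribute, so only the first and last types in the chain determine its size.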

The term "unit" as used in this embodiment refers to software or to a hardware component such as a field-programmable gate array (FPGA) or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays and variables. The functions provided within the components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units". In addition, the components and "units" may be implemented to run on one or more CPUs in a device or a secure multimedia card.

Although the above description has focused on embodiments, these are merely examples and do not limit the present invention. Those of ordinary skill in the art to which the present invention pertains will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the embodiments. For example, each component specifically shown in the embodiments may be modified and implemented. Differences related to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

300: learning data generating device
310: data collection unit
320: feature information generation unit
330: learning data generation unit
340: learning data management unit

Claims

A learning data generating device comprising:
a data collection unit that collects indicator of compromise (IoC) data of a system to be protected and security vulnerability data corresponding to the IoC data;
a feature information generation unit that generates feature data by detecting metadata included in the IoC data and the security vulnerability data; and
a learning data generation unit that generates learning data based on a cross-reference relationship between the IoC data and the security vulnerability data and on the detected feature data,
wherein the learning data generation unit:
generates a feature matrix corresponding to each of the IoC data and the security vulnerability data using the detected metadata;
selects feature matrices corresponding to two data of different types from at least one of the IoC data and the security vulnerability data;
selects a cross-reference relationship matrix corresponding to the two selected data; and
generates graph data representing an association between the two selected data using the selected feature matrices and the cross-reference relationship matrix,
wherein the cross-reference relationship is determined by Equation 1 and the graph data is determined by Equation 2.
[Equation 1]

(Here, i means metadata of the first type of data (I), and j means metadata of the second type of data (J))
[Equation 2]

(Here, α means the feature matrix of the first type of data (I), and β means the feature matrix of the second type of data (J).)

(deleted)

The learning data generating device of claim 1,
wherein the feature matrix has rows and columns corresponding to the number of metadata entries and the number of attributes of the metadata.

The learning data generating device of claim 1,
wherein the learning data generation unit:
generates a plurality of graph data, each corresponding to two data of different types from at least one of the IoC data and the security vulnerability data; and
generates the learning data through matrix multiplication of the plurality of graph data.

The learning data generating device of claim 1,
further comprising a learning data management unit that stores and manages the learning data.