KR102269652B1

KR102269652B1 - Machine learning-based learning vector generation device and method for analyzing security logs

Info

Publication number: KR102269652B1
Application number: KR1020190117353A
Authority: KR
Inventors: 윤명근; 조영훈; 명준우
Original assignee: 국민대학교산학협력단
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2021-06-25
Also published as: KR20210035502A

Abstract

The present invention relates to a machine learning-based learning vector generation apparatus and method for security control data analysis. The apparatus includes a first feature generator that generates, for a learning population, a first feature for each of a plurality of fixed fields included in training data; a second feature generator that generates an n-dimensional feature vector (where n is a natural number) as a second feature for at least one variable field included in the training data; and a learning vector generator that generates, as the learning vector for the training data, a vector whose components include the first feature and the second feature. The present invention can therefore improve the accuracy of malicious code detection by expanding the features of the variable field of the training data.

Description

Machine learning-based learning vector generation apparatus and method for security control data analysis {MACHINE LEARNING-BASED LEARNING VECTOR GENERATION DEVICE AND METHOD FOR ANALYZING SECURITY LOGS}

The present invention relates to learning vector generation technology and, more particularly, to a machine learning-based learning vector generation apparatus and method for security control data analysis that can improve the accuracy of malicious code detection by expanding the features of the variable fields of training data.

As Internet and computer technologies continue to develop, attempts to exploit them for illicit gain are also increasing. For example, a growing number of attacks install and distribute malicious code on users' computers to profit at the users' expense. Here, malicious code refers to a program that infiltrates or is installed on a computer without the user's approval and performs malicious actions. Security experts are accordingly seeking a variety of countermeasures.

A feature vector may correspond to a vector whose dimensions carry feature information about the content to be analyzed. A feature vector may be defined according to the content, and the algorithm that generates it may vary. In general, malicious code detection performance may vary with the method used to generate the feature vectors that serve as training data when building a database for malicious code detection. The feature vector generation method can therefore determine malicious code detection performance.

Korean Patent Registration No. 10-0729107 (June 8, 2007)

An embodiment of the present invention provides a machine learning-based learning vector generation apparatus and method for security control data analysis that can improve the accuracy of malicious code detection by expanding the features of the variable fields of training data.

An embodiment of the present invention provides a machine learning-based learning vector generation apparatus and method for security control data analysis that can extract words from a specific field of training data through text mining and generate a feature vector of a specific length based on the TF or DF ranking of the words.

An embodiment of the present invention provides a machine learning-based learning vector generation apparatus and method for security control data analysis that can contribute to improved data analysis performance by performing machine learning with a conventional feature vector and an expanded feature vector together.

Among the embodiments, a machine learning-based learning vector generation apparatus for security control data analysis includes: a first feature generator that generates, for a learning population, a first feature for each of a plurality of fixed fields included in training data; a second feature generator that generates an n-dimensional feature vector (where n is a natural number) as a second feature for at least one variable field included in the training data; and a learning vector generator that generates, as the learning vector for the training data, a vector whose components include the first feature and the second feature.
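
As a minimal sketch of how the learning vector generator might combine the two kinds of features, the Python below appends the n components of the second feature to the integer first features. The function name `build_learning_vector` and the choice to flatten the second feature into the same list are illustrative assumptions; the source states only that the learning vector includes both features as components.

```python
from typing import List, Sequence


def build_learning_vector(first_features: Sequence[int],
                          second_feature: Sequence[int]) -> List[int]:
    """Concatenate per-field integers with the n-dimensional feature vector.

    Hypothetical helper: flattening both parts into one list is an
    assumption made for illustration, not a detail given in the source.
    """
    return list(first_features) + list(second_feature)


if __name__ == "__main__":
    # Two fixed-field integers followed by a 3-dimensional second feature.
    print(build_learning_vector([167772161, 80], [1, 0, 1]))
```

Under this assumption, a record with k fixed fields and an n-dimensional second feature yields a learning vector of length k + n.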

The first feature generator generates one integer as the first feature by applying a feature extraction algorithm to the plurality of fixed fields.

The second feature generator extracts a plurality of words from the at least one variable field and generates the n-dimensional feature vector based on the frequency of occurrence of each of the plurality of words.

The second feature generator generates the second feature by performing a first step of determining a TF (Term Frequency) or DF (Document Frequency) ranking over the learning population based on the plurality of words, a second step of determining the word corresponding to each of the n dimensions based on the TF or DF ranking, and a third step of determining a component value for each of the n dimensions according to whether the corresponding word appears.

In the third step, the second feature generator sets the component value to '1' if the corresponding word appears and to '0' if it does not.

The second feature generator generates the n-dimensional feature vector by applying feature hashing to the at least one variable field.
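
Feature hashing can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how the variable field is tokenized, which hash function is used, or whether components hold counts or flags. Whitespace tokenization, MD5 as the bucket hash, and count-valued components are all assumptions, and `hash_features` is a hypothetical name.

```python
import hashlib
from typing import List


def hash_features(text: str, n: int) -> List[int]:
    """Map the tokens of `text` into an n-dimensional count vector.

    Each token is hashed and reduced modulo n to pick a dimension, so no
    vocabulary has to be stored. Distinct tokens may share a dimension.
    """
    vector = [0] * n
    for token in text.split():  # assumption: whitespace-delimited tokens
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n
        vector[index] += 1
    return vector


if __name__ == "__main__":
    print(hash_features("GET /index.php id=1 id=1", 8))
```

The vector length n is fixed in advance, which keeps the second feature the same size for every record regardless of how long the variable field is.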

The learning vector generation apparatus may further include a malicious code detection learning unit that generates a malicious code detection model by learning the learning vector as feature information about the training data, and a malicious code detection unit that detects malicious code using the malicious code detection model.

Among the embodiments, a machine learning-based learning vector generation method for security control data analysis includes: generating a first feature for each of a plurality of fixed fields included in training data; generating an n-dimensional feature vector (where n is a natural number) as a second feature for at least one variable field included in the training data; generating, as the learning vector for the training data, a vector whose components include the first feature and the second feature; generating a malicious code detection model by learning the learning vector as feature information about the training data; and detecting malicious code using the malicious code detection model.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The machine learning-based learning vector generation apparatus and method for security control data analysis according to an embodiment of the present invention can improve the accuracy of malicious code detection by expanding the features of the variable fields of training data.

The machine learning-based learning vector generation apparatus and method for security control data analysis according to an embodiment of the present invention can extract words from a specific field of training data through text mining and generate a feature vector of a specific length based on the TF or DF ranking of the words.

FIG. 1 is a diagram illustrating a learning vector generation system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the physical configuration of the learning vector generation apparatus of FIG. 1.
FIG. 3 is a diagram illustrating the functional configuration of the learning vector generation apparatus of FIG. 1.
FIG. 4 is a flowchart illustrating the machine learning-based learning vector generation process for security control data analysis performed by the learning vector generation apparatus of FIG. 1.
FIG. 5 is a diagram illustrating the learning vector generation process performed by the learning vector generation apparatus of FIG. 1.
FIG. 6 is a diagram illustrating the feature vector generation process for a variable field performed by the learning vector generation apparatus of FIG. 1.
FIG. 7 is a diagram illustrating an embodiment of training data.
FIG. 8 is a diagram illustrating feature expansion for a specific field of training data.

The description of the present invention is merely a set of embodiments for structural or functional explanation, so the scope of rights of the present invention should not be construed as limited by the embodiments described in the text. That is, because the embodiments can be modified in various ways and can take various forms, the scope of rights of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only such effects, so the scope of rights of the present invention should not be understood as being limited thereby.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected" to another component, it may be directly connected to that other component, but it should be understood that another component may exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing relationships between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of a stated feature, number, step, operation, component, part, or combination thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for the steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless the context clearly states a specific order, the steps may occur in an order different from the one stated. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed across network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art, and should not be interpreted as having an ideal or excessively formal meaning unless explicitly defined in this application.

FIG. 1 is a diagram illustrating a learning vector generation system according to an embodiment of the present invention.

Referring to FIG. 1, the learning vector generation system 100 may include a user terminal 110, a learning vector generation apparatus 130, and a database 150.

The user terminal 110 may correspond to a computing device that can request generation of a feature vector for specific content and check the result. It may be implemented as a smartphone, a laptop, or a computer, but is not necessarily limited thereto and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the learning vector generation apparatus 130 through a network, and a plurality of user terminals 110 may be connected to the learning vector generation apparatus 130 simultaneously.

The learning vector generation apparatus 130 may be implemented as a server corresponding to a computer or program capable of generating a learning vector for training data. The learning vector generation apparatus 130 may also learn the generated learning vectors and detect malicious code using the detection model produced by that learning. The learning vector generation apparatus 130 may be wirelessly connected to the user terminal 110 through Bluetooth, WiFi, a communication network, or the like, and may exchange data with the user terminal 110 through the network.

In an embodiment, the learning vector generation apparatus 130 may work with the database 150 to store the information needed in the process of generating a learning vector for training data. Unlike in FIG. 1, the learning vector generation apparatus 130 may also be implemented with the database 150 included inside it. The learning vector generation apparatus 130 may be implemented with a processor, a memory, a user input/output unit, and a network input/output unit, which are described in more detail with reference to FIG. 2.

The database 150 may correspond to a storage device that stores the various information needed in the learning vector generation process. The database 150 may store the training data used to generate learning vectors and information about the word embeddings used in the learning vector generation process. It is not necessarily limited thereto, and may store information collected or processed in various forms while the learning vector generation apparatus 130 generates machine learning-based learning vectors for security control data analysis.

FIG. 2 is a diagram illustrating the physical configuration of the learning vector generation apparatus of FIG. 1.

Referring to FIG. 2, the learning vector generation apparatus 130 may be implemented with a processor 210, a memory 230, a user input/output unit 250, and a network input/output unit 270.

The processor 210 may execute the procedures that handle each operation of the learning vector generation apparatus 130, manage the memory 230 that is read or written throughout that process, and schedule the synchronization time between the volatile and non-volatile memory in the memory 230. The processor 210 may control the overall operation of the learning vector generation apparatus 130, and may be electrically connected to the memory 230, the user input/output unit 250, and the network input/output unit 270 to control the data flow among them. The processor 210 may be implemented as the CPU (Central Processing Unit) of the learning vector generation apparatus 130.

The memory 230 may include an auxiliary storage device, implemented as non-volatile memory such as an SSD (Solid State Drive) or HDD (Hard Disk Drive), that stores the data the learning vector generation apparatus 130 needs, and a main storage device implemented as volatile memory such as RAM (Random Access Memory).

The user input/output unit 250 may include an environment for receiving user input and an environment for outputting specific information to the user. For example, the user input/output unit 250 may include an input device with an adapter, such as a touch pad, a touch screen, an on-screen keyboard, or a pointing device, and an output device with an adapter, such as a monitor or a touch screen. In an embodiment, the user input/output unit 250 may correspond to a computing device connected through remote access, in which case the learning vector generation apparatus 130 may operate as a server.

The network input/output unit 270 includes an environment for connecting to an external device or system through a network and may include, for example, an adapter for communication over a LAN (Local Area Network), MAN (Metropolitan Area Network), WAN (Wide Area Network), or VAN (Value Added Network).

FIG. 3 is a diagram illustrating the functional configuration of the learning vector generation apparatus of FIG. 1.

Referring to FIG. 3, the learning vector generation apparatus 130 may include a first feature generator 310, a second feature generator 320, a learning vector generator 330, a malicious code detection learning unit 340, a malicious code detection unit 350, and a control unit 360.

The first feature generator 310 may generate, for the learning population, a first feature for each of the plurality of fixed fields included in the training data. Here, training data may correspond to the data used to generate learning vectors for machine learning, and the learning population may correspond to the set of training data. For example, when an intrusion prevention system (IPS) or intrusion detection system (IDS) performs machine learning for security control data analysis, the training data may correspond to network packets.

A fixed field may correspond to a specific region of the training data that has a fixed size or stores a fixed item. For example, for a network packet, the fixed fields may include the time, source IP, source port, destination IP, destination port, protocol, and packet size. The first feature may correspond to feature information about the training data. The first feature generator 310 may generate, as the first feature, the feature information corresponding to each fixed field included in the training data.

In an embodiment, the first feature generator 310 may generate one integer as the first feature by applying a feature extraction algorithm to the plurality of fixed fields. The first feature generator 310 may provide a fixed field as input to the feature extraction algorithm and obtain an integer as output, as the feature information corresponding to that fixed field. Depending on the feature extraction algorithm, the first feature generator 310 may instead obtain a vector as output, in which case it may perform post-processing that converts the vector into an integer.
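
The source does not name the feature extraction algorithm, so the following Python is only a sketch of one plausible mapping from fixed fields to one integer each. The record keys (`src_ip`, `src_port`, and so on), the `PROTOCOL_IDS` table, and the function name `fixed_field_features` are hypothetical; IP addresses are converted with the standard `ipaddress` module, and the time field is left out for brevity.

```python
import ipaddress
from typing import Dict, List

# Assumption: protocols are encoded by their IANA protocol numbers.
PROTOCOL_IDS = {"icmp": 1, "tcp": 6, "udp": 17}


def fixed_field_features(record: Dict[str, object]) -> List[int]:
    """Return one integer per fixed field of a packet record."""
    return [
        int(ipaddress.ip_address(str(record["src_ip"]))),
        int(record["src_port"]),
        int(ipaddress.ip_address(str(record["dst_ip"]))),
        int(record["dst_port"]),
        PROTOCOL_IDS[str(record["protocol"]).lower()],
        int(record["size"]),
    ]


if __name__ == "__main__":
    packet = {
        "src_ip": "10.0.0.1", "src_port": 1234,
        "dst_ip": "192.168.0.1", "dst_port": 80,
        "protocol": "TCP", "size": 60,
    }
    # prints [167772161, 1234, 3232235521, 80, 6, 60]
    print(fixed_field_features(packet))
```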

The second feature generator 320 may generate an n-dimensional feature vector as the second feature for at least one variable field included in the training data. Here, a variable field may correspond to a specific region of the training data whose size may change from one training data item to another and which may store items other than the fixed items. For example, for a network packet, the variable field may include the payload region in the form of a byte stream. The second feature may correspond to feature information about the training data.

For a variable field, the second feature generator 320 may generate as the second feature not a single integer but a feature vector of length n, that is, an n-dimensional feature vector. The second feature generator 320 could of course generate a single integer as the second feature by applying a cryptographic hash function (for example, MD5 or SHA256) to the variable field regardless of its length, but a description of that approach is omitted. By applying a text mining technique to the variable field, the second feature generator 320 can generate an n-dimensional feature vector rather than a single integer, which expands the features and at the same time can improve the accuracy of machine learning-based security control data analysis.
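
For contrast, the single-integer alternative mentioned above can be sketched in a few lines of Python. The function name `payload_to_integer` and the truncation of the SHA-256 digest to 64 bits are assumptions made so the result fits a typical integer feature column; the source only says that a cryptographic hash yields one integer.

```python
import hashlib


def payload_to_integer(payload: bytes) -> int:
    """Collapse a variable-length payload into a single 64-bit integer."""
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")  # assumption: keep 8 bytes


if __name__ == "__main__":
    print(payload_to_integer(b"GET /index.php?id=1 HTTP/1.1"))
    print(payload_to_integer(b"GET /index.php?id=2 HTTP/1.1"))
```

Because a cryptographic hash is designed so that similar inputs give unrelated outputs, the integer identifies the payload but says nothing about which words it contains, which is the information the n-dimensional vector preserves.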

In an embodiment, the second feature generator 320 may extract a plurality of words from at least one variable field and generate an n-dimensional feature vector based on the frequency of occurrence of each of the plurality of words. A variable field may correspond to a variable-length character string, and the second feature generator 320 may extract a plurality of words as the tokens of that string. The second feature generator 320 may compute the frequency of occurrence of each word across all the training data in the learning population and, based on this, generate a feature vector of length n for each training data item.
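
Word extraction from a payload might look like the sketch below. The tokenization rule is an assumption, since the source does not define what counts as a word: here the bytes are decoded as Latin-1 (which never fails), lowercased, and split into runs of letters, digits, and underscores. `extract_words` is a hypothetical name.

```python
import re
from typing import List


def extract_words(payload: bytes) -> List[str]:
    """Split a byte-stream payload into lowercase alphanumeric tokens."""
    text = payload.decode("latin-1").lower()
    return re.findall(r"[a-z0-9_]+", text)


if __name__ == "__main__":
    # prints ['get', 'index', 'php', 'id', '1', 'http', '1', '1']
    print(extract_words(b"GET /index.php?id=1 HTTP/1.1"))
```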

In an embodiment, the second feature generator 320 may generate the second feature by performing a first step of determining a TF (Term Frequency) or DF (Document Frequency) ranking over the learning population based on the plurality of words, a second step of determining the word corresponding to each of the n dimensions based on the TF or DF ranking, and a third step of determining a component value for each of the n dimensions according to whether the corresponding word appears.

More specifically, in a preprocessing step the second feature generator 320 may extract a plurality of words from the variable field of the training data, and in the first step it may determine the ranking by computing the TF or DF, which measure how frequently each word appears across all the training data in the learning population. TF (Term Frequency) may correspond to the number of times each word appears in all the training data, and DF (Document Frequency) may correspond to the number of training data items in which each word appears. The second feature generator 320 may choose either TF or DF as the sort criterion and then sort the words by that criterion to determine the ranking.
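
The two counts and the resulting ranking can be sketched in Python as follows. The function names and the alphabetical tie-break are assumptions added for illustration; the source does not say how words with equal frequency are ordered.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def term_and_document_frequency(
    docs: Iterable[Sequence[str]],
) -> Tuple[Counter, Counter]:
    """Return (TF, DF) over a learning population of tokenized records."""
    tf: Counter = Counter()
    df: Counter = Counter()
    for words in docs:
        tf.update(words)       # every occurrence counts toward TF
        df.update(set(words))  # each record counts at most once toward DF
    return tf, df


def rank_words(freq: Dict[str, int]) -> List[str]:
    """Sort words by descending frequency; ties broken alphabetically."""
    return [w for w, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]


if __name__ == "__main__":
    population = [["a", "a", "b"], ["a", "c"], ["b", "b", "b", "b"]]
    tf, df = term_and_document_frequency(population)
    print(rank_words(tf))  # prints ['b', 'a', 'c']
    print(rank_words(df))  # prints ['a', 'b', 'c']
```

In this toy population the word 'b' occurs most often overall (TF 5) but appears in only two records, the same as 'a', so the TF and DF rankings differ.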

In addition, the second feature generator 320 may associate each word with a dimension by using the word's position in the TF or DF ranking as the dimension index. For example, for the top three words 'a', 'b', and 'c' in TF ranking order, the second feature generator 320 may map 'a' to the first dimension, 'b' to the second dimension, and 'c' to the third dimension.

In addition, the second feature generator 320 may determine each component value according to whether the word corresponding to that dimension appears. In the example above, the second feature generator 320 may generate a three-dimensional feature vector. More specifically, the first component is assigned a value based on whether the word 'a' appears, the second component based on whether the word 'b' appears, and the third component based on whether the word 'c' appears.

In one embodiment, in the third step, the second feature generator 320 may set the component value to '1' if the corresponding word appears and to '0' if it does not. For example, in the third step, the second feature generator 320 may assign component values by one-hot encoding based on whether each word appears. Here, one-hot encoding is a method of generating a feature vector by marking 1 if a feature is present and 0 if it is absent, so the size of the feature vector is determined by the number of features.

That is, the second feature generator 320 may determine an index (or component) based on the TF or DF ranking, and may generate the feature vector by repeatedly assigning 1 or 0 as the component value at that index.
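
The second and third steps can be sketched as below, assuming the ranked word list has already been computed. The function names are hypothetical, and the code uses 0-based indices where the description counts dimensions from 1.

```python
def build_index(ranked_words, n):
    """Step 2: map the top-n ranked words to dimension indices 0..n-1."""
    return {word: i for i, word in enumerate(ranked_words[:n])}

def presence_vector(words, index):
    """Step 3: a component is 1 if the word for that dimension appears, else 0."""
    vector = [0] * len(index)
    for word in set(words):
        if word in index:
            vector[index[word]] = 1
    return vector

# 'a', 'b', 'c' are the top three words; 'd' is ranked fourth and dropped.
index = build_index(["a", "b", "c", "d"], n=3)
vector = presence_vector(["b", "x", "b", "c"], index)
print(vector)  # [0, 1, 1]
```

Words outside the top n, such as 'x' here, do not contribute to the vector, which keeps its length fixed at n regardless of the payload size.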

In one embodiment, the second feature generator 320 may generate the n-dimensional feature vector by applying feature hashing to the at least one variable field. That is, instead of using the TF or DF of each word, the second feature generator 320 may apply feature hashing to the variable field. Here, feature hashing is a method of generating a feature vector with a hash function: for each word, the output of the hash function is used as an index, and the component value at that index is incremented. The number of indices is n.
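
A minimal sketch of feature hashing follows. The description does not name a hash function, so the use of MD5 from Python's `hashlib` is an assumption made here for a stable, reproducible index; the function name is also hypothetical.

```python
import hashlib

def hashed_feature_vector(words, n):
    """Map each word to one of n indices by hashing and increment that component."""
    vector = [0] * n
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n  # hash output reduced to 0..n-1
        vector[index] += 1
    return vector

fv = hashed_feature_vector(["GET", "/index.php", "GET", "HTTP/1.1"], n=8)
```

Unlike the TF or DF approach, no pass over the learning population is needed to build a vocabulary, at the cost that different words may collide on the same index.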

The learning vector generator 330 may generate, as the learning vector for the training data, a vector that includes the first feature and the second feature as components. That is, the learning vector generator 330 may use the first feature for each of the plurality of fixed fields in the training data as the component value at the corresponding position, and may insert the n-dimensional feature vector that forms the second feature for the at least one variable field at its corresponding position, where it serves as n component values. Accordingly, each fixed field corresponds to one component of the learning vector, and each variable field corresponds to n components.
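
The assembly described above can be sketched as a simple flattening step. The function name and the sample values are hypothetical; the sketch assumes each fixed field has already been reduced to one number and each variable field to a list of n numbers.

```python
def build_learning_vector(field_features):
    """Flatten per-field features, in field order, into one learning vector.

    A scalar (fixed field) contributes one component; a list (variable field)
    contributes n components inserted at that field's position.
    """
    vector = []
    for feature in field_features:
        if isinstance(feature, (list, tuple)):
            vector.extend(feature)
        else:
            vector.append(feature)
    return vector

# Two fixed fields, one 3-dimensional variable field, then one more fixed field.
learning_vector = build_learning_vector([80, 6, [1, 0, 1], 1500])
print(learning_vector)  # [80, 6, 1, 0, 1, 1500]
```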

In one embodiment, the learning vector generator 330 may generate the learning vector by mapping each of the first and second features to a single component. That is, the learning vector generator 330 may assign the second feature, an n-dimensional vector, as one component value of the learning vector. In that case, the learning vector is a multidimensional vector that has an n-dimensional vector as one of its component values.

The malicious code detection learning unit 340 may generate a malicious code detection model by learning the learning vector as feature information for the training data. The malicious code detection model may be a machine learning-based classification model produced through training, and may output a result indicating whether the content associated with the input feature information is malicious code. For example, in an intrusion detection system (IDS) or an intrusion prevention system (IPS), the malicious code detection model may classify whether a network packet is malicious during security control data analysis.

In one embodiment, depending on the detection environment, the malicious code detection learning unit 340 may perform training using only labeled training data. Alternatively, the malicious code detection learning unit 340 may perform training on a mixture of labeled and unlabeled training data. In that case, the malicious code detection learning unit 340 may pre-classify the unlabeled training data in a preprocessing step by applying another malicious code detection method, thereby converting it into labeled training data before training.
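
The pre-classification of unlabeled data might look like the sketch below. The description does not specify which other detection method is used, so `signature_rule` is a purely hypothetical stand-in, and representing a missing label as `None` is likewise an assumption.

```python
def complete_labels(samples, pre_classifier):
    """Fill in missing labels (None) using a separate pre-classification function."""
    completed = []
    for vector, label in samples:
        if label is None:
            label = pre_classifier(vector)  # pre-classification in preprocessing
        completed.append((vector, label))
    return completed

def signature_rule(vector):
    """Hypothetical stand-in detector: flag the sample if its first component is set."""
    return 1 if vector[0] == 1 else 0

samples = [([1, 0, 1], 1), ([0, 0, 1], None), ([1, 1, 0], None)]
labeled = complete_labels(samples, signature_rule)
```

Labels that already exist are left untouched, so the quality of the pre-classifier only affects the originally unlabeled portion of the training data.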

The malicious code detection unit 350 may detect malicious code using the malicious code detection model. Based on the input feature vector, the malicious code detection model may output whether the associated content (for example, a file or a network packet) is malicious or normal, and the malicious code detection unit 350 may provide a malicious code detection result on that basis. The detection result may be rendered graphically and shown on a display panel, or delivered with related information in a separate message; it is not limited to these forms and may be visualized by the malicious code detection unit 350 in various ways.

The control unit 360 may control the overall operation of the learning vector generating apparatus 130 and manage the control flow and data flow among the first feature generator 310, the second feature generator 320, the learning vector generator 330, the malicious code detection learning unit 340, and the malicious code detection unit 350.

FIG. 4 is a flowchart illustrating the machine learning-based learning vector generation process for security control data analysis performed by the learning vector generating apparatus of FIG. 1.

Referring to FIG. 4, the learning vector generating apparatus 130 may generate, through the first feature generator 310, a first feature for each of a plurality of fixed fields included in the training data of the learning population (step S410). The learning vector generating apparatus 130 may generate, through the second feature generator 320, an n-dimensional feature vector as a second feature for at least one variable field included in the training data (step S430). The learning vector generating apparatus 130 may generate, through the learning vector generator 330, a vector including the first feature and the second feature as components, as the learning vector for the training data (step S450).

In addition, the learning vector generating apparatus 130 may generate a malicious code detection model through the malicious code detection learning unit 340 by learning the learning vector as feature information for the training data (step S470). The learning vector generating apparatus 130 may detect malicious code through the malicious code detection unit 350 using the malicious code detection model (step S490).

FIG. 5 is a diagram illustrating the learning vector generation process performed by the learning vector generating apparatus of FIG. 1.

Referring to FIG. 5, the learning vector generating apparatus 130 may generate a learning vector 530 for training data 510. Here, the training data 510 may be a network packet contained in a security control log of an intrusion detection system or an intrusion prevention system. The network packet may include fixed fields 511 and a variable field 513.

The learning vector generating apparatus 130 may convert each fixed field 511 into a single integer and use it as one component of the learning vector 530, and may convert the variable field 513 into an n-dimensional vector and use it as n components of the learning vector 530. As a result, by expanding the features of a specific field of the training data 510, the learning vector generating apparatus 130 may improve the accuracy of machine learning-based security control.

FIG. 6 is a diagram illustrating the feature vector generation process for a variable field performed by the learning vector generating apparatus of FIG. 1.

Referring to FIG. 6, the learning vector generating apparatus 130 may generate a feature vector for a variable field using TF or DF. For example, the learning vector generating apparatus 130 may generate a feature vector for the DecodePayload field (hereinafter "dp") of the training data.

In FIG. 6, assume that three dp values are used as training data: dp1 = ['b', 'a', 'b', 'd', 'b', 'b'], dp2 = ['d', 'b', 'b', 'd', 'd', 'd', 'a'], and dp3 = ['c', 'c', 'c', 'c', 'a']. Each item of training data is tokenized and split into a plurality of words.

The top three words by TF are TF('b') = 6, TF('d') = 5, and TF('c') = 4. In this case, the feature vector for dp may be defined as FV(dp) = [whether 'b' appears, whether 'd' appears, whether 'c' appears], and the feature vectors for the individual dp values are FV(dp1) = [1, 1, 0], FV(dp2) = [1, 1, 0], and FV(dp3) = [0, 0, 1].

The top three words by DF are DF('a') = 3, DF('d') = 2, and DF('b') = 2. In this case, the feature vector for dp may be defined as FV(dp) = [whether 'a' appears, whether 'd' appears, whether 'b' appears], and the feature vectors for the individual dp values are FV(dp1) = [1, 1, 1], FV(dp2) = [1, 1, 1], and FV(dp3) = [1, 0, 0].
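
The FIG. 6 example can be reproduced with the short script below. The helper names are introduced here for illustration, and ties are broken alphabetically, which is an assumption: 'b' and 'd' both have DF = 2, so this sketch orders the DF vocabulary as ['a', 'b', 'd'] rather than ['a', 'd', 'b']. For this data the resulting vectors are the same under either order.

```python
from collections import Counter

dp1 = ["b", "a", "b", "d", "b", "b"]
dp2 = ["d", "b", "b", "d", "d", "d", "a"]
dp3 = ["c", "c", "c", "c", "a"]
docs = [dp1, dp2, dp3]

tf, df = Counter(), Counter()
for words in docs:
    tf.update(words)       # total occurrences across all dp values
    df.update(set(words))  # number of dp values containing the word

def top_n(freq, n):
    """Top-n words by descending frequency; ties broken alphabetically (assumption)."""
    return sorted(freq, key=lambda w: (-freq[w], w))[:n]

def presence(words, vocab):
    """1 if the vocabulary word appears in this dp value, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

tf_vocab = top_n(tf, 3)  # ['b', 'd', 'c']
df_vocab = top_n(df, 3)  # ['a', 'b', 'd']
tf_vectors = [presence(d, tf_vocab) for d in docs]
df_vectors = [presence(d, df_vocab) for d in docs]
print(tf_vectors)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(df_vectors)  # [[1, 1, 1], [1, 1, 1], [1, 0, 0]]
```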

FIG. 7 is a diagram illustrating an example of training data.

Referring to FIG. 7, the learning vector generating apparatus 130 may use network packets from a security control log as training data. In this case, the training data may include, as fixed fields 710, the time (time), source IP (sIP), source port (sPort), destination IP (dIP), destination port (dPort), protocol (protocol), packet size (size), and event (event), and may include the payload (payload) as a variable field 730.

The learning vector generating apparatus 130 may apply a text mining technique to the payload, which is the variable field 730, and may expand the dimension of the learning vector by tokenizing the payload with the space character as the delimiter and generating an n-dimensional feature vector based on the frequency of occurrence of each resulting word. For the fields other than the variable field 730, the learning vector generating apparatus 130 may extract features with a general feature extraction algorithm and use them as the remaining component values of the learning vector. That is, the learning vector generating apparatus 130 may use the fixed fields 710 as they are while expanding the features of the variable field 730, thereby improving the accuracy of machine learning-based classification.
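
As a rough illustration of how one packet record might be prepared, the sketch below tokenizes the payload on whitespace and maps some fixed fields to single integers. The description does not define the "general feature extraction algorithm", so the integer encodings here (IP address as its integer value, protocol as its IANA number), the function names, and the sample packet are all assumptions.

```python
import ipaddress

PROTOCOL_NUMBERS = {"icmp": 1, "tcp": 6, "udp": 17}  # IANA-assigned protocol numbers

def encode_fixed_field(name, value):
    """Map one fixed field to a single integer (hypothetical encoding)."""
    if name in ("sIP", "dIP"):
        return int(ipaddress.ip_address(value))
    if name == "protocol":
        return PROTOCOL_NUMBERS[value.lower()]
    return int(value)  # sPort, dPort, and size are already numeric strings

def tokenize_payload(payload):
    """Split the payload on whitespace to obtain its words."""
    return payload.split()

# Hypothetical packet record using field names from FIG. 7.
packet = {"sIP": "192.168.0.1", "sPort": "51234", "dIP": "10.0.0.5",
          "dPort": "80", "protocol": "TCP", "size": "512",
          "payload": "GET /index.php?id=1 HTTP/1.1"}

fixed = [encode_fixed_field(k, v) for k, v in packet.items() if k != "payload"]
words = tokenize_payload(packet["payload"])
```

The `words` list is what a TF, DF, or feature hashing step would consume, while each entry of `fixed` occupies one component of the learning vector.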

FIG. 8 is a diagram illustrating feature expansion for a specific field of the training data.

Referring to FIG. 8, the learning vector generating apparatus 130 may generate an expanded learning vector by performing feature expansion on at least one variable field 830 included in the training data.

The learning vector generating apparatus 130 may perform feature expansion on the variable fields 830 A and C included in training data e1 and e2. More specifically, the learning vector generating apparatus 130 may expand the variable fields 830 A and C into two and three dimensions, respectively. For the variable field 830 A, the learning vector generating apparatus 130 may assign the component values of the feature vector based on whether each of the top two words by DF appears, and for the variable field 830 C, based on whether each of the top three words by TF appears. Meanwhile, for the fixed fields 810 included in the training data, the learning vector generating apparatus 130 may calculate the component values by applying an existing technique.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various modifications and changes can be made without departing from the spirit and scope of the invention as set forth in the claims below.

100: learning vector generation system
110: user terminal 130: learning vector generating apparatus
150: database
210: processor 230: memory
250: user input/output unit 270: network input/output unit
310: first feature generator 320: second feature generator
330: learning vector generator 340: malicious code detection learning unit
350: malicious code detection unit 360: control unit
510: training data 511: fixed field
513: variable field 530: learning vector
710, 810: fixed field 730, 830: variable field

Claims

A machine learning-based learning vector generating apparatus for security control data analysis, comprising:
a first feature generator configured to generate, for a learning population, a first feature for each of a plurality of fixed fields included in training data;
a second feature generator configured to extract a plurality of words from at least one variable field included in the training data and to generate, as a second feature for the at least one variable field, an n-dimensional feature vector (where n is a natural number) based on the frequency of occurrence of each of the plurality of words; and
a learning vector generator configured to generate, as a learning vector for the training data, a vector including the first feature and the second feature as components.

The apparatus of claim 1, wherein the first feature generator
generates one integer as the first feature by applying a feature extraction algorithm to the plurality of fixed fields.

(Deleted)

The apparatus of claim 1, wherein the second feature generator
generates the second feature by performing a first step of determining a TF (Term Frequency) or DF (Document Frequency) ranking for the learning population based on the plurality of words, a second step of determining the word corresponding to each of the n dimensions based on the TF or DF ranking, and a third step of determining a component value for each of the n dimensions according to whether the corresponding word appears.

The apparatus of claim 4, wherein the second feature generator,
in the third step, determines '1' as the component value if the corresponding word appears and '0' as the component value if it does not.

The apparatus of claim 1, wherein the second feature generator
generates the n-dimensional feature vector by applying feature hashing to the at least one variable field.

The apparatus of claim 1, further comprising:
a malicious code detection learning unit configured to generate a malicious code detection model by learning the learning vector as feature information for the training data; and
a malicious code detection unit configured to detect malicious code using the malicious code detection model.

A machine learning-based learning vector generating method for security control data analysis, performed by a learning vector generating apparatus, the method comprising:
generating a first feature for each of a plurality of fixed fields included in training data;
extracting a plurality of words from at least one variable field included in the training data and generating, as a second feature for the at least one variable field, an n-dimensional feature vector (where n is a natural number) based on the frequency of occurrence of each of the plurality of words;
generating, as a learning vector for the training data, a vector including the first feature and the second feature as components;
generating a malicious code detection model by learning the learning vector as feature information for the training data; and
detecting malicious code using the malicious code detection model.