KR20200129353A

KR20200129353A - method for generating similar malicious codes and method for improving malicious code detection performance using the same

Info

Publication number: KR20200129353A
Application number: KR1020190053640A
Authority: KR
Inventors: 배장성; 이창기; 김건영; 박천음; 전재원; 황현선
Original assignee: 한국전력공사; 강원대학교산학협력단
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2020-11-18

Abstract

Disclosed is a method for effectively training a machine learning model for malicious code detection. The method augments the data used for machine learning with a variational autoencoder-convolutional neural network (VAE-CNN) and uses the augmented data to train the detection model.

Description

Method for generating similar malicious codes and method for improving malicious code detection performance using the same

The present invention relates to a method for effectively training a machine learning model for malicious code detection. More specifically, it relates to a method that augments the data used for machine learning with a variational autoencoder-convolutional neural network (VAE-CNN: Variational Auto Encoder - Convolutional Neural Network) and uses the augmented data to train the malicious code detection model effectively.

Malicious code (malware) is a program written for a malicious purpose that harms a computer or network, for example by slowing down the computer or generating a large amount of network traffic.

Malicious code detection techniques include traditional signature-based methods and methods based on machine learning such as deep learning. Signature-based detection has high accuracy but a low application rate. Detection based on deep learning, in contrast, performs well in both accuracy and application rate.

FIG. 1 is a conceptual diagram illustrating a basic malicious code detection system using machine learning.

Machine-learning-based malicious code detection trains a model on normal files and files containing malicious code, and then uses the trained model to determine whether a suspicious file is malicious.

In the training stage, the model is trained on feature information of the files (strings, instructions, byte information, API call records, and so on) together with their labels (normal/abnormal), so that it is optimized for malicious code detection.

In the detection stage, the trained model distinguishes whether an input file is a normal file or malicious code (abnormal).

However, deep learning, which performs well at malicious code detection, requires a large amount of malicious code data. This is because, in most machine-learning-based malicious code detection techniques, the machine learning algorithm and the training data are the factors that most affect performance.

Yet although malicious code keeps increasing, it is not easy to obtain malicious code data that can actually be used for machine learning such as deep learning.

This is because a file can be confirmed as malicious code only after an actual infection, and because such malicious code data is treated as a kind of asset and is not disclosed.

This difficulty in securing a large amount of malicious code data limits malicious code detection methods based on machine learning.

The background art described above is technical information that the inventors possessed for, or acquired in the course of, deriving the present invention. It is not necessarily prior art disclosed to the general public before the filing of the present application.

The present invention is intended to solve the above problem. Its object is to augment the data used for machine learning with a variational autoencoder-convolutional neural network (VAE-CNN: Variational Auto Encoder - Convolutional Neural Network) and to use the augmented data to train a malicious code detection machine learning model effectively.

Another object of the present invention is to detect whether a suspicious file is malicious by using a machine learning model trained by the above method.

As a means for solving the above problems, the present invention has embodiments with the following features.

The present invention uses a machine learning model to generate, from original malicious code data already held, similar malicious code data that belongs to the label of the original malicious code data.

The machine learning model is a VAE-CNN (Variational Auto Encoder - Convolutional Neural Network).

The similar malicious code generation method includes a step of extracting label and malicious code instruction sequence information from the malicious code data already held.

The similar malicious code generation method includes a machine learning step. The machine learning step includes a discriminator training step in which the discriminator learns the original malicious code data and learns the similar malicious code data, so that it can determine whether input data contains malicious code.

In the machine learning step, the loss function of a VAE (Variational Auto Encoder) is learned.

The present invention also trains a malicious code detection machine learning model using the similar malicious code generated by the similar malicious code generation method described above.

A computing device for malicious code detection according to the present invention includes a similar malicious code generation unit that uses a machine learning model to generate, from original malicious code data already held, similar malicious code data belonging to the label of the original malicious code data.

The VAE-CNN includes a discriminator, a decoder, and an encoder. The discriminator learns the original malicious code data and the similar malicious code generated by the similar malicious code generation unit at the same time, and the decoder passes generated data to the discriminator in order to learn label-controlled generation.

The computing device includes a processor that trains the discriminator, the decoder, and the encoder, or that supports another device in training them.

The present invention can augment the data used for machine learning by using a variational autoencoder-convolutional neural network (VAE-CNN: Variational Auto Encoder - Convolutional Neural Network).

The present invention can also train a malicious code detection machine learning model effectively by using the augmented data.

Furthermore, by using a machine learning model trained effectively in this way, the present invention can improve the performance of machine-learning-based malicious code detection.

FIG. 1 is a conceptual diagram illustrating a basic malicious code detection system using machine learning.
FIG. 2 is a diagram illustrating a malicious code detection method using a conventional RNN machine learning model.
FIG. 3 is a diagram illustrating a malicious code detection method using a VAE-CNN machine learning model according to an embodiment of the present invention.
FIG. 4 is a diagram of a VAE-CNN model according to an embodiment of the present invention.
FIG. 5 is a block diagram of a computing device that performs malicious code detection according to an embodiment of the present invention.
FIG. 6 is a block diagram of a VAE-CNN model according to an embodiment of the present invention.
FIG. 7 is a diagram of an RNN-based malicious code detection model.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 2 is a diagram illustrating a malicious code detection method using a conventional RNN machine learning model.

First, a machine learning model (RNN: Recurrent Neural Network) is trained with files containing malicious code data (O) (S100). The trained RNN model is then used to detect whether a suspicious file is malicious (S200).

In the model training step (S100), the model is trained on feature information of the files (strings, instructions, byte information, API call records, and so on) together with their labels (normal/abnormal), so that it is optimized for malicious code detection.

In the detection step (S200), the trained model distinguishes whether an input file is a normal file or malicious code (abnormal).

However, using deep learning in this way requires a large amount of malicious code data to train the model, and such data is not easy to obtain. This is because a file can be confirmed as malicious code only after an actual infection, and because such malicious code data is treated as a kind of asset and is not disclosed.

To solve this problem, the present invention proposes a method that augments the data used for machine learning with a variational autoencoder-convolutional neural network (VAE-CNN: Variational Auto Encoder - Convolutional Neural Network) and uses the augmented data to train a malicious code detection machine learning model effectively.

In this specification, the malicious code data already held is called original malicious code data (O, Original), and the data generated to augment it is called similar malicious code data (A, Augmentation). The term "similar" reflects that the similar malicious code data (A) belongs to the label of the original malicious code data.

FIG. 3 is a diagram illustrating a malicious code detection method using a VAE-CNN machine learning model according to an embodiment of the present invention.

The method for improving malicious code detection performance of the present invention includes a step of generating similar malicious code data and using it to train a machine learning model (S100), and a step of detecting malicious code using the trained machine learning model (S200).

Step S100 includes a step of extracting label and malicious code instruction sequence information (S110), a step of training the machine learning model (S120), and a step of generating similar malicious code data (A) (S130).

The present invention is a technique for generating similar malicious code data from malicious code data already held.

Before the invention is described, malicious code data is first defined and explained.

Malicious code is a program. A program can be represented as a series of consecutive bytes, and those bytes can in turn be represented as a sequence of instructions (OPCODE). Accordingly, the present invention defines an instruction sequence as malicious code data.

[Table 1] shows the structure of the malicious code data already held.

| File name | Instruction length | Label | Malicious code instruction sequence |
|---|---|---|---|
| 0000abecf335.vir.asm | 200/400 | abnormal | PUSH, PUSH, PUSH, CALL, CALL, CALL, RET, … |

As shown in [Table 1], the malicious code data structure consists of a file name, an instruction length, a label, and a malicious code instruction sequence. However, the label and the malicious code instruction sequence alone are sufficient to train a machine learning model for detecting malicious code.

In step S110, the label and malicious code instruction sequence information, which is the information needed to train the machine learning model, is extracted from the malicious code data (O).

[Table 2] is an example of the label and malicious code instruction sequence information that remain after the unnecessary file name and instruction length information are removed from the malicious code data (O).

0 PUSH MOV SUB PUSH PUSH LEA PUSH PUSH MOV CALL OR MOV LEA STOS STOS STOS STOS ADD PUSH STOS POP MOV CALL MOV TEST JE PUSH PUSH MOV CALL PUSH LEA PUSH PUSH CALL TEST JE LEA LEA MOV IN TEST JNE PUSH LEA SUB PUSH PUSH LEA PUSH CALL TEST JE OR LEA MOV REP LEA LEA CALL LEA PUSH CALL TEST JNE LEA PUSH CALL MOV TEST JE AND MOV LEA POP LEA RETN PUSH MOV SUB PUSH PUSH PUSH MOV OR PUSH LEA PUSH PUSH MOV CALL PUSH LEA PUSH PUSH MOV CALL ADD IN PUSH LEA PUSH PUSH CALL LEA PUSH CALL CMP JE MOV LEA PUSH PUSH PUSH CALL PUSH PUSH CALL TEST JS LEA PUSH LEA CALL TEST JE MOV CALL TEST JE MOV PUSH CALL ...

Because actual malicious code instruction sequences are very long, the instruction length was limited to a maximum of 400, with reference to existing malicious code detection research.

The label and malicious code instruction sequence information extracted in step S110 is used to train the machine learning model and to generate the similar malicious code data (A).
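The extraction in step S110 can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (a dictionary with `file_name`, `label`, and a comma-separated `sequence` field), the `Sample` type, and the `extract` helper are hypothetical names, not taken from the patent. Only the 400-instruction cap comes from the text.

```python
from typing import List, NamedTuple

MAX_LEN = 400  # maximum instruction length stated in the description


class Sample(NamedTuple):
    """Hypothetical container for the two fields kept after step S110."""
    label: str
    opcodes: List[str]


def extract(record: dict, max_len: int = MAX_LEN) -> Sample:
    """Drop the file name and length; keep the label and the truncated opcode sequence."""
    opcodes = [tok.strip().upper() for tok in record["sequence"].split(",") if tok.strip()]
    return Sample(label=record["label"], opcodes=opcodes[:max_len])


# Assumed record layout, modelled loosely on the Table 1 example row.
record = {
    "file_name": "0000abecf335.vir.asm",
    "label": "abnormal",
    "sequence": "PUSH, PUSH, PUSH, CALL, CALL, CALL, RET",
}
sample = extract(record)
```

Truncating rather than rejecting long files keeps every sample usable, at the cost of discarding instructions beyond the cap.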

Step S120 is the step of training the machine learning model. The present invention uses a VAE-CNN (Variational Auto Encoder - Convolutional Neural Network) as the machine learning model.

Step S130 is the step of generating the similar malicious code data (A). The generated similar malicious code data (A) is used to train the machine learning model (S120).

Step S200 is the step of using the trained machine learning model to detect whether an input file (Unknown File) is a normal file or malicious code.

Before the input file (Unknown File) is fed to the CNN model, which is the trained machine learning model, it may go through an instruction sequence extraction step. This unifies the structure of the data used by the machine learning model.
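One common way to give training files and unknown files the same structure is to map each opcode to an integer id and pad every sequence to a fixed length. The sketch below assumes this approach. The `<pad>` and `<unk>` tokens and the helper names `build_vocab` and `encode` are illustrative assumptions and do not appear in the patent.

```python
PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def build_vocab(sequences):
    """Assign an integer id to every opcode seen in the training sequences."""
    vocab = {PAD: 0, UNK: 1}
    for seq in sequences:
        for op in seq:
            vocab.setdefault(op, len(vocab))
    return vocab


def encode(seq, vocab, length):
    """Map opcodes to ids, truncate to `length`, and right-pad with the PAD id."""
    ids = [vocab.get(op, vocab[UNK]) for op in seq[:length]]
    return ids + [vocab[PAD]] * (length - len(ids))


vocab = build_vocab([["PUSH", "MOV", "CALL"]])
encoded = encode(["PUSH", "XOR"], vocab, 4)  # XOR was never seen, so it maps to <unk>
```

An opcode that the trained model never saw falls back to the unknown id, so an unknown file can always be encoded.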

FIG. 4 is a diagram of a VAE-CNN model according to an embodiment of the present invention.

The VAE-CNN model consists of two parts. The VAE part combines an encoder and a decoder to generate similar malicious code data. The CNN part is a discriminator that applies the malicious code data already held and the malicious code data generated by the VAE to actual malicious code detection, in order to check whether the VAE part has generated proper similar malicious code data.

Malicious code data already held, as in [Table 2], is fed to the encoder as input. The input malicious code data is mapped to a latent representation z that has a normal distribution as its prior. From the latent representation z, the decoder generates similar malicious code data using the <s> token, which marks the start of malicious code data, and the label information of the original data (malicious file/normal file).
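A latent representation with a normal prior is usually sampled with the reparameterization trick, z = mu + sigma * eps with eps drawn from a standard normal. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension, which the patent does not state explicitly. The function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def sample_prior(dim, rng):
    """Sample z directly from the standard normal prior, e.g. to generate new data."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
z = sample_latent(mu=[1.0, -2.0], log_var=[-100.0, -100.0], rng=rng)  # near-zero variance
z_prior = sample_prior(dim=2, rng=rng)
```

With a very small variance the sample collapses onto the mean, which is why `z` above is essentially `[1.0, -2.0]`.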

[Table 3] is an example of similar malicious code data generated by the VAE.

0 SAHF LEA PUSH JNP MOV JNO OR JL SHL AAD CWD CALL OUT MUL DIV JB NOT AND ADC XCHG SAHF SUB WAIT LDS SUB JB AND OR JAE SUB NOT CMP CMP DEC SUB JNO XCHG OR LOCK SUB JE SUB ADC ADD SHL LEA POP LOCK LAHF RETN IN RCL AAD IN SCAS DEC ADD JA IN MOV SUB RCR NOP DAS MOV PUSH ADC DIV STC IN POP CLI CALL TEST IN IN JNE RCL PUSH CMP ADD LOCK JA DAS LES SHL DIV ADD PUSH ROR POP IN IN OUT POP TEST CALL DEC PUSH CMP LAHF PUSH ADD MOV POP LEA OUT POP ADD ADD JE CMP ADD PUSH SBB JP JO LEA SUB STOS LAHF DEC OR CMP JB OUT AND OUT MOV CMC PUSH AND POP MOV ADD IN XCHG JE POP SCAS SUB ADD JNE MOV TEST CMP OUT LOCK OR JNE POP ...

In other words, the VAE model generates similar malicious code data from the input malicious code data.

The CNN part, which is the discriminator, determines whether data is malicious code by using the malicious code data already held and the similar data produced by the VAE.

That is, the discriminator learns the original malicious code data and learns the similar malicious code data, so that it can determine whether input data is malicious code.

Expressed as equations, the training model of the discriminator is given by Equations 1 to 3 below.

Equation 1 expresses training on the original malicious code data (O), Equation 2 expresses training on the similar malicious code data (A), and Equation 3 expresses training on both the original malicious code data (O) and the similar malicious code data (A).

In these equations, one term is the empirical entropy, which pushes the discriminator's probability toward 1 or 0, and another symbol is the balancing parameter.
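Why an entropy term pushes a probability toward 1 or 0 can be seen from the Shannon entropy of a binary prediction, H(p) = -(p log p + (1 - p) log(1 - p)). The sketch below uses that standard formula as an assumption. The patent's exact term is not reproduced in this text, and `binary_entropy` is a hypothetical helper name.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction has zero entropy
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


uncertain = binary_entropy(0.5)   # maximum entropy, equal to log(2)
confident = binary_entropy(0.99)  # close to zero
```

Entropy is highest at p = 0.5 and falls to zero at p = 0 or p = 1, so minimizing it rewards confident discriminator outputs.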

The decoder and the encoder learn the loss function of the VAE (Variational Auto Encoder).

In addition, to learn label-controlled generation (normal program/malicious code), the decoder passes the generated data to the discriminator and maximizes P_Discriminator(label | instruction sequence).

Furthermore, so that the latent variable of the generated data takes a value similar to that of the data input to the encoder, the generated data is compressed again by the encoder, and the L2 norm between its latent variable and the latent variable of the original data is minimized.

Expressed as equations, the training model of the decoder is given by Equations 4 to 8 below.

Equation 4 represents the VAE loss.
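A VAE loss is conventionally the sum of a reconstruction term and a KL-divergence term that pulls the latent distribution toward the prior. The sketch below shows that standard form for a diagonal Gaussian posterior and a standard normal prior. It is an assumption about the general shape of the loss, not a reproduction of the patent's Equation 4, and all function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_nll(target_token_probs):
    """Negative log-likelihood of the target opcodes under the decoder's probabilities."""
    return -sum(math.log(p) for p in target_token_probs)


def vae_loss(target_token_probs, mu, log_var):
    """Standard VAE objective: reconstruction error plus KL regularizer."""
    return reconstruction_nll(target_token_probs) + kl_to_standard_normal(mu, log_var)


loss = vae_loss(target_token_probs=[0.5, 0.25], mu=[1.0], log_var=[0.0])
```

In the call above, the reconstruction term is log 2 + log 4 and the KL term is 0.5, so the two parts can be inspected separately.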

Referring to Equation 5, the decoder learns so that generated [ab]normal data is judged as [ab]normal by the discriminator.

Referring to Equation 6, the decoder learns all words of the output layer using Gumbel softmax.
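Gumbel softmax replaces a hard choice of one output token with a smooth probability vector, so that a training signal can reach every entry of the output layer. A minimal sketch of the standard technique follows. The temperature value and the function name are illustrative assumptions, and this is not the patent's Equation 6.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over the output vocabulary."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # clamp to avoid log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel) / temperature)
    top = max(perturbed)                  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
y = gumbel_softmax([100.0, 0.0, 0.0], temperature=1.0, rng=rng)
```

Lower temperatures make the output closer to a one-hot vector, while higher temperatures spread probability across more tokens.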

Referring to Equation 7, the decoder learns so that the latent variable of the generated data matches the latent variable of the input data.
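The latent matching described here amounts to penalizing the distance between two latent vectors. The sketch below uses the squared L2 distance as an assumed concrete form, since the patent's Equation 7 is not reproduced in this text. `latent_l2` is a hypothetical name.

```python
def latent_l2(z_input, z_generated):
    """Squared L2 distance between the input latent and the re-encoded generated latent."""
    return sum((a - b) ** 2 for a, b in zip(z_input, z_generated))


same = latent_l2([1.0, 2.0], [1.0, 2.0])   # identical latents give zero penalty
apart = latent_l2([0.0, 0.0], [3.0, 4.0])  # penalty grows with the distance
```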

Expressed as an equation, the training model of the encoder is given by Equation 4 described above.

The encoder learns only the VAE loss.

FIG. 5 is a block diagram of a computing device that performs malicious code detection according to an embodiment of the present invention, and FIG. 6 is a block diagram of a VAE-CNN model according to an embodiment of the present invention.

The computing device that performs malicious code detection includes a malicious code analysis unit (510), a similar malicious code generation unit (520), and a malicious code detection unit (530).

The malicious code analysis unit (510) analyzes the malicious code data already held and extracts label and malicious code instruction sequence information from the original malicious code data (O).

The similar malicious code generation unit (520) uses a machine learning model to generate, from the original malicious code data already held, similar malicious code data belonging to the label of the original malicious code data.

The similar malicious code generation unit (520) includes a discriminator (610), a decoder (620), and an encoder (630).

The machine learning model may be a VAE-CNN (Variational Auto Encoder - Convolutional Neural Network).

The discriminator (610) determines whether data is malicious code by using the malicious code data already held and the similar data produced by the VAE (Variational Auto Encoder).

The decoder (620) and the encoder (630) learn the loss function of the VAE.

FIG. 7 is a diagram of an RNN-based malicious code detection model.

The malicious code detection model of FIG. 7 is used to check whether the similar malicious code data generated by the VAE of FIG. 4, which corresponds to the present invention, actually works well with a different malicious code detection algorithm.

The RNN-based malicious code detection model of FIG. 7 is a Recurrent Neural Network (RNN) model that takes a sequence of malicious code data as input and determines whether the input data is malicious code or a normal file. It corresponds to the machine learning algorithm.

The data used in the experiment on the model of FIG. 7 has the same format as the data in [Table 1] and [Table 2], and its statistics are shown in [Table 4].

| | Training data | Evaluation data | Dictionary size |
|---|---|---|---|
| Existing data | 23,389 | 5,847 | 75 |
| Data generated using VAE-CNN model | 23,389 | - | |

The experimental results for the model of FIG. 7 are shown in [Table 5].

| Instruction sequence length | Detection rate (before) | Detection rate (after) | False positive rate (before) | False positive rate (after) | Accuracy (before) | Accuracy (after) |
|---|---|---|---|---|---|---|
| 200 | 90.53 | 91.42 | 6.31 | 5.89 | 90.05 | 92.83 |
| 400 | 90.12 | 90.81 | 8.53 | 6.86 | 90.80 | 92.00 |

In Table 5, "before" refers to training with only the malicious code data already held (O), and "after" refers to training with data augmented by the similar malicious code data (A) generated by the VAE-CNN model of the present invention.

To increase the reliability of the results, the evaluation used K-fold cross validation. K-fold cross validation divides the data set into K parts, uses each part in turn for evaluation (with the remaining K-1 parts used for training), and averages the results. The experiments used K = 5.
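The K-fold procedure can be sketched as an index splitter. This is a minimal illustration that splits contiguous index ranges without shuffling. Whether the experiments shuffled or stratified the data is not stated, so that choice is an assumption, and `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, eval_indices) for each of the k folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        eval_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, eval_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
# The reported metric would be the average of the per-fold evaluation scores.
```

As a consistency check on Table 4, 23,389 training samples plus 5,847 evaluation samples total 29,236, and one fifth of 29,236 is about 5,847, which matches a 5-way split.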

In the experiments with generated data added, the existing data was given a weight of 0.9 and the newly added similar malicious code data a weight of 0.1 when training the malicious code detection model.
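One plausible reading of the 0.9 / 0.1 weighting is a per-sample weight on the classification loss. The sketch below assumes that reading and uses binary cross-entropy. The patent does not specify how the weights enter training, so both the loss and the helper name `weighted_bce` are assumptions.

```python
import math

ORIGINAL_WEIGHT = 0.9   # weight for existing data, from the description
GENERATED_WEIGHT = 0.1  # weight for generated similar data, from the description


def weighted_bce(examples):
    """Weighted mean binary cross-entropy.

    examples: iterable of (predicted_prob_malicious, label, is_generated),
    where label is 1 for malicious code and 0 for a normal file.
    """
    total, weight_sum = 0.0, 0.0
    for p, y, is_generated in examples:
        w = GENERATED_WEIGHT if is_generated else ORIGINAL_WEIGHT
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum


# A mistake on an original sample costs more than the same mistake on a generated one.
miss_original = weighted_bce([(0.1, 1, False), (0.9, 1, True)])
miss_generated = weighted_bce([(0.9, 1, False), (0.1, 1, True)])
```

Down-weighting generated samples limits the damage if some of them are unrealistic, while still letting them contribute to training.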

As Table 5 shows, malicious code detection performance improved in detection rate, false positive rate, and accuracy. Table 5 also shows that performance drops as the instruction sequence becomes longer.
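The three metrics in Table 5 are not defined in the text. The sketch below assumes the usual confusion-matrix definitions: detection rate as the fraction of malicious files flagged, false positive rate as the fraction of normal files flagged, and accuracy as the fraction of all files classified correctly. The counts in the example are made up for illustration and are not experimental data.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the assumed metric definitions from confusion-matrix counts."""
    return {
        "detection_rate": tp / (tp + fn),       # malicious files correctly flagged
        "false_positive_rate": fp / (fp + tn),  # normal files wrongly flagged
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative counts only: 100 malicious and 100 normal files.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```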

The embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

510: malicious code analysis unit
520: similar malicious code generation unit
530: malicious code detection unit
610: discriminator
620: decoder
630: encoder

Claims

1. A similar malicious code generation method comprising generating, by using a machine learning model, similar malicious code data belonging to a label of original malicious code data from the original malicious code data already held.

2. The method of claim 1, wherein the machine learning model is a VAE-CNN (Variational Auto Encoder - Convolutional Neural Network).

3. The method of claim 1, comprising extracting label and malicious code instruction sequence information from the malicious code data already held.

4. The method of claim 1, comprising a machine learning step, wherein the machine learning step comprises a discriminator training step of learning the original malicious code data and learning the similar malicious code data, so as to determine whether input data contains malicious code.

5. The method of claim 4, wherein the machine learning step learns the loss function of a VAE (Variational Auto Encoder).

6. A method of improving malicious code detection performance, wherein a malicious code detection machine learning model is trained using the similar malicious code generated by the similar malicious code generation method according to any one of claims 1 to 5.

7. A computing device for performing malicious code detection, comprising a similar malicious code generation unit that generates, by using a machine learning model, similar malicious code data belonging to a label of original malicious code data from the original malicious code data already held.

8. The computing device of claim 7, wherein the machine learning model is a VAE-CNN (Variational Auto Encoder - Convolutional Neural Network).

9. The computing device of claim 8, wherein the VAE-CNN includes a discriminator, a decoder, and an encoder; the discriminator simultaneously learns the original malicious code data and the similar malicious code generated by the similar malicious code generation unit; and the decoder passes generated data to the discriminator in order to learn label-controlled generation.

10. The computing device of claim 9, comprising a processor that trains the discriminator, the decoder, and the encoder, or that supports another device in training the discriminator, the decoder, and the encoder.