KR20200067044A

KR20200067044A - Method and apparatus for detecting malicious file

Info

Publication number: KR20200067044A
Application number: KR1020180153916A
Authority: KR
Inventors: 최선오
Original assignee: 한국전자통신연구원
Priority date: 2018-12-03
Filing date: 2018-12-03
Publication date: 2020-06-11

Abstract

Provided are an apparatus and a method for detecting a malicious file. The method for detecting a malicious file comprises the steps of: extracting feature data from feature data sequences of files collected on a network; and determining whether the file is a normal file or a malicious file by inputting the feature data into a deep learning model based on machine learning performed by inputting the feature data of a previously known normal file and a malicious file.

Description

Method and apparatus for detecting malicious file

The present disclosure relates to a method and apparatus for detecting a malicious file using feature data of the file.

Recently, many studies have sought to detect malicious files (also called malicious software or malware) that can adversely affect a computer by using machine learning or deep learning techniques. To detect malicious files with these techniques, an artificial intelligence model is trained on feature data extracted from malicious files. When there are many types of feature data, an analyst must manually select a subset of the feature types to keep machine learning efficient.

One embodiment provides a method for detecting a malicious file based on feature data extracted from the file.

Another embodiment provides an apparatus for detecting a malicious file based on feature data extracted from the file.

According to one embodiment, a method for detecting a malicious file in a network is provided. The malicious file detection method includes extracting feature data from a feature data sequence of a file collected on the network, and determining whether the file is a normal file or a malicious file by inputting the feature data into a deep learning model based on machine learning performed using feature data of previously known normal files and malicious files as input.

In the malicious file detection method, the feature data sequence may be a system call sequence of the file, and the feature data may be some of the system calls in the system call sequence.

In the malicious file detection method, extracting the feature data may include assigning a weight to each system call in the system call sequence through an attention mechanism, and selecting important system calls from the system call sequence in descending order of weight.

In the malicious file detection method, the deep learning model may be a Long Short Term Memory (LSTM) network.

In the malicious file detection method, the deep learning model may be a Skip-Connected Long Short Term Memory (LSTM) network.

According to another embodiment, a malicious file detection apparatus is provided. The malicious file detection apparatus includes a processor, a memory, and a network interface. The processor executes a program stored in the memory to perform extracting feature data from a feature data sequence of a file collected on a network through the network interface, and determining whether the file is a normal file or a malicious file by inputting the feature data into a deep learning model based on machine learning performed using feature data of previously known normal files and malicious files as input.

In the malicious file detection apparatus, the feature data sequence may be a system call sequence of the file, and the feature data may be some of the system calls in the system call sequence.

In the malicious file detection apparatus, when extracting the feature data, the processor may perform assigning a weight to each system call in the system call sequence through an attention mechanism, and selecting important system calls from the system call sequence in descending order of weight.

In the malicious file detection apparatus, the deep learning model may be a Long Short Term Memory (LSTM) network.

In the malicious file detection apparatus, the deep learning model may be a Skip-Connected Long Short Term Memory (LSTM) network.

By performing deep learning on a predetermined number of feature data items selected in descending order of weight, the computing resources of the malicious file detection apparatus can be saved and the speed of detecting malicious files can be improved. In addition, using only the feature data determined through the attention mechanism may also improve the detection rate for malicious files.

FIG. 1 is a flowchart illustrating a malicious file detection method according to an embodiment.
FIG. 2 is a conceptual diagram showing a deep learning model of a malicious file detection method according to an embodiment.
FIG. 3 is a block diagram showing a malicious file detection apparatus according to an embodiment.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice them. However, the present disclosure can be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals are assigned to similar parts throughout the specification.

FIG. 1 is a flowchart illustrating a malicious file detection method according to an embodiment, and FIG. 2 is a conceptual diagram showing a deep learning model of a malicious file detection method according to an embodiment.

The malicious file detection apparatus according to an embodiment first calculates weights for the feature data sequence of a file in order to extract feature data (S110). The feature data sequence of the file may be a system call sequence. Referring to FIG. 1, the malicious file detection apparatus may calculate the weight of each system call in the file's system call sequence {S_1, S_2, ..., S_n} and extract important system calls as feature data according to the magnitude of the weights. For example, the system call sequence of a file may be collected as follows through a malware analysis utility based on dynamic analysis (for example, Cuckoo Sandbox).

{LoadLibrary, LoadCursor, RegisterClass, GetThreadLocal, strcmp, GlobalAlloc, GlobalFree, FindResource, LoadResource, VirtualProtect}

In the example above, the length of the file's system call sequence is 10. A malicious file may have a longer system call sequence than this. According to an embodiment, even for system call sequences of the same length, the detection rate can be improved when the system calls that an attention mechanism determines to have large weights are used to detect malicious files. The attention mechanism is described in detail below.

The malicious file detection apparatus according to an embodiment may extract important system calls from the system call sequences of previously known normal files and malicious files, and perform machine learning using the extracted important system calls as input. The malicious file detection apparatus may then determine whether a file is a normal file or a malicious file by inputting the feature data of the file, collected on a network, into the deep learning model generated from the result of the machine learning.

Here, the malicious file detection apparatus according to an embodiment may extract feature data from the feature data sequence of the file based on an attention mechanism. The attention mechanism, used in deep learning models of the recurrent neural network (RNN) family, is a method that lets the deep learning model focus on the important part (that is, the feature data) of the input source (that is, the feature data sequence). The malicious file detection apparatus according to an embodiment uses the attention mechanism to extract feature data from the feature data sequence of a file as the input for machine learning, performs machine learning with the extracted feature data as input, and determines whether a file is malicious using the deep learning model generated as a result of the machine learning. Referring to FIG. 2, when the system call sequence of a file is provided as the input {X_1, X_2, X_3, ..., X_T}, the malicious file detection apparatus according to an embodiment outputs, as Y_t, the result indicating whether the file is normal or malicious.
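The description does not state how the attention weights a(t,i) are computed. The following is a minimal sketch, assuming dot-product scoring between a query vector and one hidden-state vector per system call, followed by softmax normalization. The function name `attention_weights`, the vectors, and the scoring function are illustrative assumptions, not the patent's method.

```python
import math


def attention_weights(query, keys):
    """Return softmax-normalized dot-product scores of `query` against each key.

    Assumption: the score of position i is the dot product of the query vector
    and the hidden-state vector of system call i. The patent does not specify
    the scoring function.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical hidden states for a three-call sequence X_1, X_2, X_3.
keys = [[0.1, 0.0], [0.9, 0.4], [0.2, 0.3]]
query = [1.0, 1.0]  # hypothetical decoder state at step t
weights = attention_weights(query, keys)  # plays the role of a(t,1), a(t,2), a(t,3)
```

The weights are non-negative and sum to 1, so they can be compared directly when ranking system calls.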

Thereafter, the malicious file detection apparatus may determine the feature data by selecting k important system calls from the system call sequence in descending order of weight (S120), where k < n. In FIG. 1, the weights for the system calls X_i of the system call sequence are {a(t,1), a(t,2), a(t,3), ..., a(t,T)}, so the weight W_i corresponding to system call X_i is a(t,i). The column vector composed of the system calls and their weights is as follows.

{(S_1, W_1), (S_2, W_2), ..., (S_n, W_n)}

The new system call sequence composed of the important system calls is as follows.

{(S_1', W_1'), (S_2', W_2'), ..., (S_k', W_k')}
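Steps S110 and S120 can be sketched as a top-k selection over (system call, weight) pairs. This is an illustrative sketch, not the patent's implementation: the function name `select_important_calls` and the weight values are hypothetical, and keeping the selected calls in their original execution order is an assumption, since the description does not say how the new sequence is ordered.

```python
def select_important_calls(calls, weights, k):
    """Keep the k system calls with the largest weights.

    Assumption: the selected calls stay in their original execution order,
    which the patent does not specify.
    """
    if len(calls) != len(weights):
        raise ValueError("calls and weights must have the same length")
    if not 1 <= k <= len(calls):
        raise ValueError("k must be between 1 and the sequence length")
    # Rank positions by weight, largest first, and keep the first k positions.
    ranked = sorted(range(len(calls)), key=lambda i: weights[i], reverse=True)[:k]
    # Restore execution order for the sequence that is fed to the model.
    return [(calls[i], weights[i]) for i in sorted(ranked)]


# System call sequence from the description; the weights are made up.
calls = ["LoadLibrary", "LoadCursor", "RegisterClass", "GetThreadLocal", "strcmp",
         "GlobalAlloc", "GlobalFree", "FindResource", "LoadResource", "VirtualProtect"]
weights = [0.15, 0.02, 0.03, 0.05, 0.04, 0.10, 0.06, 0.08, 0.12, 0.35]

important = select_important_calls(calls, weights, k=3)
important_names = [name for name, _ in important]
```

With these hypothetical weights and k = 3, the ten-call sequence is reduced to three (system call, weight) pairs before it reaches the deep learning model.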

Thereafter, the malicious file detection apparatus according to an embodiment determines whether the file is malicious by inputting the new system call sequence composed of the important system calls into the deep learning model (S130). The deep learning model is generated based on machine learning performed using feature data of previously known normal files and malicious files as input. The deep learning model according to an embodiment may be a Long Short Term Memory (LSTM) network, which addresses the long-term dependency problem. Alternatively, the deep learning model according to an embodiment may be a Skip-Connected LSTM network, in which connections between RNN cells are added, changed, or deleted.
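To make step S130 concrete, the sketch below runs the forward pass of a single-layer LSTM over a sequence of system call IDs and maps the final hidden state to a probability. It is a hedged illustration only: the class `TinyLSTMClassifier`, the vocabulary, the dimensions, and the random (untrained) weights are all assumptions, so the output carries no real detection meaning. A real model would be trained on labeled normal and malicious files, and the skip-connected variant is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMClassifier:
    """Untrained single-layer LSTM with a sigmoid output (illustration only)."""

    def __init__(self, vocab_size, embed_dim=4, hidden_dim=4, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.embed = rand(vocab_size, embed_dim)  # one vector per system call ID
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.W = {g: rand(hidden_dim, embed_dim + hidden_dim) for g in "ifog"}
        self.b = {g: [0.0] * hidden_dim for g in "ifog"}
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def _gate(self, name, z, act):
        return [act(sum(w * x for w, x in zip(row, z)) + b)
                for row, b in zip(self.W[name], self.b[name])]

    def predict_proba(self, call_ids):
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for idx in call_ids:
            z = self.embed[idx] + h  # concatenate the embedding and previous hidden state
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            o = self._gate("o", z, sigmoid)
            g = self._gate("g", z, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        # Map the final hidden state to a score in (0, 1).
        return sigmoid(sum(w * x for w, x in zip(self.w_out, h)))


# Hypothetical vocabulary covering only the selected important system calls.
vocab = {"LoadLibrary": 0, "LoadResource": 1, "VirtualProtect": 2}
model = TinyLSTMClassifier(vocab_size=len(vocab))
sequence = [vocab[name] for name in ["LoadLibrary", "LoadResource", "VirtualProtect"]]
p_malicious = model.predict_proba(sequence)
```

Because the per-step cost of the loop is fixed, the forward pass scales linearly with sequence length, which is why shortening the input to k important system calls reduces computation.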

When deep learning is performed using a predetermined number of feature data items that have high weights as determined through the attention mechanism, the computing resources of the malicious file detection apparatus are saved and its speed of detecting malicious files can be improved, compared with using the entire feature data sequence. In addition, the malicious file detection apparatus according to an embodiment may also improve the detection rate for malicious files by using only the feature data determined through the attention mechanism.

FIG. 3 is a block diagram showing a malicious file detection apparatus according to an embodiment.

The malicious file detection apparatus according to an embodiment may be implemented as a computer system, for example, a computer-readable medium. Referring to FIG. 3, the computer system 300 may include at least one of a processor 310, a memory 330, a storage device 340, an input interface device 350, and an output interface device 360, which communicate through a bus 320. The computer system 300 may also include a communication device 370 coupled to a network. The processor 310 may be a central processing unit (CPU) or a semiconductor device that executes instructions stored in the memory 330 or the storage device 340. The memory 330 and the storage device 340 may include various types of volatile or non-volatile storage media. For example, the memory may include read-only memory (ROM) and random access memory (RAM).

In the embodiments of the present disclosure, the memory may be located inside or outside the processor, and the memory may be connected to the processor through various known means. The memory is a volatile or non-volatile storage medium of various types; for example, the memory may include read-only memory (ROM) or random access memory (RAM).

Although the embodiments have been described in detail above, the scope of rights is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concepts defined in the following claims also fall within the scope of rights.

Claims

A method for detecting a malicious file on a network, the method comprising:
extracting feature data from a feature data sequence of a file collected on the network; and
determining whether the file is a normal file or a malicious file by inputting the feature data into a deep learning model based on machine learning performed using feature data of previously known normal files and malicious files as input.

The method of claim 1,
wherein the feature data sequence is a system call sequence of the file, and the feature data is some of the system calls in the system call sequence.

The method of claim 2,
wherein extracting the feature data comprises:
assigning a weight to each system call included in the system call sequence through an attention mechanism; and
selecting important system calls from the system call sequence in descending order of weight.

The method of claim 1,
wherein the deep learning model is a Long Short Term Memory (LSTM) network.

The method of claim 1,
wherein the deep learning model is a Skip-Connected Long Short Term Memory (LSTM) network.

A malicious file detection apparatus comprising:
a processor, a memory, and a network interface,
wherein the processor executes a program stored in the memory to perform:
extracting feature data from a feature data sequence of a file collected on a network through the network interface; and
determining whether the file is a normal file or a malicious file by inputting the feature data into a deep learning model based on machine learning performed using feature data of previously known normal files and malicious files as input.

The apparatus of claim 6,
wherein the feature data sequence is a system call sequence of the file, and the feature data is some of the system calls in the system call sequence.

The apparatus of claim 7,
wherein, when extracting the feature data, the processor performs:
assigning a weight to each system call included in the system call sequence through an attention mechanism; and
selecting important system calls from the system call sequence in descending order of weight.

The apparatus of claim 6,
wherein the deep learning model is a Long Short Term Memory (LSTM) network.

The apparatus of claim 6,
wherein the deep learning model is a Skip-Connected Long Short Term Memory (LSTM) network.