KR102031592B1

KR102031592B1 - Method and apparatus for detecting the malware

Info

Publication number: KR102031592B1
Application number: KR1020180023603A
Authority: KR
Inventors: 곽진; 김득훈; 이태진; 이은지; 배원일; 우시재
Original assignee: 아주대학교산학협력단
Priority date: 2018-02-27
Filing date: 2018-02-27
Publication date: 2019-10-14
Also published as: KR20190102744A

Abstract

The present invention relates to a method and apparatus for analyzing malicious code. More specifically, the method obtains a specific number of first data and second data from a user and generates, from each of the first data and the second data, a data set for analyzing the malicious code. Each data set includes a plurality of symbols arbitrarily listed from the first data or the second data. The method analyzes, among the plurality of symbols, a specific characteristic for identifying the malicious code, and performs a statistical analysis based on that characteristic to determine whether malicious code is present. The first data may be data that does not include malicious code, and the second data may be data that includes malicious code.

Description

METHOD AND APPARATUS FOR DETECTING THE MALWARE

The present specification relates to a method and apparatus for detecting malicious code. More specifically, it relates to a method and apparatus for detecting malicious code using a statistical analysis method.

Conventional detection of malicious code or viruses has mainly been performed based on basic file information or patterns.

That is, various kinds of basic information about the malicious code to be detected are stored in a database, and the information on every file is cross-checked against the information stored in the database to determine whether a file is malicious.

Such prior art has the advantage that, when the characteristics of a malicious code file are already on record, the malicious code can be detected quickly and accurately.

However, when the characteristics of a malicious code file are not on record, that is, in the case of unknown malicious code, detection itself is impossible. Even for known malicious code, once a variant appears, detection is difficult even though the variant causes the same harmful behavior.

Research on static analysis of malicious code is actively under way through PE-based and N-gram-based approaches, among others. However, because malicious code is evolving into modified forms through techniques such as obfuscation, existing static analysis methods have limitations, and dynamic analysis has relatively large overhead.

Therefore, a static analysis method for malicious code that is not affected by obfuscation needs to be studied. Accordingly, based on the fact that certain symbol characters appear relatively frequently in obfuscated malicious code, the present invention proposes an analysis model for detecting malicious code based on specific symbols.

The technical problems to be solved in the present specification are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

A method for analyzing malicious code according to the present invention includes: obtaining a specific number of first data and second data from a user; generating, from each of the first data and the second data, a data set for analyzing the malicious code, each data set including a plurality of symbols arbitrarily listed from the first data or the second data; analyzing, among the plurality of symbols, a specific characteristic for identifying the malicious code; and performing a statistical analysis for determining whether malicious code is present based on the specific characteristic, wherein the first data is data that does not include malicious code and the second data is data that includes malicious code.

The present invention also provides an apparatus including: an input unit that obtains a specific number of first data and second data from a user; a data processing unit that generates, from each of the first data and the second data, a data set for analyzing the malicious code, each data set including a plurality of symbols arbitrarily listed from the first data or the second data; an analysis unit that analyzes, among the plurality of symbols, a specific characteristic for identifying the malicious code; and a determination unit that performs a statistical analysis for determining whether malicious code is present based on the specific characteristic, wherein the first data is data that does not include malicious code and the second data is data that includes malicious code.

According to an embodiment of the present invention, obfuscated malicious code can be detected through statistical analysis using specific symbols.

Further, according to an embodiment of the present invention, by statistically analyzing the frequency of specific symbol strings in data not infected with malicious code and in infected data, malicious code that employs maliciously concealing techniques such as code obfuscation and anti-debugging can be detected.

The effects obtainable from the present specification are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the present invention and, together with the detailed description, explain its technical features.
FIG. 1 is a diagram illustrating the overall configuration of the apparatus for detecting malicious code proposed in the present specification.
FIG. 2 is a diagram illustrating a part of the apparatus for detecting malicious code proposed in the present specification.
FIG. 3 is a diagram illustrating the symbol-based malicious code detection model proposed in the present specification.
FIG. 4 is a diagram illustrating the symbol-based method for detecting malicious code proposed in the present specification.
FIG. 5 is an example of a graph for comparing the string counts of malicious code as proposed in the present specification.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to particular embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the present invention, detailed descriptions of related known technology are omitted where they are judged likely to obscure the gist of the invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Because most malicious code takes the form of binary files, various approaches to static analysis of malicious code based on binary files are being studied.

For example, there are systems for analyzing malicious code, such as a static analysis system that extracts a static dependency tree by analyzing the hierarchical dependencies between the PE format and DLLs.

However, such static analysis of malicious code has the problem that packing or obfuscation must first be resolved. Accordingly, the present invention proposes a symbol-based malicious code analysis model to address the problem of packing or obfuscation.

FIG. 1 is a diagram illustrating the overall configuration of the apparatus for detecting malicious code proposed in the present specification.

The apparatus 100 for detecting malicious code based on symbols may include an input unit 110, a storage unit 120, a control unit 130, and/or an output unit 140.

The components shown in FIG. 1 are not essential, so an electronic device having more or fewer components may also be implemented.

The components are described in turn below.

The input unit 110 allows the user to generate input data for operating the apparatus and may receive from the user sample files for the symbol-based malicious code detection model. Here, the input data may be binary files.

For example, the input unit 110 may receive various sample files for the symbol-based malicious code detection model described later.

The storage unit 120 may store a program for the operation of the control unit 130 and may temporarily store input/output data. It may store the sample files for the symbol-based malicious code detection model received from the user, as well as the analysis results for the malicious code.

The storage unit 120 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk.

The apparatus for detecting malicious code may also operate in connection with web storage that performs the storage function of the storage unit 120 over the Internet.

The control unit 130 typically controls the overall operation of the apparatus.

For example, the control unit 130 may analyze the sample files input by the user to generate data sets, and may analyze, based on the generated data sets, the symbol characteristic used to detect malicious code.

In addition, the control unit 130 may use the analyzed symbol characteristic to determine, through statistical analysis, whether input data is malicious code.

The output unit 140 generates visual, auditory, and similar output, and outputs the information processed by the apparatus 100.

For example, it may output, in graph form, the number of strings for a specific symbol as processed by the control unit 130, and may output whether the input data is malicious code.

FIG. 2 is a diagram illustrating a part of the apparatus for detecting malicious code proposed in the present specification.

Referring to FIG. 2, the control unit 130 may include a data processing unit 132, an analysis unit 134, and/or a determination unit 136 in order to detect malicious code based on specific symbols.

The components shown in FIG. 2 are not essential, so an electronic device having more or fewer components may also be implemented.

The data processing unit 132 may select, from the data input through the input unit 110, a plurality of symbol characters for analyzing malicious code, and may compile the selected symbol characters into a list.

Here, the plurality of symbol characters for analyzing malicious code may refer to special character symbols such as @, +, and * produced by the static analysis of malicious code. Accordingly, a symbol character may mean a special character.

In addition, the data processing unit 132 may extract or generate a data set for the input data by extracting the number of symbol strings that contain each of the listed symbol characters.

The analysis unit 134 may classify the string counts included in the data set extracted or generated by the data processing unit 132 into specific ranges and analyze, for each symbol, the number of files that fall into each range.

In addition, based on the analysis result, the analysis unit 134 may compare, for each symbol, the distributions of the number of normal data and malicious code data, and may determine as the symbol characteristic for identifying malicious code the set of symbols for which the numbers of normal data and malicious code data differ by at least a certain amount.

The determination unit 136 may identify malicious code through statistical analysis using the symbol characteristic determined by the analysis unit 134.

For example, the determination unit 136 may select an independent variable and a dependent variable for identifying malicious code, and may identify malicious code by applying the selected variables to a logistic regression analysis.

Here, the independent variable is one of the specific character symbols included in the symbol characteristic determined by the analysis unit 134, and the dependent variable is a variable for determining whether the input data is malicious code.

For example, if the logistic regression analysis yields a dependent variable value of '0' for the input data, the data may be judged normal; if the value is '1', the data may be judged to be malicious code.

Using this method, the control unit 130 may generate a symbol-based static analysis model for malicious code, use the generated model to determine whether input data is malicious code, and output the result through the output unit 140.

FIG. 3 is a diagram illustrating the symbol-based malicious code detection model proposed in the present specification.

Referring to FIG. 3, a symbol-based malicious code detection model may be generated by inputting a specific number of malicious code samples and normal data samples. The specific method is described below.

1. Input sample file

To generate the symbol-based malicious code detection model, n samples each of normal binary data (first data) and malicious code binary data (second data) are obtained from the user. Here, binary data may mean binary files.

2. Extract data set

Next, a data set is extracted or generated from each of the n input first data and the n input second data.

First, a plurality of arbitrary symbols to be analyzed for identifying malicious code are selected and compiled into a list.

Each data set is then extracted or generated by extracting, from each of the first data and the second data, the number of symbol strings that contain each of the listed symbols.
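The counting in step 2 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the specification: the symbol list, the minimum string length of 4 (chosen to mirror the default of the `strings` utility), and the function names `extract_strings` and `symbol_string_counts` are all hypothetical.

```python
import re

# Hypothetical symbol list; the specification names '@', '+', '*' only as examples.
SYMBOLS = ["@", "+", "*"]
MIN_LEN = 4  # assumed minimum string length, mirroring the `strings` default


def extract_strings(blob, min_len=MIN_LEN):
    """Return runs of printable ASCII bytes that are at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]


def symbol_string_counts(blob, symbols=SYMBOLS):
    """For each symbol, count the extracted strings that contain it."""
    strings = extract_strings(blob)
    return {s: sum(1 for text in strings if s in text) for s in symbols}


# Toy binary blob standing in for one sample file.
sample = b"\x00\x01user@host\x00a+b+c\x00\xff\xfeplain text\x00x*y\x00"
counts = symbol_string_counts(sample)
print(counts)
```

One row of the data set would then be the per-symbol counts for one file. Note that `x*y` is shorter than the assumed minimum length and is therefore not counted.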

3. Feature analysis

Next, each extracted or generated data set is analyzed for the specific characteristic that distinguishes normal data from data containing malicious code.

Specifically, the symbol string counts, which are non-linear values, are classified into specific ranges, and the numbers of first data and second data that fall into each range are analyzed for each symbol.

Then, based on the analysis result, the distributions of the numbers of first data and second data are compared for each symbol, and the symbols for which the difference between the number of first data and the number of second data is at least a certain amount are selected.

The symbols selected in this way may be chosen as the symbol characteristic for identifying malicious code.
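The selection described above can be sketched as follows. The range boundaries, the threshold, and the rule of comparing ranges one by one and taking the largest gap are assumptions made for illustration; the specification does not fix any of them, and the input numbers are toy values.

```python
from bisect import bisect_right

# Assumed range boundaries: [0,10), [10,50), [50,100), [100, inf).
BIN_EDGES = [0, 10, 50, 100]
MIN_DIFF = 2  # assumed minimum difference in file counts


def histogram(values, edges=BIN_EDGES):
    """Count how many files fall into each range of symbol string counts."""
    hist = [0] * len(edges)
    for v in values:
        hist[bisect_right(edges, v) - 1] += 1
    return hist


def select_symbols(normal, malware, min_diff=MIN_DIFF):
    """Keep symbols whose normal/malware histograms differ enough in some range."""
    selected = []
    for sym in normal:
        h_normal = histogram(normal[sym])
        h_malware = histogram(malware[sym])
        if max(abs(a - b) for a, b in zip(h_normal, h_malware)) >= min_diff:
            selected.append(sym)
    return selected


# Toy per-file symbol string counts (hypothetical values).
normal = {"+": [1, 3, 5, 8], "@": [2, 12, 60, 150]}
malware = {"+": [55, 70, 120, 300], "@": [4, 15, 70, 130]}
selected = select_symbols(normal, malware)
print(selected)
```

In this toy input the '+' counts of the two groups fall into different ranges, while the '@' counts are distributed identically, so only '+' is kept.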

For the selected symbol characteristic, the data set extracted or generated in step 2 may be processed based on the specific ranges.

For example, the per-data symbol string counts in the data set extracted or generated in step 2 may be converted into the middle value of the range into which each count was classified. This conversion turns data output as non-linear values into categorical data, which can improve the efficiency of the statistical analysis.
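A minimal sketch of this conversion follows. It assumes that the "middle value" is the arithmetic midpoint of each range, that the ranges are the hypothetical ones listed in `RANGES`, and that counts beyond the last range are mapped to the last range's midpoint; none of these choices is stated in the specification.

```python
# Hypothetical ranges [low, high) for symbol string counts.
RANGES = [(0, 10), (10, 50), (50, 100), (100, 500)]


def to_midpoint(count, ranges=RANGES):
    """Replace a raw count with the midpoint of the range it falls into."""
    for low, high in ranges:
        if low <= count < high:
            return (low + high) / 2
    # Assumption: counts beyond the last range use the last range's midpoint.
    low, high = ranges[-1]
    return (low + high) / 2


raw = [3, 10, 49, 75, 120, 9000]
converted = [to_midpoint(c) for c in raw]
print(converted)  # [5.0, 30.0, 30.0, 75.0, 300.0, 300.0]
```

After the conversion every file takes one of only a few values per symbol, which is what lets the counts be treated as categories.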

4. Statistical analysis

Various machine-learning-based analyses may be used for malicious code analysis. Machine learning is a field of artificial intelligence concerned with developing algorithms and techniques that enable computers to learn.

Accordingly, logistic regression analysis may be used for the statistical analysis of the data set processed in step 3.

First, the independent and dependent variables to be applied to the logistic regression analysis may be selected in order to identify data containing malicious code.

Here, the independent variable is one of the plurality of symbols included in the symbol characteristic used to identify malicious code, and the dependent variable is a variable indicating whether the data being classified from the independent variable's data contains malicious code.

That is, whether the input data is normal or malicious code is expressed through the dependent variable, by a probability calculation from the data (for example, symbol occurrence frequency) of the symbols of the symbol characteristic, which are the independent variables.

Then, the selected independent variables, the dependent variable, and the processed data may be applied to the logistic regression analysis to determine whether each input data item is normal or malicious code.

For example, if the dependent variable output by the logistic regression analysis for a particular input data item is '0', that item may be normal data; if it is '1', the item may be data containing malicious code.
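The following is a small, self-contained sketch of such a logistic regression, written with plain gradient descent rather than the SPSS tool used in the experiment. The training rows, the scaling of the features to [0, 1], the learning rate, the epoch count, and the 0.5 decision threshold are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (malicious code) or 0 (normal) for one feature row."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0


# Hypothetical rows: range-midpoint features for two symbols, scaled to [0, 1].
X = [[0.02, 0.02], [0.02, 0.10], [0.10, 0.02],
     [0.25, 1.00], [1.00, 0.25], [1.00, 1.00]]
y = [0, 0, 0, 1, 1, 1]  # 0 = normal data, 1 = data containing malicious code

w, b = train(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

Each independent variable corresponds to one selected symbol, and the thresholded probability plays the role of the '0'/'1' dependent variable.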

With this symbol-based malicious code identification model, it is possible to determine whether obfuscated data contains malicious code.

FIG. 4 is a diagram illustrating the symbol-based method for detecting malicious code proposed in the present specification.

Specifically, obfuscated malicious code can be identified by generating a symbol-based malicious code identification model from n normal data items and n data items containing malicious code.

Specifically, to generate the malicious code identification model, the apparatus obtains from the user a specific number each of normal binary data (first data) and malicious code binary data (second data) (S4010).

Next, a data set for analyzing the malicious code is generated from each of the first data and the second data (S4020).

Here, each data set may be generated by the method of step 2 described with reference to FIG. 3 and may include a plurality of symbols arbitrarily listed from the first data or the second data.

Next, the apparatus analyzes, among the plurality of symbols included in each data set, the specific characteristic for identifying the malicious code (S4030).

Here, the specific characteristic may be analyzed by the method of step 3 described with reference to FIG. 3.

Next, the apparatus may perform, based on the specific characteristic, a statistical analysis for determining whether malicious code is present, and thereby determine whether the input data contains malicious code (S4040).

Here, the statistical analysis for determining whether malicious code is present may be performed by the method of step 4 described with reference to FIG. 3.

Experimental results and analysis for identifying malicious code

FIG. 5 is an example of a graph for comparing the string counts of malicious code as proposed in the present specification.

FIG. 5 shows the difference in distribution, for a specific symbol, between the data sets of normal data and data containing malicious code when 1,000 samples of each are used.

Specifically, 1,000 samples each of normal data and data containing malicious code are input as sample data, and the tools in Table 1 below are used for symbol string extraction, data set extraction, and statistical analysis.

[Table 1]
Process                     Tool
Symbol string extraction    Strings
Data set extraction         C-based program implementation
Statistical analysis        SPSS

First, for symbol characteristic analysis, the symbol string counts of the data sets extracted as described with reference to FIG. 3 are classified into arbitrary specific ranges, and the distributions of normal data and data containing malicious code are compared based on those ranges.

For example, the result for the character symbol '+' may be as shown in FIG. 5.

In FIG. 5, the x-axis is the range of classified symbol string counts and the y-axis is the number of files.

Here, a symbol whose distribution differs greatly between normal data and data containing malicious code may be selected as a symbol characteristic capable of distinguishing normal data from malicious data.

In addition, the existing data set may be processed for each symbol characteristic by converting its symbol string count data into the middle value of the classified range.

Then, the data set processed as a result of the symbol characteristic analysis may be applied to logistic regression analysis to carry out the statistical analysis.

Here, the SPSS (Statistical Package for Social Science) statistical analysis tool may be used for the statistical analysis.

Detecting malicious code by statistical analysis on 1,000 samples each of normal data and data containing malicious code yielded a detection rate of 87.6%.

In addition, normal data was misdetected as malicious code in only 10.6% of cases. The statistical analysis results are shown in Table 2 below.

[Table 2]
Malware detection rate    Normal false rate
87.6%                     10.6%

Referring to Table 2 above, it can be confirmed that the model, which statistically identifies malicious code based on the frequency of strings containing specific symbols, can identify malicious code with high probability.
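The two figures in Table 2 can be read as the share of malicious samples flagged as malicious and the share of normal samples wrongly flagged as malicious. The sketch below shows that calculation under this reading; the function name `rates` and the toy labels are hypothetical and are not the experiment's data.

```python
def rates(y_true, y_pred):
    """Return (malware detection rate, normal false rate) for 0/1 labels."""
    detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_malware = sum(1 for t in y_true if t == 1)
    n_normal = len(y_true) - n_malware
    return detected / n_malware, false_alarms / n_normal


# Toy labels: 1 = contains malicious code, 0 = normal (hypothetical values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
detection_rate, false_rate = rates(y_true, y_pred)
print(detection_rate, false_rate)  # 0.75 0.25
```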

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered in a descriptive sense and not for purposes of limitation. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Furthermore, although the drawings have been described separately for convenience of description, the embodiments described in the respective drawings may also be combined to implement a new embodiment.

In addition, the present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all kinds of recording devices that store data readable by a computer system. Examples of computer-readable media include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. The computer may also include the control unit 130 of the device. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: device 110: input unit
120: storage unit 130: control unit
140: output unit

Claims

A method for analyzing malicious code, performed in a device for analyzing malicious code, the method comprising:
obtaining a specific number of first data and second data from a user;
generating a data set for analyzing the malicious code from each of the first data and the second data,
wherein each of the data sets comprises a plurality of symbols arbitrarily listed from the first data or the second data;
analyzing, among the plurality of symbols, a specific characteristic for identifying the malicious code; and
performing statistical analysis to determine, based on the specific characteristic, whether malicious code is present,
wherein the first data is data that does not contain malicious code, and
the second data is data that contains malicious code.

The method of claim 1, wherein generating each data set comprises:
selecting the plurality of symbols;
listing the selected plurality of symbols; and
extracting, from each of the first data and the second data, the number of character strings containing each of the plurality of symbols,
wherein the data set is generated based on the number of the character strings.

The method of claim 2, wherein analyzing the specific characteristic comprises:
classifying the number of character strings extracted from each of the first data and the second data into specific ranges; and
selecting, from among the plurality of symbols, at least one symbol for which the difference between the number of first data and the number of second data included in a specific range is greater than or equal to a predetermined number,
wherein the specific characteristic consists of the at least one symbol.

The method of claim 1,
wherein the statistical analysis is performed using logistic regression analysis.

The method of claim 4, wherein performing the statistical analysis comprises:
selecting an independent variable and a dependent variable for use in the logistic regression analysis; and
applying the selected independent variable and the selected dependent variable to the logistic regression analysis,
wherein the independent variable represents a specific symbol for analyzing the malicious code, and
the dependent variable is a variable indicating whether malicious code is present.

A device for analyzing malicious code, the device comprising:
an input unit that obtains a specific number of first data and second data from a user;
a data processing unit that generates a data set for analyzing the malicious code from each of the first data and the second data,
wherein each of the data sets comprises a plurality of symbols arbitrarily listed from the first data or the second data;
an analysis unit that analyzes, among the plurality of symbols, a specific characteristic for identifying the malicious code; and
a determination unit that performs statistical analysis to determine, based on the specific characteristic, whether malicious code is present,
wherein the first data is data that does not contain malicious code, and
the second data is data that contains malicious code.

The device of claim 6, wherein the data processing unit:
selects the plurality of symbols,
lists the selected plurality of symbols, and
extracts, from each of the first data and the second data, the number of character strings containing each of the plurality of symbols,
wherein the data set is generated based on the number of the character strings.

The device of claim 7, wherein the analysis unit:
classifies the number of character strings extracted from each of the first data and the second data into specific ranges, and
selects, from among the plurality of symbols, at least one symbol for which the difference between the number of first data and the number of second data included in a specific range is greater than or equal to a predetermined number,
wherein the specific characteristic consists of the at least one symbol.

The device of claim 6,
wherein the statistical analysis is performed using logistic regression analysis.

The device of claim 9, wherein the analysis unit:
selects an independent variable and a dependent variable for use in the logistic regression analysis, and
applies the selected independent variable and the selected dependent variable to the logistic regression analysis,
wherein the independent variable represents a specific symbol for analyzing the malicious code, and
the dependent variable is a variable indicating whether malicious code is present.