KR101228899B1

KR101228899B1 - Method and Apparatus for categorizing and analyzing Malicious Code Using Vector Calculation

Info

Publication number: KR101228899B1
Application number: KR1020110013209A
Authority: KR
Inventors: 고흥환
Original assignee: 주식회사 안랩
Priority date: 2011-02-15
Filing date: 2011-02-15
Publication date: 2013-02-06
Also published as: KR20120093564A

Abstract

This technology is used to analyze and diagnose malicious code by calculating vector quantities. According to one embodiment of the present invention, a method for classifying and diagnosing malicious code using vector quantity calculation is provided, comprising: analyzing the binary code of an executable file; loading the analyzed binary code into memory in page units of a two-dimensional structure; identifying a branch instruction code (opcode) in the analyzed binary code; calculating a vector value that includes the distance and direction, on the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code branched to; calculating a matrix table using the calculated vector value and the identified branch instruction code; calculating a hash value from the matrix; and comparing the calculated hash value with a previously calculated hash value for a different executable file to determine whether they are identical. According to the present invention, executable files and malicious code of the same type can be diagnosed and classified efficiently, and signature resources can therefore be managed effectively.

Description

Method and Apparatus for Classifying and Diagnosing Malicious Code Using Vector Quantity Calculation {Method and Apparatus for categorizing and analyzing Malicious Code Using Vector Calculation}

The present invention relates to a method and apparatus for classifying and diagnosing malicious code. More specifically, it relates to a method and apparatus for classifying and diagnosing malicious code of the same type by analyzing the binary of an executable file.

Typically, modern malicious code is distributed as binaries (binary data) in the PE (Portable Executable) file format that are built with various compilers and run-time packed. In addition, a large number of malicious programs are regenerated from a single original binary into many different binaries, using automatic generators and the like, and then distributed. In other words, attempts are made to evade analysis of a specific malicious code by obfuscating (encrypting) one original binary with a variety of packers before distributing it.

At present, malicious code generated in this way must be handled by writing a dedicated polymorphic detection function (heuristic or generic) tailored to it, so analysis and response require considerable time and effort.

The present invention is intended to solve the above problems of the prior art. One object of the present invention is to provide a method and apparatus that calculate a hash value from an original malicious code binary by the method proposed herein, calculate a hash value from the binary of an executable file under inspection by the same method, and use these hash values to classify and diagnose malicious code.

Another object of the present invention is to provide a method and apparatus that, when a binary file is loaded into system memory and executed, detect the execution and diagnose and classify malicious code using the hash value of the binary file calculated by the method proposed herein.

Yet another object of the present invention is to provide a method and apparatus that diagnose and classify a binary file, before the binary file is executed, using the hash value of the binary file calculated by the method proposed herein.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

According to one embodiment of the present invention for achieving the above objects, a method for classifying and diagnosing malicious code using vector quantity calculation is provided, comprising: analyzing the binary code of an executable file; loading the analyzed binary code into memory in page units of a two-dimensional structure; identifying a branch instruction code (opcode) in the analyzed binary code; calculating a vector value that includes the distance and direction, on the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code branched to; calculating a matrix table using the calculated vector value and the identified branch instruction code; calculating a hash value from the matrix; and comparing the calculated hash value with a previously calculated hash value for a different executable file to determine whether they are identical.

In addition, a method is provided in which analyzing the binary code of the executable file comprises: detecting, through API hooking, whether the executable file is executed on a system; and, when execution of the executable file is detected through the API hooking, disassembling and analyzing the binary code of the executable file from the code region on the system where the executable file was executed.

In addition, a method is provided in which analyzing the binary code of the executable file comprises disassembling and analyzing the binary code of the executable file before the executable file is executed on a system.

In addition, a method is provided in which the executable file is run-time packed by a packer.

In addition, a method is provided in which analyzing the binary code of the executable file analyzes the binary code starting from the entry point of the executable file or from the location of a stub code API used by a compiler.

In addition, a method is provided in which the branch instruction code includes either a forced (unconditional) branch instruction or a conditional branch instruction.

In addition, a method is provided in which the distance is the distance from one binary position to another binary position on the page-unit memory of the two-dimensional structure, and the direction is expressed as either upward or downward on that memory according to the magnitude of the memory address values, the distance being expressed as the difference between the memory address values of the one binary position and the other binary position.

In addition, a method is provided in which the distance is the distance from one binary position to another binary position on the page-unit memory of the two-dimensional structure, and the direction is expressed as the two-dimensional direction of the other binary position relative to the one binary position on that memory, the distance and the direction being calculated from two-dimensional coordinates that represent the one binary position and the other binary position respectively.

In addition, a method is provided in which calculating the matrix table further comprises storing the calculated vector value and the identified branch instruction code in a buffer up to a predetermined capacity, and calculating the hash value comprises calculating the hash value using the values stored in the buffer.

In addition, a method is provided in which calculating the matrix table comprises storing, in a first buffer and in a byte format of a predetermined size, a set consisting of the direction of the calculated vector value and the identified branch instruction code used to calculate that vector value, and storing, in a second buffer and in a byte format of a predetermined size, the magnitude of the calculated vector value; and calculating the hash value comprises calculating the hash value using the values stored in the first buffer and the second buffer.

In addition, a method is provided that further comprises storing the calculated hash value in a hash data signature database.

In addition, a method is provided in which the previously calculated hash value is a hash value calculated from binary code determined to be malicious code and is stored in a hash data signature database, the method further comprising determining, when the comparison shows that the two hash values are identical, that the executable file is malicious code of the same type and family.

In addition, a method is provided in which the previously calculated hash value is used as any one of an identifier for a specific malicious code, an identifier for the packer that packed the executable file, and an identifier for the compiler that compiled the executable file.
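
The store-and-compare steps described above can be sketched as a simple lookup table. This is a minimal illustration only: the patent does not specify the database layout, so the dictionary, the function names `register_signature` and `diagnose`, and the placeholder hash strings and labels are all assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the "hash data signature database".
SignatureDb = Dict[str, str]


def register_signature(db: SignatureDb, hash_value: str, label: str) -> None:
    """Store a hash computed from a known sample together with its identifier."""
    db[hash_value] = label


def diagnose(db: SignatureDb, hash_value: str) -> Optional[str]:
    """Return the stored identifier if the hash is identical to a known one."""
    return db.get(hash_value)


db: SignatureDb = {}
# Placeholder values: real hashes would come from the code routing vector tables.
register_signature(db, "hash-of-known-sample", "malware:family-A")
register_signature(db, "hash-of-packer-stub", "packer:example-packer")

print(diagnose(db, "hash-of-known-sample"))
print(diagnose(db, "hash-of-unknown-file"))
```

Because the comparison is an exact match on the hash, the same table can hold identifiers for malicious code, packers, or compilers, as the text describes.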

In addition, a method is provided in which calculating the vector value comprises calculating vector values for the branch instruction codes sequentially, in the order in which the branch instruction codes in the binary code of the executable file are executed.

In addition, a method is provided in which, when a branch instruction code in the binary code of the executable file causes a branch out of the image region of the executable file's process, the next branch instruction subject to calculation is set to the next branch instruction within the image region.
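
One possible reading of the image-region rule can be sketched as a filter over an execution-ordered list of branches. The text does not say whether the branch that leaves the image is itself recorded, so this sketch assumes it is skipped. The function names, the `(source, target)` tuple layout, and the image base and size are hypothetical.

```python
from typing import Iterable, List, Tuple

Branch = Tuple[int, int]  # (address of branch instruction, branch target)


def in_image(address: int, image_base: int, image_size: int) -> bool:
    """True if the address lies inside the process image region."""
    return image_base <= address < image_base + image_size


def branches_to_measure(trace: Iterable[Branch],
                        image_base: int, image_size: int) -> List[Branch]:
    """Keep, in execution order, only branches that start and end in the image."""
    selected = []
    for source, target in trace:
        if not in_image(source, image_base, image_size):
            continue  # executed outside the image (e.g. library code)
        if not in_image(target, image_base, image_size):
            continue  # leaves the image; resume with the next in-image branch
        selected.append((source, target))
    return selected


# Hypothetical trace and image bounds, for illustration only.
trace = [
    (0x010073A5, 0x01007568),  # stays inside the image
    (0x01007580, 0x7C801000),  # jumps out of the image
    (0x7C801010, 0x7C801100),  # executed outside the image
    (0x010075A0, 0x01007400),  # next branch inside the image
]
selected = branches_to_measure(trace, image_base=0x01000000, image_size=0x14000)
print([(hex(s), hex(t)) for s, t in selected])
```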

According to another embodiment of the present invention, a computer-readable recording medium storing a computer program is provided, wherein the computer program performs the above-described method when executed on a computer.

According to yet another embodiment of the present invention, an apparatus for classifying and diagnosing malicious code using vector quantity calculation is provided, comprising: a code analysis unit that analyzes the binary code of an executable file in page units of a two-dimensional structure; a code identification unit that identifies a branch instruction code in the analyzed binary code; a vector calculation unit that calculates a vector value including the distance and direction, on the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code branched to; a matrix table calculation unit that calculates a matrix table using the calculated vector value and the identified branch instruction code; a hash value calculation unit that calculates a hash value from the matrix; and a comparison unit that compares the calculated hash value with a previously calculated hash value for a different executable file to determine whether they are identical.

In addition, an apparatus is provided that further comprises a hash data signature database for storing the calculated hash value.

The method for classifying and diagnosing malicious code according to the present invention makes it possible to effectively diagnose, classify, and guard against variants of malicious code of the same type and family.

The method also makes it possible to determine whether code is malicious even when no signature for it is held in advance, so that its behavior can be guarded against.

The method also enables efficient management of signature resources.

The method also makes it possible to classify, for example, files produced by a specific compiler or a specific packer.

FIG. 1 is a diagram illustrating an example of memory into which the binary code of an executable file has been disassembled and loaded in page units;
FIG. 2 is a diagram illustrating an example in which the binary code of an executable file is disassembled and the instruction codes are reconstructed;
FIG. 3 is a diagram illustrating an example of a code routing vector table according to one embodiment of the present invention;
FIG. 4 is a diagram illustrating the sequence of steps of a method for diagnosing and classifying malicious code according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of an apparatus for diagnosing and classifying malicious code according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a code routing vector calculation unit according to one embodiment of the present invention.

The features and advantages of the present invention, and the methods and systems for achieving them, will be understood more clearly by referring to the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various other forms. That is, the following embodiments are provided only to make the disclosure of the present invention sufficient and are not intended to limit its scope. Like reference numerals refer to like elements throughout this specification.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. It will be understood that combinations of the blocks of the accompanying block diagrams and the steps of the flowcharts may be performed by computer program instructions. These computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions, executed through the processor of the computer or other programmable data processing equipment, create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in that memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to create a computer-executed process, and the instructions that run on the equipment can provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative implementations the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or they may sometimes be performed in reverse order, depending on the functionality involved.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of memory into which the binary code of an executable file has been disassembled and loaded in page units.

As shown in FIG. 1, when the binary code of an executable file is disassembled and examined, the byte values are displayed in memory two-dimensionally, with each horizontal row holding a predetermined number of bytes (for example, 16 bytes). In the course of disassembling the binary code, the binary code can thus be loaded into system memory and its contents inspected. In FIG. 1, the first row starts at memory address 0100739D, and each row holds 16 bytes. In this way the binary code can be disassembled and each instruction code (operation code, opcode) that makes up the program can be identified. For example, looking at the shaded portion of FIG. 1, the value E8 BF 01 00 00 is found in memory; disassembled and interpreted as an instruction code, it is CALL 01007568, that is, an instruction that calls address 01007568 in memory.
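
The decoding of the shaded bytes can be reproduced with a few lines of Python. In x86, opcode E8 is a near CALL whose 4-byte operand is a signed little-endian displacement from the address of the following instruction. The sketch below assumes the E8 byte sits at 0x010073A4, the address at which the arithmetic yields the target stated in the text; the function name `decode_call_rel32` is hypothetical.

```python
import struct


def decode_call_rel32(code: bytes, address: int) -> int:
    """Return the absolute target of an x86 near CALL (opcode E8 + rel32)."""
    if len(code) != 5 or code[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 rel32 CALL")
    # rel32 is a signed little-endian displacement from the next instruction.
    (rel32,) = struct.unpack("<i", code[1:5])
    return (address + len(code) + rel32) & 0xFFFFFFFF


# Assumed opcode address (not stated at this point in the text).
CALL_ADDRESS = 0x010073A4
target = decode_call_rel32(bytes.fromhex("E8BF010000"), CALL_ADDRESS)
print(f"CALL {target:08X}")  # prints "CALL 01007568"
```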

In connection with the binary code analysis described with reference to FIG. 1, the present invention newly proposes a scheme that diagrams the flow of CPU processing of a target executable file by calculating, for the system's CPU processing of the executable file's binary code, the distance and direction (that is, the vector value) from the position of the current binary code to the binary position of the next code branched to. Hereinafter, this algorithm is referred to as the Code Routing Vector algorithm.

Referring again to FIG. 1 to describe the code routing vector algorithm more concretely: as described above, the binary code shaded in FIG. 1 is a branch instruction code, and in response to this instruction the CPU branches to memory address 0x01007568 and executes the next instruction at that address. Therefore, as the arrow in FIG. 1 shows, the relationship between the position of the current binary code (the branch instruction code) and the binary position of the next code branched to can be calculated as a distance and a direction, that is, as a vector. In analyzing the binary code of an executable file, the flow of CPU processing can thus be diagrammed as vectors. That is, branch instruction codes are identified to determine how the CPU processing of the target executable file flows, the next binary position reached by each branch instruction code is calculated as a vector value, and this calculation is repeated for the branch instruction codes in the target executable file, so that the flow of the CPU can be expressed as a series of vector values.

As is known to those of ordinary skill in the art (hereinafter, those skilled in the art), branch instruction codes include, for example, forced branch instructions such as CALL, JMP, and RET and conditional branch instructions such as JNZ. They are instruction codes that direct the CPU to branch to a specific address in memory.
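
As a rough illustration of identifying these two classes, the following sketch checks the first opcode byte or bytes of a 32-bit x86 instruction. The opcode values are standard x86 encodings, but the table is deliberately not exhaustive (it ignores prefixes, indirect FF /2 and FF /4 forms, far branches, and LOOP/JECXZ), and the function name `classify_branch` is hypothetical.

```python
from typing import Optional, Tuple

# Common forced (unconditional) branch opcodes in 32-bit x86.
FORCED = {
    0xE8: "CALL",  # call rel32
    0xE9: "JMP",   # jmp rel32
    0xEB: "JMP",   # jmp rel8
    0xC3: "RET",   # ret
    0xC2: "RET",   # ret imm16
}


def classify_branch(code: bytes) -> Optional[Tuple[str, str]]:
    """Return (class, mnemonic) for a branch instruction, or None otherwise."""
    if not code:
        return None
    first = code[0]
    if first in FORCED:
        return ("forced", FORCED[first])
    if 0x70 <= first <= 0x7F:  # Jcc rel8, e.g. 0x75 is JNZ
        return ("conditional", "Jcc")
    if first == 0x0F and len(code) > 1 and 0x80 <= code[1] <= 0x8F:
        return ("conditional", "Jcc")  # Jcc rel32
    return None


print(classify_branch(bytes.fromhex("E8BF010000")))
print(classify_branch(bytes.fromhex("7505")))
print(classify_branch(bytes.fromhex("6A70")))
```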

By identifying branch instruction codes in the binary code of the executable file in this way, and representing the flow of CPU processing that executes the program as vectors to the binary code positions to which the branch instructions branch, the characteristics of the target executable file can be extracted.

The extraction of the vector value in a code routing vector, that is, the calculation of the direction and distance (magnitude) values, is described as follows. For example, in memory code displayed aligned in 0x1000 page units, the code routing vector value can be determined from the memory address values of a first binary code (the branch instruction code) and a second binary code at the destination to which the branch moves.

According to one embodiment, the direction value can be calculated as a two-valued quantity, based on the address values of the first binary code and the second binary code, according to whether the branch moves to a higher address or to a lower address. The magnitude (distance) value can be set to the difference between the memory address values.
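
A minimal sketch of this address-difference embodiment follows. The labels "higher" and "lower", the function name, and the handling of a branch to the same address (treated here as "higher" with distance 0) are assumptions; the addresses are the ones given for FIG. 1.

```python
from typing import Tuple


def routing_vector_1d(source: int, target: int) -> Tuple[str, int]:
    """Return (direction, distance) between two memory addresses."""
    # Direction is two-valued: toward higher or toward lower addresses.
    direction = "higher" if target >= source else "lower"
    # Distance is the absolute difference of the two address values.
    distance = abs(target - source)
    return direction, distance


direction, distance = routing_vector_1d(0x010073A5, 0x01007568)
print(direction, hex(distance))  # prints "higher 0x1c3"
```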

According to another embodiment of the present invention, the calculation can be made by assigning arbitrary X and Y coordinates to the first binary code and the second binary code in the memory code displayed aligned in page units. More specifically, referring again to FIG. 1, the shaded CALL instruction code is the first binary code (the branch instruction code), and the starting point of this branch instruction code (0x010073A5 in FIG. 1) can be set as the reference point. The position branched to by this CALL instruction code, 0x01007568, is 3 positions along the horizontal axis (X axis) and 28 positions along the vertical axis (Y axis) from the reference point in the two-dimensionally arranged memory code of FIG. 1. That is, the branch can be mapped as a move from (0, 0) to (3, 28). Once the coordinates of the memory code are mapped two-dimensionally in this way, the direction and magnitude between the two coordinates can be expressed as a vector. Those skilled in the art who have read this specification will understand that the first binary code need not be set to (0, 0); any coordinate values from which the vector between the two coordinates can be obtained may be used.
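
The (3, 28) displacement can be checked numerically. The sketch below lays addresses out in 16-byte rows starting at the first-row address given for FIG. 1 (0x0100739D) and subtracts the two coordinates. Expressing magnitude as Euclidean length and direction as an angle is an assumption of this example, since the text only says both are calculated from the coordinates; the function names are hypothetical.

```python
import math
from typing import Tuple

ROW_BYTES = 16            # bytes per displayed row, as in FIG. 1
GRID_BASE = 0x0100739D    # address of the first byte of the first row


def to_xy(address: int, base: int = GRID_BASE) -> Tuple[int, int]:
    """Map an address to (column, row) in the 16-byte-wide grid."""
    offset = address - base
    return offset % ROW_BYTES, offset // ROW_BYTES


def routing_vector_2d(source: int, target: int) -> Tuple[int, int, float, float]:
    """Return (dx, dy, magnitude, angle in degrees) from source to target."""
    sx, sy = to_xy(source)
    tx, ty = to_xy(target)
    dx, dy = tx - sx, ty - sy
    return dx, dy, math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


dx, dy, magnitude, angle = routing_vector_2d(0x010073A5, 0x01007568)
print(dx, dy)  # prints "3 28"
```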

In this code routing vector analysis, the memory code of the binary code is displayed aligned in predetermined page units (for example, 0x1000 page units), which ensures that the direction and magnitude of each branch are expressed identically on any system. The vector values calculated for the same executable file are therefore constant, and the CPU processing flow of that executable file can be identified.

FIG. 2 is a diagram illustrating an example in which the binary code of an executable file is disassembled and the instruction codes are reconstructed.

Referring to FIG. 2, the instruction codes shown in bold are (forced) branch instruction codes, and vector quantities can be calculated from them by the method described above. For example, the instruction marked ① is a CALL instruction that branches to the binary code at address 0x01007568.

FIG. 3 is a diagram illustrating an example of a code routing vector table according to one embodiment of the present invention.

As described above, once the magnitude and direction of a branch caused by a branch instruction code, that is, the vector value, have been quantified, the resulting values are stored in a predetermined buffer. In FIG. 3, the instruction code (opcode) and the direction value of the vector are entered into one buffer as a set (the shaded portion of the left buffer), and the magnitude of the vector is entered into another buffer (the shaded portion of the right buffer). In this way a vector value is calculated for each branch instruction, and the branch instruction code, the vector direction, and the vector magnitude are entered into the buffers as variables. Although FIG. 3 shows the variables stored in two buffers, the present invention is not limited to this embodiment. For example, the three variables (instruction code, direction, magnitude) may be stored in a single buffer, and the order in which the variables are stored may be varied. Likewise, although FIG. 3 shows 2 bytes used for the instruction code and direction and 4 bytes used for the vector magnitude, those skilled in the art who have read this specification will understand that the invention is not limited to these sizes.
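
The two-buffer layout can be sketched with Python's `struct` module. The text fixes only the sizes shown in FIG. 3 (2 bytes for opcode and direction, 4 bytes for magnitude); the byte order, the encoding of direction as a 0/1 flag, and the function name `append_entry` are assumptions of this example.

```python
import struct

DIR_LOWER = 0   # assumed flag value: branch toward a lower address
DIR_HIGHER = 1  # assumed flag value: branch toward a higher address


def append_entry(routing_buf: bytearray, size_buf: bytearray,
                 opcode: int, direction: int, magnitude: int) -> None:
    """Append one branch record to the routing table and the size table."""
    # 2 bytes: opcode byte followed by the direction flag.
    routing_buf.extend(struct.pack("<BB", opcode, direction))
    # 4 bytes: vector magnitude as an unsigned little-endian integer.
    size_buf.extend(struct.pack("<I", magnitude))


routing_buf = bytearray()
size_buf = bytearray()
# The CALL from FIG. 1, using the address difference 0x1C3 as its magnitude.
append_entry(routing_buf, size_buf, 0xE8, DIR_HIGHER, 0x1C3)
print(routing_buf.hex(), size_buf.hex())  # prints "e801 c3010000"
```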

While the binary code is analyzed, vector values are calculated sequentially for the branch instructions and stored in a buffer in the form of a matrix table. Once the buffer has been filled to a predetermined capacity (for example, 1 KB or 2 KB), a hash value is calculated from the values stored in the buffer. The hash value calculated in this way can be used as an identifier for the target executable file. In the embodiment shown in FIG. 3, the hash value is calculated from two buffers serving as a code routing table and a vector magnitude table (that is, a buffer storing instruction codes and direction values, and a buffer storing magnitude values), but the present invention is not limited to this embodiment. For example, two hash values may be calculated and then combined into a single hash value; alternatively, the instruction code, direction, and magnitude values may be stored in a single buffer and the hash value calculated from that one buffer.
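As a rough illustration of deriving one hash value from the two buffers, the sketch below feeds the routing buffer and then the magnitude buffer into a single digest. SHA-256 is an assumption made here; the specification does not name a hash algorithm, and the function name `table_hash` is hypothetical.

```python
import hashlib

def table_hash(routing_buf: bytes, magnitude_buf: bytes) -> str:
    """Hash the code routing table followed by the vector magnitude table."""
    h = hashlib.sha256()  # assumed algorithm, not stated in the specification
    h.update(routing_buf)
    h.update(magnitude_buf)
    return h.hexdigest()

# Two entries: (opcode, direction) pairs and their 4-byte magnitudes.
digest = table_hash(b"\xe8\x01\xeb\x00", b"\x00\x01\x00\x00\x80\x00\x00\x00")
```

Hashing each buffer separately and combining the two digests, or hashing one interleaved buffer, would correspond to the alternatives mentioned in the text.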

When a hash value is calculated in this way for a specific executable file (for example, malicious code) using the code routing vector, the hash value can be used as a signature that identifies the executable file. Therefore, even if variants of the original binary code exist that have been obfuscated by various packers, the processing flow performed by the CPU after a variant is executed and loaded into memory can be expected to be the same, so the variant code can be classified and diagnosed.

In addition, even when the variant binary code has not been loaded into memory and executed, it can be expected to contain binary code of the same form up to a certain size if it was obfuscated by the same packer or compiled by the same compiler. In other words, even without actually executing the variant binary code, it is possible to classify files obfuscated by the same packer or files compiled by the same compiler.

That is, by using the code routing vector algorithm described above, malicious code of the same type and family can be diagnosed and classified effectively, and malicious code that is not yet known to antivirus vendors can also be diagnosed and classified, so that signature resources can be managed efficiently. It is also possible to classify executable files, including malicious code, according to a specific criterion, for example whether they were built with the same compiler or the same packer.

Hereinafter, an embodiment of the present invention to which the above-described code routing vector is applied will be described.

FIG. 4 is a diagram illustrating the sequence of steps of a method for diagnosing and classifying malicious code according to an embodiment of the present invention.

First, in step S400, the binary code of the target executable file is analyzed. According to one embodiment, analyzing the binary code includes detecting through API hooking whether the executable file (for example, malicious code) is being executed on the system and, if execution is detected, disassembling and analyzing the binary code of the executable file from the code region on the system in which the executable file is running. That is, it is detected whether the binary code has been loaded into memory and executed, and the code in memory is then analyzed. When the executable file is loaded in memory, an executable obfuscated by any of various packers has been unpacked and is about to begin actual execution on the system; thus even a variant of a piece of malicious code ultimately begins executing in the form of the original malicious code at this point. Alternatively, even when the target executable file has not been executed on the system, the binary code may be disassembled and analyzed in its unexecuted state. In this case the disassembled code may differ from the original binary code, but, as described above, files obfuscated by the same packer or compiled by the same compiler share the same pattern in a certain portion of their binary code, and this feature may make diagnosis and classification of the file possible.

The reference point at which analysis of the binary code begins may be the entry point of the executable file, or the binary code may instead be analyzed starting from the location of a stub code API used by the compiler.

Next, in step S410, the analyzed binary code is examined in the order in which it is executed by the CPU, and branch instruction codes are identified. Specifically, the binary code is loaded into memory in page units of a two-dimensional structure, for example expressed in 0x1000 page units as shown in FIG. 1, and branch instruction codes are identified from it: forced (unconditional) branch instructions such as CALL, JMP, and RET, or conditional branch instructions such as JNZ.
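The distinction between forced and conditional branch instructions can be sketched as a small classifier over disassembler mnemonics. This is only an illustration under assumptions: it presumes x86 mnemonics already produced by some disassembler, treats any mnemonic beginning with "j" other than JMP as conditional, and ignores other control-transfer instructions; the function name `branch_kind` is hypothetical.

```python
from typing import Optional

# Forced (unconditional) branch mnemonics named in the description.
FORCED = {"call", "jmp", "ret"}

def branch_kind(mnemonic: str) -> Optional[str]:
    """Classify a mnemonic as 'forced', 'conditional', or None (not a branch)."""
    m = mnemonic.lower()
    if m in FORCED:
        return "forced"
    if m.startswith("j"):  # jnz, je, jb, ... (assumed x86 naming)
        return "conditional"
    return None

kinds = [branch_kind(m) for m in ("CALL", "mov", "JNZ", "ret")]
```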

Next, in step S420, a code routing vector value is calculated that includes the distance and direction, in the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code that is branched to. That is, in order to identify the processing flow of the target executable file on the CPU, the distance and direction from the binary position of the branch instruction code to the binary position of the next code it branches to are calculated. As discussed above, the distance value of the code routing vector may be expressed as the difference between the memory addresses of the two binary positions, or the two binary positions may each be expressed as two-dimensional coordinates and the distance calculated mathematically from those coordinates. Similarly, the direction value may be expressed simply as one of two values, upward or downward, according to the relative size of the memory addresses, or it may be calculated mathematically from the two-dimensional coordinates of the two binary positions.
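Both ways of computing the vector can be sketched in a few lines of Python. The mapping used here is an assumption: an address is placed on row `address // 0x1000` and column `address % 0x1000`, a branch to a lower address is called "up", and the two-dimensional direction is expressed as an angle from `atan2`. The specification describes the two approaches but does not prescribe these details, and the function names are hypothetical.

```python
import math

PAGE = 0x1000  # row width, assumed from the 0x1000 page unit

def simple_vector(src: int, dst: int) -> tuple[str, int]:
    """Direction as up/down by address order, distance as address difference."""
    direction = "up" if dst < src else "down"
    return direction, abs(dst - src)

def planar_vector(src: int, dst: int, page: int = PAGE) -> tuple[float, float]:
    """Direction (radians) and Euclidean distance between 2-D page coordinates."""
    src_row, src_col = divmod(src, page)
    dst_row, dst_col = divmod(dst, page)
    dx, dy = dst_col - src_col, dst_row - src_row
    return math.atan2(dy, dx), math.hypot(dx, dy)

direction, distance = simple_vector(0x401000, 0x401234)
angle, length = planar_vector(0x1003, 0x5000)
```

The simple form keeps the table compact, while the planar form separates movement within a page from movement across pages.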

Next, in step S430, a code routing vector matrix table is produced using the calculated vector value (direction, distance) and the branch instruction code for which that vector value was calculated. That is, as shown in FIG. 3, the code routing vector matrix table is generated by storing the instruction code, the direction of the vector, and the magnitude of the vector in a predetermined buffer. As described above, the matrix table may be built in a single buffer, or two buffers may be created, with the instruction code and the direction of the vector stored as one set in one buffer and the magnitude of the vector stored in the other; other combinations are also possible.

The calculated vector values and the branch instruction codes for which they were calculated are stored in the buffer up to a predetermined capacity. That is, code routing vectors are calculated in the order in which the CPU executes the binary code of the target executable file, and this is repeated until the buffer has been filled to the set capacity (S440). It will be apparent to those skilled in the art that the buffer size can be changed to suit the circumstances.
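The repetition of S420 through S440 can be sketched as a loop that appends one entry per branch until the capacity is reached. The sketch assumes the branches are already available as `(opcode, source, target)` tuples in execution order, that the capacity applies to the two buffers combined, and that direction is encoded as 0 for a lower target address and 1 otherwise; `build_table` and these encodings are hypothetical.

```python
import struct

ENTRY_SIZE = 6  # 2 bytes (opcode, direction) + 4 bytes (magnitude)

def build_table(branches, capacity: int = 1024) -> tuple[bytes, bytes]:
    """Fill the routing and magnitude buffers until `capacity` bytes are used."""
    routing, sizes = bytearray(), bytearray()
    for opcode, src, dst in branches:
        if len(routing) + len(sizes) + ENTRY_SIZE > capacity:
            break  # buffer full: stop tracing further branches
        direction = 0 if dst < src else 1  # assumed encoding: 0 = up, 1 = down
        routing += struct.pack("<BB", opcode & 0xFF, direction)
        sizes += struct.pack("<I", abs(dst - src) & 0xFFFFFFFF)
    return bytes(routing), bytes(sizes)

trace = [(0xE8, 0x1000, 0x1100), (0xEB, 0x1100, 0x1080)]
routing_buf, size_buf = build_table(trace)
```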

Next, in step S450, a hash value (hash code) is calculated from the generated matrix table. A hash function, or hash algorithm, is a method of producing a kind of electronic fingerprint from arbitrary data. The hash value calculated in this way may be stored in a hash data signature database and later used as a database for malicious code inspection.

Next, in step S460, the hash value of the target executable file is compared with hash values that were previously calculated and stored, to determine whether they are the same. Identical hash values mean that the original binary code is the same, so the file can be diagnosed and classified as a binary of the same type.
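The store-and-compare steps (S450 and S460) amount to an exact-match lookup. A minimal sketch, assuming an in-memory dictionary stands in for the hash data signature database and that each stored hash carries a free-form label; `register`, `classify`, and the labels are hypothetical names for illustration.

```python
from typing import Optional

def register(hash_value: str, label: str, signatures: dict[str, str]) -> None:
    """Store a newly calculated hash with its label unless already present."""
    signatures.setdefault(hash_value, label)

def classify(hash_value: str, signatures: dict[str, str]) -> Optional[str]:
    """Return the label of an identical stored hash, or None if unseen."""
    return signatures.get(hash_value)

db: dict[str, str] = {}
register("3f2a", "sample-family-A", db)
match = classify("3f2a", db)
```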

The code routing vector algorithm described above may, for example, emulate the executable file and calculate the vector values for the branch instruction codes sequentially, in the order in which the branch instruction codes in the binary code of the executable file are executed. In this way, the code actually processed by the CPU can be followed in sequence. Considering that even a variant obfuscated by various packers is decoded into the original binary code when it actually runs in a process, this can be useful for diagnosing and classifying variant malicious code. However, a branch instruction may lead outside the image area of the executable file's process, for example when resources in a system-wide shared area are used. In such a case, the code routing vector value is still calculated, but the code outside the process image area is not traced any further, and analysis may continue with the next branch instruction code inside the process image area.
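The handling of branches that leave the process image can be sketched as a decision about where tracing continues. The assumption made here is that, when the target lies outside the image, tracing resumes at the instruction following the branch (the fall-through address); the specification only says that analysis continues with the next branch instruction inside the image area. The function name and parameters are hypothetical.

```python
def next_trace_address(target: int, fall_through: int,
                       image_base: int, image_size: int) -> int:
    """Follow the branch target only if it stays inside the process image."""
    inside = image_base <= target < image_base + image_size
    # Outside the image (e.g. a system DLL): do not follow, resume after the branch.
    return target if inside else fall_through

# Image assumed at 0x400000 with size 0x10000.
inner = next_trace_address(0x401200, 0x401005, 0x400000, 0x10000)
outer = next_trace_address(0x7C800000, 0x401005, 0x400000, 0x10000)
```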

FIG. 5 is a schematic diagram of an apparatus for diagnosing and classifying malicious code according to an embodiment of the present invention, and FIG. 6 is a schematic diagram of a code routing vector calculation unit according to an embodiment of the present invention.

The apparatus according to an embodiment of the present invention shown in FIG. 5 includes a code analysis unit 500 that analyzes the binary code of an executable file in page units of a two-dimensional structure, a code routing vector calculation unit 510 that calculates a code routing vector and a hash value from the analyzed code, a database 530 that stores the calculated hash value, and a hash comparison unit 520 that compares hash values.

As shown in FIG. 6, the code routing vector calculation unit 510 may further include a code identification unit 511 that identifies branch instruction codes in the analyzed binary code; a vector calculation unit 512 that calculates a vector value including the distance and direction, in the page-unit memory of the two-dimensional structure, from the binary position of an identified branch instruction code to the binary position of the instruction code that is branched to; a matrix table calculation unit 513 that produces a matrix table using the calculated vector value and the identified branch instruction code; and a hash value calculation unit 514 that calculates a hash value from the matrix.

More specifically, the code analysis unit 500 analyzes the binary code of the executable file under analysis. According to one embodiment, the code analysis unit 500 detects through API hooking whether the executable file (for example, malicious code) is being executed on the system and, if execution is detected, disassembles and analyzes the binary code of the executable file from the code region on the system in which the executable file is running. Alternatively, even when the target executable file has not been executed on the system, the code analysis unit 500 may disassemble and analyze the binary code in its unexecuted state.

The code identification unit 511 examines the binary code in the order in which it is executed by the CPU and identifies branch instruction codes. Specifically, the binary code is loaded into memory in page units of a two-dimensional structure, for example expressed in 0x1000 page units as shown in FIG. 1, and branch instruction codes are identified from it: forced (unconditional) branch instructions such as CALL, JMP, and RET, or conditional branch instructions such as JNZ.

The vector calculation unit 512 calculates a code routing vector value that includes the distance and direction, in the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code that is branched to. The specific algorithms that can be used for this calculation are as described above; they are not repeated here and will be understood by those skilled in the art who have read this specification.

The matrix table calculation unit 513 produces a code routing vector matrix table using the calculated vector value (direction, distance) and the branch instruction code for which that vector value was calculated. The details of the matrix table are as described above; they are not repeated here and will be understood by those skilled in the art who have read this specification.

The hash value calculation unit 514 calculates a hash value (hash code) from the generated matrix table.

The hash comparison unit 520 receives previously calculated and stored hash values from the database 530 and compares them with the hash value of the target executable file to determine whether they are the same.

Embodiments of the present invention may be implemented in the form of a program executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the computer software art. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal wire or a waveguide, including a carrier wave that transmits a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be embodied in various other forms without changing its technical spirit or essential features. The embodiments described above are therefore illustrative in all respects and should not be understood as limiting.

500: code analysis unit 510: code routing vector calculation unit 520: hash comparison unit 530: DB
511: code identification unit 512: vector calculation unit 513: matrix table calculation unit
514: hash value calculation unit

Claims

In the classification and diagnosis method of malicious code using vector quantity calculation,
Analyzing the binary code of the executable,
Loading the analyzed binary code into memory in units of pages of a two-dimensional structure;
Identifying a branch instruction code (Opcode) in the analyzed binary code,
Calculating a vector value including the distance and direction, on the page-unit memory of the two-dimensional structure, from the binary position of the identified branch instruction code to the binary position of the instruction code that is branched to;
Calculating a matrix table using the calculated vector value and the identified branch instruction code;
Calculating a hash value from the matrix;
And comparing the calculated hash value with a previously calculated hash value for a different executable file to determine whether they are the same.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
Analyzing the binary code of the executable file,
Detecting whether the executable file is executed on a system through API hooking;
If it is detected that the executable file is executed through the API hooking, disassembling and analyzing the binary code of the executable file from a code region on the system on which the executable file is executed;
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1, wherein
Analyzing the binary code of the executable file,
Before the executable is executed on the system, disassembling and analyzing the binary code of the executable file.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The executable file is an executable file that has been packed (executable-compressed) by a packer.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
Analyzing the binary code of the executable file,
Analyzing the binary code from an entry point location of the executable file or from a stub code API location used by a compiler;
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The branch instruction code includes one of a forced branch instruction or a conditional branch instruction.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The distance is a distance from one binary position to another binary position on the page-unit memory of the two-dimensional structure, and the direction is expressed as one of an upward direction or a downward direction according to the size of the memory address value on the page-unit memory of the two-dimensional structure,
Wherein the distance is expressed as the difference between the memory address value of the one binary position and that of the other binary position,
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The distance is a distance from one binary position to another binary position on the page-unit memory of the two-dimensional structure, and the direction is expressed as the two-dimensional direction of the other binary position relative to the one binary position on the page-unit memory of the two-dimensional structure,
Wherein, for the distance and the direction,
The one binary position and the other binary position are each represented by two-dimensional coordinates, and the distance and the direction are calculated from these coordinates.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
Computing the matrix table
Storing the calculated vector value and the identified branch instruction code in a buffer up to a predetermined size;
The calculating of the hash value may include calculating the hash value by using a value stored in the buffer.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
Computing the matrix table
Storing in the first buffer a direction of the calculated vector value and a set of identified branch instruction codes used to calculate the vector value in a byte format of a predetermined size;
Storing the calculated size of the vector value in a second buffer in a byte format having a predetermined size;
Including,
The calculating of the hash value may include calculating the hash value by using values stored in the first buffer and the second buffer.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
And storing the calculated hash value in a hash data signature database.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The calculated hash value is a hash value calculated from a binary code determined to be malicious code, and is stored in a hash data signature database.
Further comprising, if the comparison shows the hash values to be the same, determining that the executable file is malicious code of the same type and family.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
The calculated hash value is used as one of an identifier for a specific malicious code, an identifier for a packer for compressing the executable file, and an identifier for a compiler for compiling the executable file.
Classification and diagnosis method of malicious code using vector quantity calculation.

The method of claim 1,
Computing the vector value
According to the order in which the branch instruction code on the binary code of the executable file is executed, the vector values for the branch instruction code are sequentially calculated.
Classification and diagnosis method of malicious code using vector quantity calculation.

15. The method of claim 14,
If the branch instruction code on the binary code of the executable file is a branch instruction code that leaves the image area of the process of the executable file, the next calculation target branch instruction is set to the next branch instruction in the image area.
Classification and diagnosis method of malicious code using vector quantity calculation.

A computer readable recording medium having a computer program stored thereon,
The computer program, when executed on a computer, performs the method according to any one of claims 1 to 15.
Computer-readable recording media.

In the apparatus for classifying and diagnosing malicious codes using vector quantity calculation,
A code analysis unit that analyzes binary code of an executable file in units of pages of a two-dimensional structure,
A code identifier for identifying a branch instruction code in the analyzed binary code,
A vector calculation unit for calculating a vector value including a distance and a direction on a page unit of a memory of a two-dimensional structure from a binary position of the branch instruction code identified to a binary position of a branching instruction code;
A matrix table calculator configured to calculate a matrix table using the calculated vector value and the identified branch instruction code;
A hash value calculator for calculating a hash value from the matrix;
And a comparison unit for determining whether the calculated hash value is the same as a previously calculated hash value for a different executable file;
Malicious code classification and diagnosis device using vector quantity calculation.

The apparatus of claim 17,
Further comprising a hash data signature database for storing the calculated hash value.
Malicious code classification and diagnosis device using vector quantity calculation.