KR20220022170A

KR20220022170A - System and method for analyzing malware in application

Info

Publication number: KR20220022170A
Application number: KR1020200103033A
Authority: KR
Inventors: 곽진; 최슬기; 김득훈; 정해선
Original assignee: 아주대학교산학협력단
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2022-02-25
Also published as: KR102392559B1

Abstract

According to one aspect of the present disclosure, a system for analyzing malicious code in an application comprises: a control flow graph (CFG) generator which generates a CFG including loop information from executable code of an application; an API call graph generator which generates an API call graph representing an API call flow from the generated CFG; a feature vector generator which generates a feature vector for each of one or more API blocks included in the generated API call graph; and a malicious code analyzer which analyzes malicious code of the application from the generated one or more feature vectors.

Description

SYSTEM AND METHOD FOR ANALYZING MALWARE IN APPLICATION

The technical idea of the present disclosure relates to a system and method for analyzing malicious code in an application.

Malicious code in applications continues to evolve by recycling existing malicious code or adding new malicious functions. Classifying malicious code into families, rather than merely detecting it, can help in understanding the behavior and risk level of new or variant malicious code and in responding to its attacks.

Methods for analyzing malicious code include static analysis and dynamic analysis. Dynamic analysis analyzes malicious code while the application is actually running, and is therefore limited to the behavior that is actually executed. Static analysis, by contrast, does not run the application but analyzes the instructions and code contained in it as a whole, which has the advantage of covering all possible behaviors that the malicious code can perform.

For example, static-analysis-based classification techniques for malicious code in Android applications include opcode-based classification and control flow graph (CFG)-based classification. Because a CFG contains execution flow information such as call relationships between APIs and code branches, CFG-based classification can achieve higher accuracy than opcode-based classification when malicious code is classified by behavior.

However, conventional CFG-based classification techniques analyze the behavior of malicious code without considering loops, so the accuracy of behavior analysis drops sharply when the loop information itself indicates malicious behavior.

One problem to be solved by the present invention is to provide a system and method capable of analyzing malicious code in an application by using loop information.

Another problem to be solved by the present invention is to provide a malicious code analysis system and method capable of detecting the presence and characteristics of malicious code even for new or variant malicious code.

To achieve the above objects, an application malicious code analysis system according to one aspect of the technical idea of the present disclosure includes: a CFG generator that generates, from executable code of an application, a control flow graph (CFG) including loop information; an API call graph generator that generates, from the generated CFG, an API call graph representing an API call flow; a feature vector generator that generates a feature vector for each of at least one API block included in the generated API call graph; and a malicious code analyzer that analyzes malicious code of the application from the generated at least one feature vector.

According to an embodiment, the CFG may include at least one basic block delimited on the basis of branches in the executable code of the application, and each of the at least one basic block may include at least one of information on whether the block is included in a loop and information on a loop type.

According to an embodiment, the loop type may be defined by at least one of whether the loop has an end point and the type of change of a register value related to the repetition of the loop.

According to an embodiment, each of the at least one basic block may further include at least one of its own index, the index of the previously connected basic block, and the index of the next connected basic block.

According to an embodiment, the API call graph generator may generate the API call graph by connecting, among the at least one basic block included in the CFG, at least one API block in which an API call is performed.

According to an embodiment, each of the at least one API block may include at least one of information on whether the block is included in a loop, the index of the API block corresponding to the starting point of the loop when the block is included in a loop, and a loop type.

According to an embodiment, each of the at least one API block may further include at least one of its own index, the index of the previously connected API block, and the index of the next connected API block.

According to an embodiment, the feature vector generator may perform feature hashing on the information included in each of the at least one API block, thereby generating a feature vector of a preset length corresponding to each of the at least one API block.

According to an embodiment, the malicious code analyzer may include an artificial-intelligence-based neural network that classifies, from the at least one feature vector, the presence or absence of malicious code and the type of malicious code, and the neural network may include an LSTM.

An application malicious code analysis method according to one aspect of the technical idea of the present disclosure includes: generating, from executable code of an application, a control flow graph (CFG) including at least one basic block and loop information for each of the at least one basic block; generating, from the generated CFG, an API call graph representing an API call flow; generating a feature vector for each of at least one API block included in the generated API call graph; and analyzing malicious code of the application from the generated at least one feature vector.

According to the technical idea of the present disclosure, the application malicious code analysis system generates a CFG including loop information from the executable code of an application, generates feature vectors based on the generated CFG, and classifies malicious code, so that even malicious behavior carried out through loops can be detected.

In addition, because the application malicious code analysis system is implemented to classify malicious code from the feature vectors through a neural network such as an LSTM, it can effectively detect the presence and characteristics of new or variant malicious code as well.

The effects of the technical idea of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

A brief description of each drawing is provided so that the drawings cited in the present disclosure may be more fully understood.
FIG. 1 is a schematic block diagram of an application malicious code analysis system according to an exemplary embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating an application malicious code analysis method according to an exemplary embodiment of the present disclosure.
FIG. 3 is an exemplary diagram showing the control flow of an application.
FIG. 4 shows a CFG generated from the application according to the embodiment of FIG. 3.
FIG. 5 shows an API call graph generated based on the CFG of FIG. 4.
FIG. 6 is a table explaining an example of a method of generating API feature vectors from an API call graph.
FIG. 7 is a diagram explaining an implementation example of a malicious code analyzer that analyzes and classifies malicious code from the generated API feature vectors.

Exemplary embodiments according to the technical idea of the present disclosure are provided to explain that idea more completely to those of ordinary skill in the art. The embodiments below may be modified into various other forms, and the scope of the technical idea of the present disclosure is not limited to them. Rather, these embodiments are provided to make the present disclosure more thorough and complete and to fully convey the technical idea of the present invention to those skilled in the art.

Although terms such as first and second are used in the present disclosure to describe various members, regions, layers, parts, and/or components, these members, regions, layers, parts, and/or components should obviously not be limited by these terms. The terms do not imply any particular order, position, or superiority, and are used only to distinguish one member, region, part, or component from another. Accordingly, a first member, region, part, or component described below may refer to a second member, region, part, or component without departing from the teachings of the technical idea of the present disclosure. For example, without departing from the scope of the present disclosure, a first component may be named a second component, and similarly a second component may be named a first component.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the concepts of the present disclosure belong. In addition, commonly used terms as defined in dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an overly formal sense unless explicitly so defined herein.

Where an embodiment can be implemented differently, a specific process order may differ from the order described. For example, two processes described in succession may be performed substantially simultaneously, or may be performed in the reverse of the order described.

In the accompanying drawings, variations of the illustrated shapes may be expected depending, for example, on manufacturing technology and/or tolerances. Accordingly, embodiments according to the technical idea of the present disclosure should not be construed as limited to the specific shapes of the regions illustrated in the present disclosure, and should include, for example, changes in shape resulting from the manufacturing process. The same reference numerals are used for the same components in the drawings, and duplicate descriptions of them are omitted.

As used herein, the term 'and/or' includes each of the mentioned members and every combination of one or more of them.

Hereinafter, embodiments according to the technical idea of the present disclosure are described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of an application malicious code analysis system according to an exemplary embodiment of the present disclosure.

The application malicious code analysis system 100 (hereinafter, 'malicious code analysis system') may be implemented as at least one computing device. The components 110, 120, 130, and 140 shown in FIG. 1 may all be implemented in a single computing device or may be distributed across a plurality of computing devices. For example, the at least one computing device may include a mobile terminal such as a smartphone or tablet PC, but is not limited thereto and may also include a fixed terminal such as a PC or a server.

In the following, the malicious code analysis system 100 is described on the assumption that it analyzes malicious code in applications running on the Android OS. However, the embodiments of the present disclosure are not limited to applications running on the Android OS, and may of course be applied to applications of various other operating systems (Windows, Linux, etc.) to the extent that they can be readily adapted by those of ordinary skill in the art.

Meanwhile, the malicious code analysis system 100 according to an embodiment of the present disclosure analyzes malicious code on the basis of static analysis, and can therefore analyze all possible behaviors that the malicious code can perform. Static analysis techniques for Android applications include opcode-based classification and control flow graph (CFG)-based classification. Because a CFG contains execution flow information such as call relationships between APIs and code branches, it can yield high accuracy when malicious code is classified by behavior.

Referring to FIG. 1, the malicious code analysis system 100 may include a CFG generator 110, an application programming interface (API) call graph generator 120, a feature vector generator 130, and a malicious code analyzer 140. The malicious code analysis system 100 according to an embodiment of the present disclosure is not necessarily limited to these components and may include more or fewer components. Each of the components 110, 120, 130, and 140 included in the malicious code analysis system 100 may be implemented as hardware, software, or a combination thereof.

The CFG generator 110 may extract and generate a control flow graph (CFG) from the executable code (source code) of an application. The CFG consists of basic blocks, each being the smallest unit of instructions within which no code branch occurs, and when a basic block branches to another block the two blocks may be connected by an edge. An index may be assigned to each basic block, and the indexes of the blocks branching from the initial basic block may be set to increase sequentially.

When code branches back to other code that was executed earlier, a loop is formed, and loops and their iteration counts are major factors that reflect the code execution flow. However, conventional CFG-based malicious code classification techniques for Android applications generate the CFG without considering loops when analyzing the behavior of malicious code, so the accuracy of behavior analysis may decrease when the loop information itself indicates malicious behavior.

The CFG generator 110 according to an embodiment of the present disclosure may use the CLAPP (characterizing loops in android application) framework to extract and generate a CFG including loop information from the machine-language executable code produced when the application is decompiled. The CLAPP framework may extract a CFG from the Smali code, which is that machine-language executable code, and define the types of the loops in the extracted CFG.

For example, the loop type may be defined as 'infinite' or 'finite' depending on whether the loop has an end point. In addition, for a finite loop, the loop type may be further defined according to the type of change of the register value that determines the number of loop iterations. For example, the type of change of the register value may be defined as 'fixed', 'increasing', 'decreasing', 'bounded', or 'unknown', the last being the case where the register value is determined by network input and therefore cannot be known.

In summary, the CFG generated by the CFG generator 110 according to an embodiment of the present disclosure may include at least one basic block. Each of the at least one basic block may be configured to include information such as its own index, the index of the previously called basic block, the index of the basic block after the branch (the next one called), whether it is included in a loop, and/or the loop type.
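The per-block record described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `BasicBlock`, `LoopTermination`, `RegisterChange`, and `loop_type` are hypothetical and do not come from the disclosure or from the CLAPP framework; only the field meanings and the loop-type labels are taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class LoopTermination(Enum):
    """Whether the loop has an end point (labels taken from the text)."""
    INFINITE = "infinite"
    FINITE = "finite"


class RegisterChange(Enum):
    """Change type of the register value that controls loop repetition."""
    FIXED = "fixed"
    INCREASING = "increasing"
    DECREASING = "decreasing"
    BOUNDED = "bounded"
    UNKNOWN = "unknown"  # e.g. the value is determined by network input


@dataclass
class BasicBlock:
    """Hypothetical record for one basic block of the loop-aware CFG."""
    index: int
    prev_indices: List[int] = field(default_factory=list)
    next_indices: List[int] = field(default_factory=list)
    in_loop: bool = False
    termination: Optional[LoopTermination] = None
    register_change: Optional[RegisterChange] = None  # meaningful for finite loops

    def loop_type(self) -> Tuple[str, ...]:
        """Return the loop-type labels, or an empty tuple outside any loop."""
        if not self.in_loop:
            return ()
        labels = []
        if self.termination is not None:
            labels.append(self.termination.value)
        if self.register_change is not None:
            labels.append(self.register_change.value)
        return tuple(labels)


# An illustrative block inside a finite loop whose counter register decreases.
block = BasicBlock(
    index=5,
    prev_indices=[4],
    next_indices=[3],
    in_loop=True,
    termination=LoopTermination.FINITE,
    register_change=RegisterChange.DECREASING,
)
print(block.loop_type())
```

Lists are used for the previous and next indexes because a block at a branch point may have more than one neighbor; the disclosure itself only speaks of "the index" of the connected block.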

The API call graph generator 120 may generate, from the CFG generated by the CFG generator 110, an API call graph representing the call information of at least one API block.

Because malicious code calls APIs to perform specific actions, API call information is important information that reflects the behavior of the malicious code. Accordingly, according to an embodiment of the present disclosure, the API call graph generator 120 may generate the API call graph by connecting only those basic blocks of the generated CFG in which an API call is performed. In this sense, an API block may refer to a basic block in which an API call is performed.

The API call graph may consist of at least one API block. Each API block may be configured to include information such as its own index, the index of the previously called API block, the index of the API block after the branch (the next one called), whether it is included in a loop, the index of the API block corresponding to the starting point of the loop when it is included in a loop, and the loop type.

The feature vector generator 130 may generate at least one feature vector having a preset length based on the information stored in each of the at least one API block included in the API call graph. For example, the feature vector generator 130 may perform feature hashing on the information stored in an API block to generate a feature vector having a value (or a string, etc.) of a preset length.
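As a rough illustration of feature hashing, the sketch below maps the fields of one API block to a fixed-length vector. It is a minimal sketch under assumptions: the function name `hash_features`, the field names in `block_info`, the use of MD5 as the hash function, and the vector length of 16 are all hypothetical choices, and the `api` field is assumed for illustration since the text does not list the API name among the block fields. The unsigned "count per bucket" variant of the hashing trick is used here; signed variants are also common.

```python
import hashlib
from typing import Dict, List


def hash_features(block_info: Dict[str, object], length: int = 16) -> List[int]:
    """Map the key/value pairs of one API block to a vector of fixed length.

    Each "key=value" token is hashed, and the hash selects one bucket of the
    output vector, whose count is incremented. The vector length is therefore
    independent of how many fields the block has.
    """
    vector = [0] * length
    for key, value in sorted(block_info.items()):
        token = f"{key}={value}".encode("utf-8")
        digest = hashlib.md5(token).digest()
        bucket = int.from_bytes(digest[:4], "big") % length
        vector[bucket] += 1
    return vector


# Hypothetical API block record (field names are assumptions).
block_info = {
    "api": "onCreate",
    "index": 0,
    "prev": 2,
    "next": 2,
    "in_loop": True,
    "loop_start": 0,
    "loop_type": "finite/increasing",
}
vector = hash_features(block_info)
print(len(vector), sum(vector))
```

Because distinct tokens can collide in the same bucket, a larger `length` trades memory for fewer collisions.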

The malicious code analyzer 140 may perform malicious code analysis on the application based on the at least one feature vector generated by the feature vector generator 130. For example, the malicious code analyzer 140 may analyze whether the application contains malicious code and, if it does, may classify the type of the malicious code. Based on the analysis result, the malicious code analyzer 140 may classify the malicious code contained in the application into various known types such as virus, worm, trojan, ransomware, smishing, and infinite loop.

For example, the malicious code analyzer 140 may be implemented as, or configured to include, a neural network built through deep learning. The feature vectors may vary along the control flow (or API call flow), and the characteristics of malicious code are information that can be classified by considering not only the feature vector of a specific API but also how the feature vectors change along the execution flow of the application. Accordingly, the malicious code analyzer 140 according to an embodiment of the present disclosure may be implemented through deep machine learning based on a recurrent neural network (RNN), a long short-term memory (LSTM) network, a gated recurrent unit (GRU), or the like, rather than on a conventional feedforward neural network.
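To make the role of a recurrent network concrete, the following sketch runs a sequence of feature vectors through a single LSTM cell and turns the final hidden state into class probabilities. It is a toy, pure-Python illustration with untrained random weights, not the analyzer of the disclosure: the names `LSTMCell` and `classify`, the layer sizes, and the three output classes are assumptions, and a practical implementation would use a deep learning framework and trained parameters.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix: List[List[float]], vec: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LSTMCell:
    """Single LSTM cell with input (i), forget (f), output (o) and candidate (g) gates."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size  # each gate sees [x_t, h_{t-1}]
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for gate in "ifog"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifog"}

    def step(self, x: List[float], h: List[float], c: List[float]):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {gate: [a + b for a, b in zip(matvec(self.w[gate], z), self.b[gate])]
               for gate in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def classify(sequence: List[List[float]], cell: LSTMCell,
             w_out: List[List[float]]) -> List[float]:
    """Feed the vectors in call-flow order and softmax the final hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in sequence:
        h, c = cell.step(x, h, c)
    logits = matvec(w_out, h)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical 4-dimensional feature vectors, one per API block.
sequence = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
cell = LSTMCell(input_size=4, hidden_size=5)
rng = random.Random(1)
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(3)]  # 3 classes
probs = classify(sequence, cell, w_out)
print(probs)
```

The point of the sketch is the loop in `classify`: the hidden and cell states carry information from earlier API blocks forward, which is what lets a recurrent model take the order of calls into account, unlike a feedforward network that sees each vector in isolation.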

In addition, the malicious code analyzer 140 may be implemented to update the neural network through continual learning using the analysis results obtained from the input feature vectors. Accordingly, even when an application contains new or variant malicious code, the presence and characteristics of the malicious code can be analyzed more effectively.

FIG. 2 is a flowchart illustrating an application malicious code analysis method according to an exemplary embodiment of the present disclosure. FIG. 3 is an exemplary diagram showing the control flow of an application. FIG. 4 shows a CFG generated from the application according to the embodiment of FIG. 3. FIG. 5 shows an API call graph generated based on the CFG of FIG. 4. FIG. 6 is a table explaining an example of a method of generating API feature vectors from an API call graph. FIG. 7 is a diagram explaining an implementation example of a malicious code analyzer that analyzes and classifies malicious code from the generated API feature vectors.

도 2를 참조하면, 악성 코드 분석 시스템(100)은 분석 대상이 되는 애플리케이션으로부터, 루프 정보를 포함하는 제어 흐름 그래프(CFG)를 생성할 수 있다(S200).Referring to FIG. 2 , the malicious code analysis system 100 may generate a control flow graph (CFG) including loop information from an application to be analyzed ( S200 ).

CFG 생성기(110)는 애플리케이션의 디컴파일을 통해 획득되는 실행 코드로부터(예컨대 Smali 코드 등) CFG를 추출 및 생성할 수 있다. 도 1에서 상술한 바와 같이, CFG 생성기(110)는 CLAPP 프레임워크를 기반으로 루프 정보가 포함된 CFG를 추출 및 생성할 수 있다.The CFG generator 110 may extract and generate a CFG from an executable code (eg, Smali code, etc.) obtained through decompilation of an application. As described above in FIG. 1 , the CFG generator 110 may extract and generate a CFG including loop information based on the CLAPP framework.

이와 관련하여 도 3을 참조하면, CFG 생성기(110)는 상기 실행 코드로부터 코드의 분기 형태에 따라 복수의 기본 블록들(B₀, B₁, B₂, B₃)을 구분한 CFG를 생성할 수 있다. 복수의 기본 블록들(B₀, B₁, B₂, B₃) 각각은 코드의 분기에 의해 서로 구분될 수 있다. 이에 따라, 특정 기본 블록 내의 코드는 분기되지 않고 순차적으로 처리 및 실행될 수 있다.Referring to FIG. 3 in this regard, the CFG generator 110 generates a CFG in which a plurality of basic blocks (B ₀ , B ₁ , B ₂ , B ₃ ) are divided according to the branching form of the code from the execution code. can Each of the plurality of basic blocks B ₀ , B ₁ , B ₂ , and B ₃ may be distinguished from each other by branching of the code. Accordingly, the code in a specific basic block can be sequentially processed and executed without branching.

예컨대, 사용자는 애플리케이션의 UI 인터페이스인 Activity를 통해 애플리케이션을 사용할 수 있고, Activity의 사용 시 onCreate API가 가장 먼저 호출될 수 있다. 따라서, onCreate API가 포함된 제1 기본 블록(B₀)이 가장 낮은 인덱스(예컨대 '0')를 가질 수 있다.For example, the user may use the application through Activity, which is the UI interface of the application, and when using the Activity, the onCreate API may be called first. Accordingly, the first basic block B ₀ including the onCreate API may have the lowest index (eg, '0').

또한, if 조건에 따라 코드 분기가 발생하는 위치에서 인덱스가 '1'로 증가한 제2 기본 블록(B₁)이 정의되고, 분기된 코드 위치에 따라 각각 인덱스가 '2' 및 '3'인 제3 기본 블록 및 제4 기본 블록(B₂, B₃)이 정의될 수 있다. 한편, 인덱스가 '2'인 제3 기본 블록(B₂)은 인덱스가 '0'인 제1 기본 블록(B0)으로 분기되므로, 기본 블록들(B₀, B₁, B₂)을 연결하는 루프(L0)가 정의될 수 있다. In addition, the second basic block (B 1 ) whose index is increased to '1' is defined at the position where the code branch occurs according to the if condition, and the second basic block (B ₁ ) whose index is '2' and '3', respectively, is defined according to the branched code position. 3 basic blocks and fourth basic blocks B ₂ and B ₃ may be defined. On the other hand, since the third basic block B ₂ having an index of '2' is branched to the first basic block B0 having an index of '0', the basic blocks B ₀ , B ₁ , B ₂ are connected. A loop L0 may be defined.

CFG 생성기(110)는 루프의 종료 지점 존재 여부 및 루프 반복과 관련된 레지스터값의 변화 유형을 통해, 루프(L₀)의 유형을 정의할 수 있다. 도 3의 실시 예에서, 제2 기본 블록(B₁)에서 루프(L₀)가 종료될 수 있으므로, CFG 생성기(110)는 루프(L₀)가 유한 루프인 것으로 정의할 수 있다. 또한, CFG 생성기(110)는 루프의 반복에 따른 레지스터값(R₃, R₄)의 변화를 통해, 루프의 변화 유형이 '고정', '증가', '감소', '제한적', '알 수 없음' 중 어느 하나인 것으로 정의할 수 있다. The CFG generator 110 may define the type of the loop (L ₀ ) through whether or not there is an end point of the loop and a change type of a register value related to loop iteration. In the embodiment of FIG. 3 , since the loop L ₀ may be terminated in the second basic block B ₁ , the CFG generator 110 may define the loop L ₀ as a finite loop. In addition, the CFG generator 110 through the change of the register values (R ₃ , R ₄ ) according to the repetition of the loop, the change type of the loop is 'fixed', 'increase', 'decrement', 'limited', 'al It can be defined as any one of 'not possible'.

In summary, the CFG generated by the CFG generator 110 includes a plurality of basic blocks (B₀, B₁, B₂, B₃) that are connected to one another according to the branching of the code. Each of the basic blocks (B₀, B₁, B₂, B₃) may include the index of the previously connected block, its own index, the index of the next connected block, whether it is included in a loop, and information on the loop type. For example, the third basic block (B₂) may include information indicating the index of the previously connected block ('1'), its own index ('2'), the index of the next connected block ('0'), whether it is included in a loop ('included'), and the loop type ('finite', 'increase').
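The per-block information summarized above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `BasicBlock` class, the helper names, and the use of depth-first back-edge detection with natural-loop construction are choices made for this sketch and are not taken from the disclosure. The register change type is passed in as a given value rather than derived from real code.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    index: int
    prev: list = field(default_factory=list)   # indices of previously connected blocks
    next: list = field(default_factory=list)   # indices of next connected blocks
    in_loop: bool = False
    loop_type: tuple = ()                      # e.g. ("finite", "increase")


def back_edges(succ, entry=0):
    """Return edges (u, v) where v is still on the DFS stack, i.e. loop back edges."""
    edges, state = [], {}

    def visit(u):
        state[u] = "active"
        for v in succ[u]:
            if state.get(v) == "active":
                edges.append((u, v))
            elif v not in state:
                visit(v)
        state[u] = "done"

    visit(entry)
    return edges


def natural_loop(succ, tail, header):
    """Blocks that can reach `tail` without passing through `header`, plus both."""
    preds = {n: [u for u in succ if n in succ[u]] for n in succ}
    body, work = {header, tail}, [tail]
    while work:
        n = work.pop()
        if n == header:
            continue
        for p in preds[n]:
            if p not in body:
                body.add(p)
                work.append(p)
    return body


def build_cfg(succ, change_type):
    """Build annotated blocks from successor lists; `change_type` is assumed given."""
    blocks = {i: BasicBlock(i, next=list(s)) for i, s in succ.items()}
    for u, targets in succ.items():
        for v in targets:
            blocks[v].prev.append(u)
    for tail, header in back_edges(succ):
        body = natural_loop(succ, tail, header)
        has_exit = any(v not in body for n in body for v in succ[n])
        kind = "finite" if has_exit else "infinite"
        for n in body:
            blocks[n].in_loop = True
            blocks[n].loop_type = (kind, change_type)
    return blocks


# Edges of the FIG. 3 example: B0 -> B1, B1 -> B2 or B3, B2 -> B0.
succ = {0: [1], 1: [2, 3], 2: [0], 3: []}
cfg = build_cfg(succ, "increase")
```

With these inputs, `cfg[2]` carries the previous index 1, its own index 2, the next index 0, the in-loop flag, and the loop type ('finite', 'increase'), matching the example given for the third basic block (B₂).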

The description now returns to FIG. 2.

The malicious code analysis system 100 may generate an API call graph from the generated CFG (S210).

The API call graph generator 120 may generate the API call graph by connecting, among the basic blocks included in the generated CFG, the basic blocks in which an API call is performed.

Referring to the example CFG 400 shown in FIG. 4, among the basic blocks (B₀, B₁, B₂, B₃), each of the first basic block (B₀), the third basic block (B₂), and the fourth basic block (B₃) may include API call code, while the second basic block (B₁) may not include API call code. The APIs called by the respective basic blocks may be the same or different.

Referring to FIG. 5, the API call graph generator 120 may generate the API call graph 500 by connecting the first basic block (B₀), the third basic block (B₂), and the fourth basic block (B₃), which include API call code. Here, the first basic block (B₀) corresponds to the first API block (A₀), the third basic block (B₂) corresponds to the second API block (A₁), and the fourth basic block (B₃) corresponds to the third API block (A₂). The first API block (A₀) and the second API block (A₁) are blocks included in the loop (L₀), and the third API block (A₂) is a block not included in the loop (L₀).

The API blocks (A₀, A₁, A₂) included in the API call graph 500 may include the API name, the index of the previously connected API block, their own index, the index of the next connected API block, whether they are included in a loop, the index of the API block corresponding to the starting point of the loop when they are included in a loop, and information on the loop type. For example, the second API block (A₁) may include information indicating the index of the previously connected block ('0'), its own index ('1'), the index of the next connected block ('0'), whether it is included in a loop ('included'), the index of the API block corresponding to the loop starting point ('1'), and the loop type ('finite', 'increase', etc.).
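The derivation of the API call graph from the CFG can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function name `build_api_call_graph`, the dictionary layout, and the API names `api_x` and `api_y` are hypothetical (only onCreate appears in the description). The sketch keeps the basic blocks that call an API and links each one to the next API-calling blocks reachable through blocks that call no API.

```python
def build_api_call_graph(succ, api_of):
    """succ: CFG successor lists keyed by basic-block index.
    api_of: maps the index of each API-calling basic block to its API name.
    """
    api_blocks = sorted(api_of)
    # Renumber API-calling blocks in index order: B0 -> A0, B2 -> A1, B3 -> A2.
    api_index = {b: i for i, b in enumerate(api_blocks)}

    def next_api_blocks(start):
        """Follow CFG edges from `start`, skipping blocks without an API call."""
        found, seen, work = [], set(), list(succ[start])
        while work:
            b = work.pop(0)
            if b in seen:
                continue
            seen.add(b)
            if b in api_of:
                found.append(api_index[b])   # stop at the first API block on this path
            else:
                work.extend(succ[b])         # pass through a non-API block
        return sorted(found)

    graph = {
        api_index[b]: {"api": api_of[b], "prev": [], "next": next_api_blocks(b)}
        for b in api_blocks
    }
    for a, node in graph.items():
        for n in node["next"]:
            graph[n]["prev"].append(a)
    return graph


# CFG of the FIG. 4 example; B1 contains no API call.
succ = {0: [1], 1: [2, 3], 2: [0], 3: []}
api_of = {0: "onCreate", 2: "api_x", 3: "api_y"}   # api_x / api_y are placeholders
api_graph = build_api_call_graph(succ, api_of)
```

For this input, `api_graph[1]` (the second API block, A₁) has previous index 0 and next index 0, in line with the example above. Loop membership and loop type would be carried over from the corresponding basic blocks and are omitted here for brevity.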

The description now returns to FIG. 2.

Based on the generated API call graph, the malicious code analysis system 100 may generate feature vectors of the API blocks included in the API call graph (S220).

The feature vector generator 130 may generate a feature vector of a preset length for each of the API blocks, based on the information included in the API blocks of the API call graph. Specifically, the feature vector generator 130 may perform feature hashing to convert each piece of information included in an API block into a value (or a string, etc.) of a preset length.

FIG. 6 shows one embodiment of a method for generating a feature vector. This embodiment is provided only for convenience of explanation, and the feature vector generation method applied to the malicious code analysis system 100 according to embodiments of the present disclosure may be modified in various ways.

Referring to the example of FIG. 6, the feature vector generator 130 may perform feature hashing on the API name of each of the API blocks (A₀, A₁, A₂) to generate a 16-byte first partial feature vector for each API block. When an API block is included in a loop, the feature vector generator 130 may perform feature hashing on the API name at the loop starting point to generate a 16-byte second partial feature vector. When an API block is not included in a loop (e.g., the third API block (A₂)), the second partial feature vector for that API block may have a value of 0.

According to an embodiment, when an API block is included in a loop, the feature vector generator 130 may perform feature hashing on the loop type information to generate a 32-byte third partial feature vector. For example, the loop type information may include a word indicating 'infinite' or 'finite' based on the loop exit point, and a word such as 'fixed', 'increase', 'decrease', 'limited', or 'unknown' based on the type of change in the register values. The feature vector generator 130 may generate the third partial feature vector by performing feature hashing on the words included in the loop type information. According to an embodiment, when an API block is not included in a loop (e.g., the third API block (A₂)), the third partial feature vector for that API block may have a value of 0.

According to an embodiment, when an API block is included in a loop, the feature vector generator 130 may perform feature hashing on a string that concatenates all API names included in the loop to generate a 64-byte fourth partial feature vector. When an API block is not included in a loop, the fourth partial feature vector for that API block may have a value of 0.

The feature vector generator 130 may generate the feature vector corresponding to an API block by concatenating the first to fourth partial feature vectors. When three API blocks exist, as in the embodiment of FIGS. 3 to 5, the feature vector generator 130 may generate a feature vector for each API block, for a total of three feature vectors. The number of feature vectors may, however, vary.
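The construction of a per-block feature vector from four partial vectors can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not in the disclosure: the hashing trick is implemented with SHA-256 from the standard library, names are split into character trigrams before hashing, each "byte" of the vector is modeled as one small signed integer, and the function names and the placeholder API name `api_x` are hypothetical.

```python
import hashlib


def feature_hash(tokens, n):
    """Hashing trick: each token adds +1 or -1 to one of n buckets."""
    vec = [0] * n
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % n   # bucket chosen by the hash
        sign = 1 if digest[4] & 1 else -1             # sign chosen by another hash bit
        vec[idx] += sign
    return vec


def trigrams(s):
    """Character trigrams of s (assumed tokenization for names)."""
    return [s[i:i + 3] for i in range(max(1, len(s) - 2))]


def block_feature_vector(api_name, loop_start_api=None, loop_type=(), loop_apis=()):
    """Concatenate the four partial vectors (16 + 16 + 32 + 64 = 128 entries).

    Blocks outside any loop pass loop_start_api=None and get zeros for parts 2 to 4.
    """
    in_loop = loop_start_api is not None
    part1 = feature_hash(trigrams(api_name), 16)
    part2 = feature_hash(trigrams(loop_start_api), 16) if in_loop else [0] * 16
    part3 = feature_hash(list(loop_type), 32) if in_loop else [0] * 32
    part4 = feature_hash(trigrams("".join(loop_apis)), 64) if in_loop else [0] * 64
    return part1 + part2 + part3 + part4


# A block inside the loop and a block outside it; "api_x" is a placeholder name.
v_in = block_feature_vector("api_x", "onCreate", ("finite", "increase"),
                            ("onCreate", "api_x"))
v_out = block_feature_vector("api_y")
```

Using a cryptographic hash instead of Python's built-in `hash` keeps the vectors identical across runs, which matters if the vectors are stored or used to train a model.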

The description now returns to FIG. 2.

The malicious code analysis system 100 may analyze the malicious code of the application based on the generated feature vectors (S230).

The malicious code analyzer 140 may analyze the malicious code of the application from the feature vectors obtained from the API call graph (API feature vectors). As described above with reference to FIG. 1, the malicious code analyzer 140 may be implemented using artificial-intelligence-based deep learning, and may be implemented in hardware, software, or a combination thereof.

In addition, as described above, the malicious code analyzer 140 may be implemented using deep learning based on an RNN, LSTM, GRU, or the like, rather than a conventional feedforward neural network. In the remainder of this specification, the malicious code analyzer 140 is assumed to be implemented using an LSTM.

Referring to FIG. 7, the malicious code analyzer 140 implemented using an LSTM may include an input layer that receives the feature vectors (API feature vectors), an LSTM layer in which learning is performed using the input feature vectors, a fully connected layer and a softmax layer that determine, from the input feature vectors, a probability for each type of malicious code and for normal code, and a classification layer that classifies the application as a malicious code type or as normal code based on the determined probabilities.

Compared with a general RNN, an LSTM can effectively process even long time-series data. The feature vectors may be input to the malicious code analyzer 140 sequentially, based on the indices of the API blocks included in the API call graph. Accordingly, the malicious code analyzer 140 according to embodiments of the present disclosure, being implemented based on an LSTM, can provide highly accurate malicious code analysis results from feature vectors that represent the variable-length and long API call characteristics of malicious code.

For example, the analysis result output from the malicious code analyzer 140 may be classified as normal code or malicious code. Malicious code may further be classified as a virus, worm, trojan, ransomware, smishing, infinite loop, or the like, and any one of these may be provided as the analysis result.
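The final softmax and classification steps described above can be sketched in Python as follows. This is only an illustration: the LSTM and fully connected layers are not reproduced, the `logits` passed in stand for hypothetical outputs of the fully connected layer, and the label list and function names are assumptions based on the categories named in the description.

```python
import math

# Assumed label order; the disclosure lists these categories but not an ordering.
LABELS = ["normal", "virus", "worm", "trojan", "ransomware", "smishing", "infinite loop"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up scores for one application; the third class has the highest score.
label, prob = classify([0.1, 0.2, 3.0, 0.0, 0.0, 0.0, 0.0])
```

In a complete system, the scores would come from a trained recurrent model that consumes the per-block feature vectors in API block index order.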

That is, according to embodiments of the present disclosure, the malicious code analysis system 100 can analyze malicious code by using the loop information of the application's execution code, and can therefore accurately analyze even malicious behavior that depends on the loop repetition pattern or the number of iterations.

The above descriptions of the embodiments are merely examples given with reference to the drawings for a more thorough understanding of the present disclosure, and should not be construed as limiting the technical idea of the present disclosure.

In addition, it will be apparent to those of ordinary skill in the art to which the present disclosure pertains that various changes and modifications are possible without departing from the basic principles of the present disclosure.

Claims

A system for analyzing malicious code in an application, comprising:
a CFG generator that generates a control flow graph (CFG) including loop information from the execution code of the application;
an API call graph generator that generates an API call graph representing an API call flow from the generated CFG;
a feature vector generator that generates a feature vector of each of at least one API block included in the generated API call graph; and
a malicious code analyzer that analyzes malicious code of the application from the generated at least one feature vector.

The system of claim 1, wherein
the CFG includes at least one basic block divided based on branches of the execution code of the application, and
each of the at least one basic block includes
at least one of whether it is included in a loop and a type of the loop.

The system of claim 2, wherein
the loop type is
defined by at least one of whether an exit point of the loop exists and the type of change in a register value associated with iteration of the loop.

The system of claim 2, wherein
each of the at least one basic block
further includes at least one of its own index, an index of a previously connected basic block, and an index of a next connected basic block.

The system of claim 2, wherein
the API call graph generator
generates the API call graph by connecting at least one API block, each being a basic block in which an API call is performed among the at least one basic block included in the CFG.

The system of claim 5, wherein
each of the at least one API block
includes at least one of whether it is included in a loop, an index of the API block corresponding to the starting point of the loop when it is included in a loop, and a type of the loop.

The system of claim 6, wherein
each of the at least one API block
further includes at least one of its own index, an index of a previously connected API block, and an index of a next connected API block.

The system of claim 1, wherein
the feature vector generator
performs feature hashing on information included in each of the at least one API block to generate a feature vector of a preset length corresponding to each of the at least one API block.

The system of claim 1, wherein
the malicious code analyzer
includes an artificial-intelligence-based neural network that classifies, from the at least one feature vector, whether malicious code is present and the type of the malicious code, and
the neural network includes a long short-term memory (LSTM).

A method for analyzing malicious code in an application, comprising:
generating, from the execution code of the application, a control flow graph (CFG) including at least one basic block and loop information for each of the at least one basic block;
generating an API call graph representing an API call flow from the generated CFG;
generating a feature vector of each of at least one API block included in the generated API call graph; and
analyzing malicious code of the application from the generated at least one feature vector.

The method of claim 10, wherein
the loop information includes
at least one of whether the corresponding basic block is included in a loop and a loop type, and
the loop type is
defined by at least one of whether an exit point of the loop exists and the type of change in a register value associated with iteration of the loop.

The method of claim 10, wherein
each of the at least one basic block
further includes information on at least one of its own index, an index of a previously connected basic block, and an index of a next connected basic block.

The method of claim 10, wherein
the generating of the API call graph includes
generating the API call graph by connecting at least one API block, each being a basic block in which an API call is performed among the at least one basic block included in the CFG, and
each of the at least one API block
includes at least one of whether it is included in a loop, an index of the API block corresponding to the starting point of the loop when it is included in a loop, and a type of the loop.

The method of claim 13, wherein
the generating of the feature vector includes
generating a feature vector of a preset length corresponding to each of the at least one API block by performing feature hashing on information included in each of the at least one API block.

The method of claim 10, wherein
the analyzing of the malicious code includes
sequentially inputting the at least one feature vector into an artificial-intelligence-based neural network according to the flow of the corresponding API blocks, and classifying whether the application contains malicious code and the type of the malicious code, and
the neural network includes a long short-term memory (LSTM).