KR20200076426A

KR20200076426A - Method and apparatus for malicious detection based on heterogeneous information network

Info

Publication number: KR20200076426A
Application number: KR1020180165522A
Authority: KR
Inventors: 김성열; 은상남; 진치국; 강호석
Original assignee: 건국대학교 산학협력단
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2020-06-29
Also published as: KR102151318B1

Abstract

Disclosed are a method and an apparatus for detecting malicious code based on a heterogeneous information network. According to one embodiment of the present invention, the method for detecting malicious software comprises the steps of: extracting features from portable executable (PE) files; generating a heterogeneous information network (HIN) for the relationships between the PE files and the features; and detecting whether the PE files are malicious software by using a meta-path on the HIN.

Description

METHOD AND APPARATUS FOR MALICIOUS DETECTION BASED ON HETEROGENEOUS INFORMATION NETWORK

The embodiments below relate to a method and apparatus for detecting malicious code based on a heterogeneous information network.

Methods for analyzing malicious code in software include signature inspection, cyclic redundancy check (CRC) inspection, and heuristic inspection.

Signature inspection is one of the methods by which a security program diagnoses malicious code, much as a fingerprint is used to identify a person. That is, the unique character strings (patterns) contained in malicious code are collected and stored in a database, and the security program analyzes malicious code by matching against these patterns.

CRC inspection is a kind of error detection method originally used to verify the reliability of data in serial transmission. It has the advantage of a low misdiagnosis rate, but the disadvantage that malicious code can no longer be diagnosed if even a single byte of the data is altered.
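As a non-limiting illustration of this limitation, the following Python sketch uses the standard-library `zlib.crc32` function. The byte strings and the helper name `matches_known_crc` are assumptions made up for this example and do not come from the embodiments.

```python
import zlib

# Hypothetical "known malicious" payload and its stored checksum.
known_sample = b"MZ\x90\x00 example payload"
known_crc = zlib.crc32(known_sample)

def matches_known_crc(data: bytes) -> bool:
    """Return True only if the CRC32 of data equals the stored checksum."""
    return zlib.crc32(data) == known_crc

# The same payload with only its last byte changed.
variant = known_sample[:-1] + b"X"

print(matches_known_crc(known_sample))  # the exact copy is matched
print(matches_known_crc(variant))       # the one-byte variant is missed
```

Because the checksum depends on every byte, a trivially modified variant no longer matches the stored value.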

Recently, heuristic techniques, which improve on signature inspection, have mainly been used to analyze malicious code. These are learning-based analysis methods that learn on their own by analyzing the behavior or the method of operation of malicious code. That is, malicious software often uses a distinctive combination of API commands, and a heuristic technique learns such combinations and determines whether code is malicious based on its API commands.

With the rapid development of the Internet, various types of security threats have rapidly increased, among which malicious software on the traditional PC platform has become the most widespread. Owing to the development of complex packing, obfuscation, anti-sandboxing, virtual penetration, and other techniques, existing malicious software detection methods are unsatisfactory.

In the era of information networks, more and more malicious software poses a serious threat to security. Detecting malicious software attacks in a timely and effective manner is especially important. Against increasingly sophisticated malicious software, new defense technologies are required that can detect and respond to new attacks and threats.

Embodiments may provide a technique that analyzes PE files to extract features, constructs an HIN for the relationships between the features, and then uses a meta-path-based method to detect whether a given PE file is malicious software.

A malicious software detection method according to an embodiment includes extracting features from Portable Executable (PE) files, generating a heterogeneous information network (HIN) for the relationships between the PE files and the features, and detecting whether the PE files are malicious software using a meta-path on the HIN.

The features may include PE header information, API calls, DLLs, and Opcode sequences.

The detecting may include detecting whether the PE files are malicious software by calculating the similarity between the PE files using the meta-path on the HIN.

The relationships may include a first relationship between a PE file and API calls, a second relationship for property values of the PE header information of a PE file, a third relationship for the package to which an API call belongs, a fourth relationship for the API sequence of a PE file, and a fifth relationship for the Opcode sequence of a PE file.

The meta-path may include a first meta-path constructed through a first matrix expressing the first relationship, a second meta-path constructed through a second matrix expressing the second relationship, a third meta-path constructed through a third matrix expressing the third relationship, a fourth meta-path constructed through a fourth matrix expressing the fourth relationship, and a fifth meta-path constructed through a fifth matrix expressing the fifth relationship.

The meta-path may further include a meta-path obtained by optimizing and linearly combining the first meta-path, the second meta-path, the third meta-path, the fourth meta-path, and the fifth meta-path through multi-kernel learning.

Alternatively, the meta-path may be a meta-path obtained by optimizing, through multi-kernel learning, and linearly combining the first meta-path constructed through the first matrix expressing the first relationship, the second meta-path constructed through the second matrix expressing the second relationship, the third meta-path constructed through the third matrix expressing the third relationship, the fourth meta-path constructed through the fourth matrix expressing the fourth relationship, and the fifth meta-path constructed through the fifth matrix expressing the fifth relationship.

A malicious software detection apparatus according to an embodiment includes a receiver that receives Portable Executable (PE) files, and a controller that extracts features from the PE files, generates a heterogeneous information network (HIN) for the relationships between the PE files and the features, and detects whether the PE files are malicious software using a meta-path on the HIN.

The controller may detect whether the PE files are malicious software by calculating the similarity between the PE files using the meta-path on the HIN.

FIG. 1 shows an apparatus for performing a malicious software detection method according to an embodiment.
FIG. 2 is a schematic block diagram of the malicious software detection apparatus shown in FIG. 1.
FIG. 3 is a schematic block diagram of the controller shown in FIG. 2.
FIG. 4 is a schematic block diagram of the PE file analyzer shown in FIG. 3.
FIGS. 5A and 5B are diagrams for explaining an example of an operation in which the PE file analyzer shown in FIG. 3 extracts features.
FIG. 6 is a diagram for explaining the operation of the HIN generator shown in FIG. 3.
FIG. 7 is a diagram for explaining the multi-kernel learner and the classification model shown in FIG. 3.
FIG. 8 is a graph for explaining experimental results obtained with the malicious software detection method according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. However, various modifications may be made to the embodiments, and the scope of rights of the patent application is not restricted or limited by these embodiments. It should be understood that all modifications, equivalents, and substitutes for the embodiments are included in the scope of rights.

The terms used in the embodiments are for descriptive purposes only and should not be construed as limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Terms such as first or second may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another; for example, without departing from the scope of rights according to the concept of the embodiments, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the drawing, and redundant descriptions thereof are omitted. In describing the embodiments, detailed descriptions of related well-known technologies are omitted where they could unnecessarily obscure the gist of the embodiments.

FIG. 1 shows an apparatus for performing a malicious software detection method according to an embodiment.

The malicious software detection apparatus 100 may perform a new malicious software detection method that not only relies on API calls but also analyzes the relationships between them and generates higher-level semantics, so that attackers cannot easily evade detection. In this way, the malicious software detection apparatus 100 can detect malicious code in various operating systems, such as Android malicious code and Windows malicious code.

The malicious software detection apparatus 100 may build a heterogeneous information network (HIN) from the rich relationships between software and the related APIs, and then perform a meta-path-based method to analyze the semantic relevance of the software and the APIs.

The malicious software detection apparatus 100 calculates the similarity between pieces of software (for example, PE files) using each meta-path, and trains a detection model by using multi-kernel learning (MKL) to aggregate the different similarities.

The malicious software detection apparatus 100 can achieve a relatively high detection rate and a low false detection rate.

FIG. 2 is a schematic block diagram of the malicious software detection apparatus shown in FIG. 1.

The malicious software detection apparatus 100 includes a receiver 200, a controller 300, and a memory 400.

The memory 400 may store instructions (or programs) executable by the controller 300. For example, the instructions may include instructions for executing the operation of each component included in the controller 300 (310 to 370 shown in FIG. 3).

The receiver 200 may receive PE files.

The controller 300 may control the overall operation of the malicious software detection apparatus 100. The controller 300 may detect whether the PE files received from the receiver 200 are malicious software.

The controller 300 may analyze the PE files and analyze the relationships between them in more detail, including whether they use the same package name or contain the same property value. The controller 300 may perform a higher-level semantic analysis through the relationships between APIs and PE files and the various types of relationships among the PE files themselves.

To express the rich semantics of these relationships, the controller 300 may generate a structured heterogeneous information network (HIN) representation of the PE files and APIs. The controller 300 may then use meta-paths to integrate higher-level semantics and establish the semantic relevance of the PE files.

In this way, the controller 300 can calculate the similarity between PE files by computing not only whether they use the same APIs but also whether their usage patterns, such as use of the same package, are similar. Because different paths describe the similarity between the same two PE files, the controller 300 can use multi-kernel learning algorithms to learn the weights of the different similarities automatically from data.

The controller 300 may detect whether the PE files are malicious software by using the meta-path learned through the multi-kernel learning algorithm.

FIG. 3 is a schematic block diagram of the controller shown in FIG. 2.

The controller 300 includes a PE file analyzer 310, a HIN generator (HIN constructor) 330, a multi-kernel learner 350, and a classification model 370.

The PE file analyzer 310 may parse the PE tables of all executable files (PE files) and extract, as features, all PE header information, DLL names, Opcode sequences, and the API functions inside each DLL.
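As a non-limiting illustration of the header-parsing step only, the following Python sketch reads the COFF file header of a PE image with the standard-library `struct` module. The function names `parse_coff_header` and `make_toy_pe` are assumptions introduced for this example; the field layout follows the public PE/COFF format, and extraction of DLL names, API functions, and Opcode sequences is not covered here.

```python
import struct

def parse_coff_header(data: bytes) -> dict:
    """Read the DOS header pointer, the PE signature and the COFF header fields."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic")
    # e_lfanew: file offset of the PE signature, stored at offset 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    fields = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    names = ("machine", "number_of_sections", "time_date_stamp",
             "pointer_to_symbol_table", "number_of_symbols",
             "size_of_optional_header", "characteristics")
    return dict(zip(names, fields))

def make_toy_pe() -> bytes:
    """Build a tiny synthetic header (not a loadable executable) for the demo."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature follows the DOS header
    coff = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff

header = parse_coff_header(make_toy_pe())
print(header["machine"], header["number_of_sections"])
```

Each field name and value read this way could serve as one property and property value of the PE header information.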

The HIN generator 330 may construct the HIN based on the extracted features. The HIN generator 330 may first establish the connections between the PE files and the extracted API calls, and then define the types of relationships among these API calls and the connections between the PE files and the PE file information.

Adjacency matrices between the different object types may then be generated, and the commuting matrices of the different meta-paths may be enumerated and constructed.
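As a non-limiting illustration, a commuting matrix of a meta-path may be obtained by multiplying the adjacency matrices along the path. The Python sketch below assumes a toy PE-file-to-API adjacency matrix `A` and a toy same-package matrix `B`; these names and values are assumptions made up for this example.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]

# A[i][j] = 1 if PE file i calls API j (3 files, 3 APIs; toy values).
A = [[1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
# B[j][k] = 1 if APIs j and k belong to the same package (APIs 0 and 1 share one).
B = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]

# Meta-path PE -> API -> PE: counts the APIs shared by each pair of files.
M_api = matmul(A, transpose(A))
# Meta-path PE -> API -> (same package) -> API -> PE.
M_pkg = matmul(matmul(A, B), transpose(A))

print(M_api)  # [[2, 1, 0], [1, 1, 0], [0, 0, 1]]
print(M_pkg)  # [[4, 2, 0], [2, 1, 0], [0, 0, 1]]
```

Each entry of such a matrix counts the path instances connecting two PE files and can be used as a similarity between them.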

Given the commuting matrices of the HIN, the multi-kernel learner 350 may build kernels for support vector machines (SVMs). The multi-kernel learner 350 may use standard multi-kernel learning to optimize the weights of the different meta-paths. Given the meta-path weights, all commuting matrices can be combined to formulate a more powerful malware detection kernel.
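As a non-limiting illustration of the linear combination, the Python sketch below weights each commuting matrix and sums them into a single kernel. It does not implement a standard multi-kernel learning solver; as a stand-in, the weights are set in proportion to a kernel-target alignment score, which is a heuristic swapped in for this example. All names and values are assumptions.

```python
import math

def frobenius(x, y):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for rx, ry in zip(x, y) for a, b in zip(rx, ry))

def alignment(kernel, labels):
    """Kernel-target alignment between a kernel and the label matrix y y^T."""
    target = [[yi * yj for yj in labels] for yi in labels]
    denom = math.sqrt(frobenius(kernel, kernel) * frobenius(target, target))
    return frobenius(kernel, target) / denom if denom else 0.0

def combine_kernels(kernels, labels):
    """Weight each kernel by its non-negative alignment and sum the results."""
    scores = [max(alignment(k, labels), 0.0) for k in kernels]
    total = sum(scores) or 1.0
    weights = [s / total for s in scores]
    n = len(labels)
    combined = [[sum(w * k[i][j] for w, k in zip(weights, kernels))
                 for j in range(n)] for i in range(n)]
    return weights, combined

# Toy commuting matrices for two meta-paths over three PE files,
# with labels +1 (malicious) and -1 (benign).
K1 = [[2, 1, 0], [1, 1, 0], [0, 0, 1]]
K2 = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
labels = [1, 1, -1]

weights, K = combine_kernels([K1, K2], labels)
print(weights)
```

The combined matrix `K` could then be passed to an SVM that accepts a precomputed kernel.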

The classification model 370 may also be referred to as a malicious software detector. For each newly collected unknown piece of software, the PE file analyzer 310 first parses its PE file and then extracts the PE header information, DLL names, Opcode sequences, and API calls for further analysis. Using the classification model 370 built from these extracted features, the PE file (for example, the software) may be classified as either benign or malicious.

FIG. 4 is a schematic block diagram of the PE file analyzer shown in FIG. 3, and FIGS. 5A and 5B are diagrams for explaining an example of an operation in which the PE file analyzer shown in FIG. 3 extracts features.

FIGS. 4 and 5 describe in detail how PE files are represented using the extracted features and how the classification problem is solved based on the extracted features.

The PE file analyzer 310 may include a decompiler 313, a feature extractor 315, and an analyzer 317. The decompiler 313 may pre-process the PE files and decompile the pre-processed PE files. The feature extractor 315 may automatically extract features, for example API calls and other related information, from the PE files.

Here, the API calls may be converted into a group of global integer IDs representing the static execution sequence of those API calls. Likewise, the PE file information, Opcode sequences, and DLLs may also be converted into corresponding global integer IDs.
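As a non-limiting illustration, the conversion to global integer IDs may be sketched in Python as a shared vocabulary that assigns each distinct name an ID on first sight. The class name `Vocabulary` is an assumption introduced for this example.

```python
class Vocabulary:
    """Assign a stable integer ID to each distinct token in first-seen order."""

    def __init__(self):
        self._ids = {}

    def encode(self, tokens):
        # A new token receives the next unused ID; a known token keeps its ID.
        return [self._ids.setdefault(t, len(self._ids)) for t in tokens]

api_vocab = Vocabulary()
file_a = api_vocab.encode(["RegCloseKey", "NtClose", "GetProcessHeap"])
file_b = api_vocab.encode(["NtClose", "RegCloseKey"])
print(file_a, file_b)  # [0, 1, 2] [1, 0]
```

Because the vocabulary is shared across files, the same API receives the same ID in every PE file, and the order of each list preserves the call order.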

When extracting an API sequence, the feature extractor 315 must first generate a control flow graph. A control flow is the sequence of program statements along an execution path made up of single assembler instructions. Each function is treated as a basic block, that is, a collection of consecutive assembly instructions. The entry to the control flow is the start instruction of a basic block, and control jumps out of the basic block after its last instruction. Assembly code contains multiple sub-functions, and these sub-functions add jumps between basic blocks, which together make up the control flow graph of the whole program, as shown in FIG. 5A.

Assembly instructions can be divided, according to their role in the control flow graph, into general, jmp, jcc, call, return, and start instructions. To extract API calls, only the call instruction nodes, the main function entry, and the returns are retained. In this way, the feature extractor 315 obtains a simplified diagram of the control flow graph, as shown in FIG. 5B. Finally, the feature extractor 315 traverses the simplified control flow graph in instruction order to extract the API sequence (API call sequence).
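As a non-limiting illustration, the simplification and traversal may be sketched in Python on a flat instruction listing. The listing, the node categories in `KEPT`, and the function names are assumptions made up for this example; an actual control flow graph traversal would follow graph edges rather than a sorted list.

```python
# Each instruction is (address, mnemonic, operand); the values are illustrative.
instructions = [
    (0x1000, "start", "main"),
    (0x1001, "push", "ebp"),
    (0x1003, "call", "RegCloseKey"),
    (0x1008, "jcc", "0x1010"),
    (0x100A, "call", "NtClose"),
    (0x1010, "call", "GetProcessHeap"),
    (0x1015, "ret", ""),
]

KEPT = {"start", "call", "ret"}

def simplify(instrs):
    """Drop general, jmp and jcc nodes; keep entry, call and return nodes."""
    return [ins for ins in instrs if ins[1] in KEPT]

def api_sequence(instrs):
    """Walk the simplified nodes in address order and collect the call targets."""
    ordered = sorted(simplify(instrs), key=lambda ins: ins[0])
    return [operand for _, mnemonic, operand in ordered if mnemonic == "call"]

print(api_sequence(instructions))  # ['RegCloseKey', 'NtClose', 'GetProcessHeap']
```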

The extraction of the opcode sequence is similar to the extraction of the API sequence. A control flow graph is built for the opcodes, and the flow graph is then traversed to extract all possible execution sequences. For a self-loop contained in a sub-function, the feature extractor 315 traverses only the loop itself, so that the opcode sequences of the PE files can be extracted.

The analyzer 317 may generate matrices using the extracted features. The matrices relate to the PE files and may express the relationships between the PE files and the extracted features. For example, the relationships may include a first relationship between a PE file and API calls, a second relationship for property values of the PE header information of a PE file, a third relationship for the package to which an API call belongs, a fourth relationship for the API sequence of a PE file, and a fifth relationship for the Opcode sequence of a PE file.

■ PE header information

The PE header information includes important information such as name, size, offset, and type. Each such item is called a property, and its value is called a property value. Two PE files with the same property value have a similarity. To represent this kind of relationship, the analyzer 317 generates a property-value matrix, in which each element indicates whether a pair of properties has the same value.
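As a non-limiting illustration, one possible layout of such a matrix is sketched below in Python: rows are PE files and columns are the distinct (property, value) pairs, so that two files sharing a property value have a 1 in the same column. This layout, the file names, and the header values are assumptions made up for this example.

```python
headers = {
    "a.exe": {"machine": 0x14C, "number_of_sections": 3},
    "b.exe": {"machine": 0x14C, "number_of_sections": 5},
    "c.exe": {"machine": 0x8664, "number_of_sections": 5},
}

def property_value_matrix(headers):
    """Rows are PE files, columns are distinct (property, value) pairs."""
    files = sorted(headers)
    columns = sorted({(k, v) for props in headers.values() for k, v in props.items()})
    matrix = [[1 if headers[f].get(k) == v else 0 for k, v in columns] for f in files]
    return files, columns, matrix

files, columns, P = property_value_matrix(headers)

def shared_values(i, j):
    """Number of property values that files i and j have in common."""
    return sum(a * b for a, b in zip(P[i], P[j]))

print(shared_values(0, 1), shared_values(0, 2))  # 1 0
```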

■ API package

API calls can be used to represent the behavior of a PE file, and the relationships among them can imply information that is important for detecting malicious software. Every API call of a PE file belongs to the same or a different dynamic link library (DLL). It was found that API calls belonging to the same DLL always show the same intent. APIs in the same DLL are said to belong to the same package.

According to the functions they provide, Windows APIs can be classified into seven categories: basic services (kernel32.dll, advapi32.dll, etc.), graphics device interface, graphical user interface (user32.dll), common dialog links library, universal space link library, Windows shell, and web services.

For example, the API calls in the "advapi32.DLL" DLL are related to registry calls. API calls that appear together in the same package indicate a strong relationship between them.

To represent this kind of relationship, the analyzer 317 generates a co-package matrix, in which each element indicates whether a pair of API calls is in the same package. For example, suppose two different APIs in the same package "advapi32.DLL", "RegDeleteKeyA" and "RegQueryValueExA", are called in two PE files. The value of the matrix element representing the relationship between these two APIs may be 1.
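As a non-limiting illustration, a co-package matrix may be built in Python from a mapping of API names to DLL names. The mapping `api_to_dll` and the function name are assumptions introduced for this example.

```python
# Hypothetical import table summary: API name -> DLL (package) name.
api_to_dll = {
    "RegDeleteKeyA": "advapi32.dll",
    "RegQueryValueExA": "advapi32.dll",
    "CreateFileA": "kernel32.dll",
}

def co_package_matrix(api_to_dll):
    """Element [a][b] is 1 when APIs a and b belong to the same DLL."""
    apis = sorted(api_to_dll)
    matrix = [[1 if api_to_dll[a] == api_to_dll[b] else 0 for b in apis] for a in apis]
    return apis, matrix

apis, B = co_package_matrix(api_to_dll)
i, j = apis.index("RegDeleteKeyA"), apis.index("RegQueryValueExA")
print(B[i][j])  # 1
```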

■ API sequence

During execution of a PE file, the call sequence of API calls represents an important relationship. API call sequence (behavior) data is collected, and 2-sequences of the API features are used to represent the relationships between PE files. For example, if the API sequence called by a PE file is "RegCloseKey → NtClose → GetProcessHeap", it is defined by the two features "RegCloseKey → NtClose" and "NtClose → GetProcessHeap", and this feature type is recorded as an API sequence. Therefore, if two PE files have the same API sequence, they can be considered to have some similarity. To represent this kind of relationship, the analyzer 317 generates an API-sequence matrix, in which each element indicates whether a pair of properties has the same API sequence.
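As a non-limiting illustration, the 2-sequence features may be derived in Python by pairing each item with its successor. The function name `two_sequences` and the second call list are assumptions introduced for this example; the same helper applies to opcode lists.

```python
def two_sequences(tokens):
    """Return the consecutive pairs (2-sequences) of a token sequence."""
    return list(zip(tokens, tokens[1:]))

api_calls = ["RegCloseKey", "NtClose", "GetProcessHeap"]
features = two_sequences(api_calls)
print(features)
# [('RegCloseKey', 'NtClose'), ('NtClose', 'GetProcessHeap')]

# Two files are related when they share at least one 2-sequence.
other_calls = ["NtClose", "GetProcessHeap", "ExitProcess"]
shared = set(features) & set(two_sequences(other_calls))
print(shared)  # {('NtClose', 'GetProcessHeap')}
```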

■ Opcode 시퀀스.■ Opcode sequence.

In malware analysis, the opcode feature can reflect the behavioral characteristics of the executing process and therefore describes the malware in more detail. 2-sequences of the Opcode feature are used to represent the relationship between PE files. For example, a PE file Opcode sequence "push → push → call" is defined by two features, "push → push" and "push → call". This feature type is recorded as an Op-sequence. Two PE files that share an Op-sequence can therefore be regarded as having some similarity. To represent this kind of relationship, the analyzer 317 generates an Op-sequence matrix, each element of which indicates whether a pair of properties has the same Op-sequence.
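One possible reading of the Op-sequence matrix is a 0/1 matrix over PE files whose entry is 1 when two files share at least one opcode 2-sequence. The sketch below follows that reading as an assumption; the function names and the three opcode traces are hypothetical.

```python
def two_sequences(ops):
    """Set of consecutive opcode pairs found in one trace."""
    return set(zip(ops, ops[1:]))


def shared_sequence_matrix(traces):
    """0/1 matrix whose (i, j) entry is 1 when traces i and j share a 2-sequence."""
    grams = [two_sequences(t) for t in traces]
    n = len(grams)
    return [[1 if grams[i] & grams[j] else 0 for j in range(n)] for i in range(n)]


# Hypothetical opcode traces for three PE files.
opcode_traces = [
    ["push", "push", "call"],  # file 0: (push, push), (push, call)
    ["mov", "push", "call"],   # file 1: (mov, push), (push, call)
    ["xor", "ret"],            # file 2: (xor, ret)
]
matrix = shared_sequence_matrix(opcode_traces)
```

Files 0 and 1 are linked because both contain "push → call"; file 2 is linked only to itself.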

Table 1 summarizes the different relationships and describes each element of the corresponding relationship matrices.

FIG. 6 is a diagram for explaining the operation of the HIN generator shown in FIG. 3.

FIG. 6 illustrates how PE files are represented as a HIN built from the features extracted above, so that the rich relationship types among APIs can be analyzed more effectively.

The HIN generator 330 may construct the HIN based on the matrices. A HIN is a graph $G = (V, E)$ with a link type mapping function $\psi: E \to \mathcal{R}$ and an object type mapping function $\phi: V \to \mathcal{A}$. Here, each object $v \in V$ belongs to a particular object type $\phi(v) \in \mathcal{A}$, and each link $e \in E$ belongs to a particular relation $\psi(e) \in \mathcal{R}$. When the number of object types $|\mathcal{A}| > 1$ or the number of link types $|\mathcal{R}| > 1$, the network is called a heterogeneous information network (HIN).
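The definition above can be illustrated with a small typed-graph container. This is only a sketch under assumptions: the class `HIN`, its method names, and the sample objects are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """Graph whose objects and links each carry a type."""
    object_type: dict = field(default_factory=dict)  # object -> object type
    link_type: dict = field(default_factory=dict)    # (source, target) -> relation

    def add_object(self, name, type_):
        self.object_type[name] = type_

    def add_link(self, source, target, relation):
        self.link_type[(source, target)] = relation

    def is_heterogeneous(self):
        # Heterogeneous when there is more than one object type or link type.
        return (len(set(self.object_type.values())) > 1
                or len(set(self.link_type.values())) > 1)


net = HIN()
net.add_object("sample.exe", "PE file")   # hypothetical PE file
net.add_object("NtClose", "API call")
net.add_link("sample.exe", "NtClose", "contains")
print(net.is_heterogeneous())
```

A network holding only PE files joined by a single relation would return `False` from `is_heterogeneous`.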

The system for malicious software detection has five object types: PE file, PE header information, Opcode sequence, API call, and API sequence. There are four relationship types: a PE file containing API calls and PE header information, API calls in the same package, PE files having the same API sequence, and PE files having the same Opcode sequence. Because the different object types and relationship types form a rich network of relationships, the HIN generator 330 can use the meta-path approach of the HIN to formulate higher-level semantic relationships between objects.

A meta path $\mathcal{P}$ is a path on the graph of the network schema $T_G = (\mathcal{A}, \mathcal{R})$. It is written in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between type $A_1$ and type $A_{l+1}$. Here "$\circ$" denotes the composition operator on relations. For PE files, a typical meta path is $\text{PE file} \xrightarrow{\text{contains}} \text{API} \xrightarrow{\text{contains}^{-1}} \text{PE file}$. This means that two different PE files can be linked through the same API in the HIN. The HIN generator 330 may use the PathSim method to compute the similarity of objects along a meta path.

The PathSim method is a meta-path-based similarity measure. Given a symmetric meta path $\mathcal{P}$, PathSim between two objects $x$ and $y$ of the same type is defined as follows.

$$s(x, y) = \frac{2 \times \left|\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in \mathcal{P}\}\right|}{\left|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in \mathcal{P}\}\right| + \left|\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in \mathcal{P}\}\right|}$$

Here, $p_{x \rightsquigarrow y}$ is a path instance between $x$ and $y$, $p_{x \rightsquigarrow x}$ is a path instance between $x$ and $x$, and $p_{y \rightsquigarrow y}$ is a path instance between $y$ and $y$.
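When the path-instance counts are already available as a symmetric matrix `M`, with `M[i][j]` the number of path instances between objects `i` and `j`, PathSim reduces to `2 * M[i][j] / (M[i][i] + M[j][j])`. The sketch below assumes that matrix form; the function name `pathsim` and the counts in `M` are hypothetical.

```python
def pathsim(M, i, j):
    """PathSim score of objects i and j from symmetric path-count matrix M."""
    denom = M[i][i] + M[j][j]
    # Objects with no path instances at all get a score of 0.
    return 2 * M[i][j] / denom if denom else 0.0


# Hypothetical path-instance counts for three PE files.
M = [
    [2, 1, 0],
    [1, 3, 1],
    [0, 1, 1],
]

print(pathsim(M, 0, 1))  # 2 * 1 / (2 + 3)
print(pathsim(M, 0, 0))  # an object compared with itself
print(pathsim(M, 0, 2))  # no connecting path instance
```

The denominator balances the two objects' own path counts, so a file with many path instances does not dominate the score.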

That is, given a meta path $\mathcal{P}$, $s(x, y)$ is defined in terms of two parts: (1) the connectivity of the two objects, defined by the number of paths between them that follow the meta path, and (2) the balance of their visibility, where the visibility of an object is defined as the number of path instances between the object and itself. Multiple occurrences of a path instance can be accounted for through the weight of the path instance, which is the product of the weights of all links in the path instance.

Given a network $G = (V, E)$ and its network schema $T_G$, the commuting matrix for a meta path $\mathcal{P} = (A_1 A_2 \cdots A_{l+1})$ is defined as $M = W_{A_1 A_2} W_{A_2 A_3} \cdots W_{A_l A_{l+1}}$. Here, $W_{A_i A_j}$ is the adjacency matrix between type $A_i$ and type $A_j$. $M(i, j)$ represents the number of path instances between object $x_i \in A_1$ and object $y_j \in A_{l+1}$ under the meta path $\mathcal{P}$.

For example, let the adjacency matrix between PE files and API calls be $S$. Then the commuting matrix of PE files computed using the meta path $\text{PE file} \to \text{API} \to \text{PE file}$ is $M = W_{\text{PE},\text{API}} W_{\text{API},\text{PE}}$, that is, $M = S S^T$. Denoting the $i$-th row of the matrix $S$ by $s_i$, the similarity between PE files $i$ and $j$ is given by $s_i s_j^T$, which is simply the dot product of two feature vectors. More complex similarities can be defined by commuting matrices based on other meta paths. That is, the similarity between two apps (for example, PE files) is computed without being limited to the API calls they contain.
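As a small worked example of the PE file → API → PE file case, the sketch below multiplies a 0/1 PE-file-by-API matrix by its transpose. The matrix `S` (three PE files, four API calls) and the helpers `transpose` and `matmul` are hypothetical and use plain Python lists.

```python
def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


# Hypothetical data: S[i][j] = 1 when PE file i calls API j.
S = [
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
M = matmul(S, transpose(S))

# Entry (0, 1) equals the dot product of rows 0 and 1 of S.
dot_01 = sum(a * b for a, b in zip(S[0], S[1]))
```

Here `M[0][1]` is 1 because files 0 and 1 share exactly one API call, and each diagonal entry counts the API calls of one file.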

FIG. 7 is a diagram for explaining the multi-kernel learner and the classification model shown in FIG. 3.

Given a network schema with many types of entities and relationships, many meta paths can be enumerated. An intuitive approach is therefore to combine different meta paths.

When classifying PE files, the multi-kernel learner 350 may use a multi-kernel learning algorithm to automatically integrate the different similarities and determine the weight of each meta path. Suppose there are $K$ meta paths $\mathcal{P}_k$, $k = 1, \ldots, K$. The multi-kernel learner 350 may compute the commuting matrices $M_k$, $k = 1, \ldots, K$, corresponding to the $K$ meta paths, where each $M_k$ is regarded as a kernel. If a commuting matrix is not positive semi-definite (PSD), its negative eigenvalues are removed. The multi-kernel learner 350 may then form a new kernel as a linear combination of the kernels, as follows.

$$M = \sum_{k=1}^{K} \beta_k M_k$$

Here, $\beta_k \geq 0$, and the weights satisfy $\sum_{k=1}^{K} \beta_k = 1$.
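The linear combination of kernels can be sketched as a weighted sum of equally sized matrices. The sketch assumes the weights are non-negative and sum to one; the function name `combine_kernels`, the two kernels, and the weights are hypothetical, and in the embodiment the weights would come from multi-kernel learning rather than being fixed by hand.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of square kernel matrices given as lists of rows."""
    # Assumed constraint: non-negative weights that sum to 1.
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical commuting matrices treated as kernels.
M1 = [[2, 1], [1, 3]]
M2 = [[1, 0], [0, 1]]
combined = combine_kernels([M1, M2], [0.75, 0.25])
```

A larger weight lets the corresponding meta path contribute more to the combined similarity.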

To learn the weight of each meta path, assume a set of labeled data $\{x_i, y_i\}_{i=1}^{N}$. Here, $x_i$ is a PE file (where $i$ can be regarded as its ID) and $y_i \in \{+1, -1\}$ is its label. The multi-kernel learner 350 may then learn the parameters using the $\ell_p$-norm multi-kernel learning framework with the following objective function.

$$\min_{\beta, w, b, \xi} \; \frac{1}{2} \sum_{k=1}^{K} \frac{\|w_k\|_2^2}{\beta_k} + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \left( \sum_{k=1}^{K} w_k^T \phi_k(x_i) + b \right) + \xi_i \geq 1, \quad \xi \geq 0, \quad \beta \geq 0, \quad \|\beta\|_p^2 \leq 1$$

Here, a parameter vector $w_k$ is learned for each kernel. For each data point $x_i$, a slack parameter $\xi_i$ is introduced to allow misclassification. $\phi_k(x_i)$ is the non-linear mapping of the features into the Hilbert space that defines the kernel, where $M_k(i, j) = \phi_k(x_i)^T \phi_k(x_j)$. Applying the representer theorem then gives $w_k = \beta_k \sum_{i} \alpha_i y_i \phi_k(x_i)$. $\alpha$ can be solved using the dual formulation, and the non-zero $\alpha_i$ correspond to the support vectors.

In the multi-kernel learning framework, the set of parameters other than the coefficients $\alpha_i$ is the kernel weights $\beta$. Here, the $\ell_p$-norm $\|\beta\|_p = \left( \sum_{k} \beta_k^p \right)^{1/p}$ is used to regularize the optimization of the $\beta_k$. Empirically, the 2-norm can be applied to the problem. After optimization, the weights $\beta_k$ indicate the importance of the meta paths used as kernels. When a new PE file $x$ arrives, $f(x) = \sum_{i} \alpha_i y_i \sum_{k} \beta_k M_k(x_i, x) + b$ is used to evaluate whether the PE file is malicious.
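Evaluating a new PE file can be sketched as an SVM-style decision value over the weighted kernels. This assumes the usual form in which each training file contributes its coefficient times its label times the combined kernel value with the new file, plus a bias, and it assumes that +1 labels malicious files. All names and numbers below are hypothetical.

```python
def decision_value(alphas, labels, betas, kernel_columns, bias):
    """SVM-style score; kernel_columns[k][i] is kernel k between training file i and x."""
    total = bias
    for i, (alpha, y) in enumerate(zip(alphas, labels)):
        # Combine the K kernel values for training file i with the meta-path weights.
        combined = sum(beta * col[i] for beta, col in zip(betas, kernel_columns))
        total += alpha * y * combined
    return total


alphas = [0.5, 0.0, 0.5]                  # hypothetical dual coefficients
labels = [+1, -1, -1]                     # assumed: +1 malicious, -1 benign
betas = [0.75, 0.25]                      # hypothetical meta-path weights
kernel_columns = [[2, 1, 0], [4, 0, 0]]   # hypothetical kernel values against x
score = decision_value(alphas, labels, betas, kernel_columns, bias=-0.5)
is_malicious = score > 0
```

Only training files with a non-zero coefficient (the support vectors) affect the score.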

FIG. 8 is a graph for explaining the results of experiments performed with the malicious software detection method according to an embodiment.

The experiment was performed on 1000 malicious samples and 1000 benign samples. The malicious samples included worms, backdoors, Trojans, rootkits, and the like, and the benign samples included viewers, games, browsers, and the like. To evaluate the performance of the proposed method, the accuracy, true positive ratio (TPR), true negative ratio (TNR), false positive ratio (FPR), and false negative ratio (FNR) were measured; these measures are given in Table 2.

In this set of experiments, the malicious and benign samples were randomly divided into five subsets. Four subsets were used to build the classification model (or detection model), comprising 800 benign samples and 800 malicious samples. The remaining subset, comprising 200 benign samples and 200 malicious samples, was used to test the model.
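A split of this shape can be sketched by shuffling each class separately and dealing it into five equal parts. The sample identifiers, seeds, and the helper `five_fold_split` are hypothetical; the description does not say how the random division was actually carried out.

```python
import random


def five_fold_split(samples, seed=0):
    """Shuffle a copy of the samples and deal them into five equal-sized folds."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[k::5] for k in range(5)]


# Placeholder identifiers standing in for the 1000 + 1000 samples.
malicious = [f"mal_{i}" for i in range(1000)]
benign = [f"ben_{i}" for i in range(1000)]

mal_folds = five_fold_split(malicious, seed=1)
ben_folds = five_fold_split(benign, seed=2)

# One fold of each class for testing, the other four for building the model.
test_set = mal_folds[0] + ben_folds[0]
train_set = [s for k in range(1, 5) for s in mal_folds[k] + ben_folds[k]]
print(len(train_set), len(test_set))
```

Splitting each class on its own keeps the 800/800 and 200/200 balance described above.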

Features were extracted from the experimental set and the relationships among them were generated; five meta paths were then constructed, and their detection performance was compared using a support vector machine (SVM). In addition, all of the meta paths ($SS^T$, $SVS^T$, $SPS^T$, $SAS^T$, $SOS^T$) were used in an experiment applying multi-kernel learning. The results of these experiments are shown in Table 3 and FIG. 8.

Among the models built from a single meta path, the one based on the API-sequence meta path gave the best results, with a true detection rate of 95.5% and a false detection rate of 3.5%.

Conversely, the results based on the meta path of PE header information attribute values were poor, with a true detection rate of 84.5% and a false detection rate of 16.5%. This suggests that the PE header information contains many unreliable attribute values, which can degrade the experimental results.

The best results were obtained by using multi-kernel learning (MKL) to combine the detection models built from all of the meta paths: a true detection rate of 98.5%, a false detection rate of 2%, and an accuracy of 98.25%. These results show that the proposed method is effective.
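The reported MKL figures can be checked against each other with the standard confusion-matrix ratios. The counts below are one set consistent with the text, assuming a test fold of 200 malicious and 200 benign samples and reading the true and false detection rates as TPR and FPR; the counts themselves are not stated in the description.

```python
def rates(tp, fn, fp, tn):
    """Standard confusion-matrix ratios, with malicious as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }


# Assumed counts: 197 of 200 malicious detected, 4 of 200 benign flagged.
mkl = rates(tp=197, fn=3, fp=4, tn=196)
for name, value in mkl.items():
    print(f"{name}: {value:.4f}")
```

With these counts the TPR is 98.5%, the FPR is 2%, and the accuracy is 393/400 = 98.25%, matching the figures above.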

Based on the above experimental results, the malicious software detection method according to an embodiment may filter the attributes of the PE header information, remove unreliable attributes, and extract and use more important features that improve the performance of the classification model. In addition, the malicious software detection method according to an embodiment may extend the length of the meta paths and further increase the connections between meta paths (for example, $SOAO^TS^T$, $SVPV^TS^T$, and the like).

A new malicious software detection method based on a heterogeneous information network (HIN) has been described with reference to FIGS. 1 to 8. The embodiments may analyze PE files to extract PE header information, API calls, DLLs, and opcodes as features, build a HIN for the relationships between the attributes, and then use a meta-path-based method to describe the semantic relatedness of the PE files.

To build a detection system (for example, a classification model), the embodiments may apply each meta path to compute the similarity between PE files and use multi-kernel learning to aggregate the different similarities.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A malicious software detection method comprising:
extracting features from portable executable (PE) files;
generating a heterogeneous information network (HIN) for the relationships between the PE files and the features; and
detecting whether the PE files are malicious software by using a meta path on the HIN.

The method of claim 1,
wherein the features include PE header information, API calls, DLLs, and Opcode sequences.

The method of claim 1,
wherein the detecting comprises
detecting whether the PE files are malicious software by calculating the similarity between the PE files using the meta path on the HIN.

The method of claim 1,
wherein the relationships comprise:
a first relationship between PE files and API calls;
a second relationship regarding attribute values of PE header information of the PE files;
a third relationship regarding the package to which an API call belongs;
a fourth relationship regarding the API sequence of a PE file; and
a fifth relationship regarding the Opcode sequence of a PE file.

The method of claim 4,
wherein the meta path comprises:
a first meta path constructed through a first matrix representing the first relationship;
a second meta path constructed through a second matrix representing the second relationship;
a third meta path constructed through a third matrix representing the third relationship;
a fourth meta path constructed through a fourth matrix representing the fourth relationship; and
a fifth meta path constructed through a fifth matrix representing the fifth relationship.

The method of claim 5,
wherein the meta path
further comprises a meta path obtained by linearly combining the first meta path, the second meta path, the third meta path, the fourth meta path, and the fifth meta path, optimized through multi-kernel learning.

The method of claim 4,
wherein the meta path is a meta path obtained by linearly combining, with optimization through multi-kernel learning:
a first meta path constructed through a first matrix representing the first relationship;
a second meta path constructed through a second matrix representing the second relationship;
a third meta path constructed through a third matrix representing the third relationship;
a fourth meta path constructed through a fourth matrix representing the fourth relationship; and
a fifth meta path constructed through a fifth matrix representing the fifth relationship.

A malicious software detection device comprising:
a receiver that receives portable executable (PE) files; and
a controller that extracts features from the PE files, generates a heterogeneous information network (HIN) for the relationships between the PE files and the features, and detects whether the PE files are malicious software by using a meta path on the HIN.

The device of claim 8,
wherein the features include PE header information, API calls, DLLs, and Opcode sequences.

The device of claim 8,
wherein the controller
detects whether the PE files are malicious software by calculating the similarity between the PE files using the meta path on the HIN.

The device of claim 8,
wherein the relationships comprise:
a first relationship between PE files and API calls;
a second relationship regarding attribute values of PE header information of the PE files;
a third relationship regarding the package to which an API call belongs;
a fourth relationship regarding the API sequence of a PE file; and
a fifth relationship regarding the Opcode sequence of a PE file.

The device of claim 11,
wherein the meta path comprises:
a first meta path constructed through a first matrix representing the first relationship;
a second meta path constructed through a second matrix representing the second relationship;
a third meta path constructed through a third matrix representing the third relationship;
a fourth meta path constructed through a fourth matrix representing the fourth relationship; and
a fifth meta path constructed through a fifth matrix representing the fifth relationship.

The device of claim 12,
wherein the meta path
further comprises a meta path obtained by linearly combining the first meta path, the second meta path, the third meta path, the fourth meta path, and the fifth meta path, optimized through multi-kernel learning.

The device of claim 11,
wherein the meta path is a meta path obtained by combining, with optimization through multi-kernel learning:
a first meta path constructed through a first matrix representing the first relationship;
a second meta path constructed through a second matrix representing the second relationship;
a third meta path constructed through a third matrix representing the third relationship;
a fourth meta path constructed through a fourth matrix representing the fourth relationship; and
a fifth meta path constructed through a fifth matrix representing the fifth relationship.