KR20200061830A

KR20200061830A - Malware detection and classification method and system, including pattern key parts of android applications

Info

Publication number: KR20200061830A
Application number: KR1020180147579A
Authority: KR
Inventors: 조성제; 정재민; 이현재
Original assignee: 단국대학교 산학협력단
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2020-06-03
Also published as: KR102140714B1

Abstract

The present invention relates to a technique that patterns (e.g., as an image or as sound) the main part of an Android application that represents executable code and may contain malicious code, and applies the patterned information to machine learning to diagnose the malicious code, thereby improving malware detection performance in the Android environment. A system for detecting and classifying Android malware comprises: an inspection area extraction unit that selectively extracts the main parts of the Android application that may contain malicious code; a conversion unit that generates pattern data by patterning the data of the main parts extracted by the inspection area extraction unit; and a malicious code inspection unit that diagnoses whether the pattern data generated by the conversion unit contains malicious code by comparing it with existing malicious code patterns.

Description

Malware detection classification method and system patterning major parts of Android applications {MALWARE DETECTION AND CLASSIFICATION METHOD AND SYSTEM, INCLUDING PATTERN KEY PARTS OF ANDROID APPLICATIONS}

The present invention relates to a malware detection and classification method and system that patterns the main part of an Android application. More specifically, it relates to a method and system that patterns (e.g., as an image or as sound) the main part of an Android application containing the executable code and applies the patterned information to machine learning to diagnose malicious code, thereby improving malware detection and family classification performance in the Android environment.

Among mobile platforms, malicious code targeting Android is appearing at an ever-increasing rate. According to the security solution vendor McAfee, 57.6 million new malware samples appeared in the third quarter of 2017, setting a new record.

In the face of this explosive growth, existing Android malware detection systems based on signatures and anomaly detection have shown a limitation: their signature and anomaly databases are not updated fast enough to keep up with the current situation.

Because of this limitation, such systems are relatively easy to evade and are vulnerable to new types of malware and zero-day attacks.

Machine learning techniques are being introduced to overcome the limitations of conventional malware detection.

Malware analysis approaches fall broadly into static analysis and dynamic analysis. Static analysis, which examines decompiled or disassembled code without executing the malware, has the advantage of high code coverage, but it is time-consuming and its usefulness is limited by anti-analysis techniques such as obfuscation. Dynamic analysis, in contrast, executes the malware in a restricted environment and analyzes part of it. It is relatively unaffected by anti-analysis techniques and requires relatively little analysis time, but its narrow code coverage makes it hard to cope with logic bombs or time bombs, and constructing the restricted environment is complex.

US Patent Application Publication No. 2012-0047580

The present invention has been proposed to solve the problems of the prior art described above. An object of the present invention is to provide a method and system that patterns (e.g., as an image or as sound) the main part of an Android application containing the executable code and applies the patterned information to machine learning to diagnose malicious code, thereby improving malware detection and family classification performance in the Android environment.

A further object of the present invention is to improve on the prior art, which patterns the entire DEX file of an Android application (APK) and inputs it to machine learning, by providing a method and system that patterns the code_item containing the executable code in the DEX file and applies the patterned information to machine learning to detect and classify malicious code.

To achieve the above objects, a malware detection and classification system that patterns the main part of an Android application according to the technical idea of the present invention comprises: an inspection area extraction unit that selectively extracts the data of the main part of the Android application that may contain malicious code; a conversion unit that patterns the main part extracted by the inspection area extraction unit to generate pattern data; and a malicious code inspection unit that inputs the pattern data generated by the conversion unit into a machine learning model trained on existing malicious code patterns to diagnose whether the pattern data contains malicious code.

In addition, the pattern data generated by the conversion unit may be image-format data.

In addition, the conversion unit may load the main part as binary code, divide it into units of a preset size, and convert each divided unit into a corresponding brightness or color.

In addition, the conversion unit may enlarge or compress the image-format pattern data to match the data format accepted by the machine learning algorithm of the malicious code inspection unit.

In addition, the pattern data generated by the conversion unit may be sound-format data.

In addition, the conversion unit may convert the binary code of the main part into MIDI format and then convert the MIDI data into wav format or MFCC (Mel-Frequency Cepstral Coefficients) format.

In addition, when converting the main part into MIDI format, the conversion unit may divide the main part into 1-byte units and then split each byte into a first channel consisting of 2 bits and a second channel consisting of 6 bits.

In addition, the conversion unit may add a weight to at least one of the channels before converting to MIDI format so that the notes of the first channel and the second channel do not overlap.
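By way of non-limiting illustration, the two-channel split and the weighting described above may be sketched in Python as follows. The function name `split_byte`, the choice of the upper 2 bits for the first channel and the lower 6 bits for the second, and the offset value of 64 are all assumptions made for this sketch; the text above states only that a byte is divided into a 2-bit channel and a 6-bit channel and that a weight is added so the notes do not overlap.

```python
def split_byte(value, ch1_offset=64):
    """Split one byte into two note values, one per channel.

    Assumption: the first channel takes the upper 2 bits and the second
    channel the lower 6 bits. ch1_offset is a hypothetical weight that moves
    the first channel's notes (0..3) above the second channel's range (0..63).
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    ch1 = (value >> 6) & 0x03  # 2-bit part, range 0..3
    ch2 = value & 0x3F         # 6-bit part, range 0..63
    return ch1 + ch1_offset, ch2


def bytes_to_notes(data, ch1_offset=64):
    """Map every byte of the main part to a (channel-1 note, channel-2 note) pair."""
    return [split_byte(b, ch1_offset) for b in data]
```

With an offset of 64, the first channel occupies note values 64 to 67 and the second channel 0 to 63, so both stay within the 0 to 127 MIDI note range and never coincide. Writing the resulting note pairs to an actual MIDI file is outside the scope of this sketch.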

In addition, the main part may include the code_item, which represents the executable code in the DEX file of the Android application.

In addition, the inspection area extraction unit may extract the executable code portion of the Data section together with an additional area corresponding to the programming language of the Android application.

In addition, the additional area may be: files with the so extension when the programming language is C or C++; dll files such as Assembly-CSharp.dll or App.dll when the programming language is C#; dll files such as System.dll and System.core.dll when the application uses .NET libraries; the index.html file when the programming language is HTML; and js files such as index.js when the programming language is Javascript.

In addition, the inspection area extraction unit may extract the executable code portion of the Data section together with an additional area corresponding to the Google Play category of the Android application, and the malicious code inspection unit may perform a diagnosis corresponding to the category of the application.

In addition, the inspection area extraction unit may extract the executable code portion of the Data section together with an additional area corresponding to the malware family of the malicious code that the malicious code inspection unit targets for diagnosis.

Meanwhile, to achieve the above objects, a malware detection and classification method that patterns the main part of an Android application according to the technical idea of the present invention comprises: a step in which the inspection area extraction unit extracts the data of the main part of the Android application that may contain malicious code; a step in which the conversion unit patterns the data of the main part extracted by the inspection area extraction unit to generate pattern data; and a step in which the malicious code inspection unit compares the pattern data generated by the conversion unit with existing malicious code patterns to diagnose whether the pattern data contains malicious code.

The malware detection and classification method and system that patterns the main part of an Android application according to the present invention provide the following effects.

First, conventional Android malware diagnosis techniques take the entire DEX file as the diagnosis target without additional processing, so the data volume for diagnosis is considerable and information unnecessary for malware diagnosis is also taken into account. The present invention extracts only the executable code (code_item) of the Data section of the DEX file for diagnosis, which reduces the data volume to be analyzed while excluding unnecessary information, and can thereby also improve the malware detection rate.

Second, the present invention identifies areas that may additionally be infected depending on the programming language and adds those areas to the inspection data, significantly reducing the probability that a malware scan fails.

Third, because the inspection area extraction unit extracts only the main part of the Android application, significantly less information is lost even when the pattern data is compressed to a size that can be input to machine learning, so the malware detection rate can be significantly improved.

FIG. 1 is a reference diagram showing the files included in an Android application.
FIG. 2 is a reference diagram showing the sections of the classes.dex file included in an Android application and the detailed structure of the Data section.
FIG. 3 is a reference diagram showing DEX class member indexing.
FIG. 4 is a reference diagram showing the relationship between the data section and the class_defs section.
FIG. 5 is a block diagram of an Android malware detection and classification system according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram showing a list of malware families.
FIG. 7 is a diagram showing, for the embodiment that converts the main part into an image format, the process of generating an image by converting the binary code of the main part into corresponding colors.
FIG. 8 is a diagram showing, for the embodiment that converts the main part into a sound format, the process of converting the binary code of the main part into MIDI format.
FIG. 9 is a diagram showing, for the embodiment that converts the main part into a sound format, the conversion of the MIDI data into other formats such as wav and MFCC.
FIG. 10 is a flowchart of an Android malicious app detection method according to an embodiment of the present invention.

A malware detection and classification method and system that patterns the main part of an Android application according to embodiments of the present invention will now be described in detail with reference to the accompanying drawings. The present invention may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to a particular disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. Similar reference numerals are used for similar components throughout the drawings.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

This embodiment may be executed on a computing system consisting of one or more computers, on which the function of each component is carried out.

First, the structure of an Android application is described.

FIG. 1 shows the internal structure of the APK of an Android application developed in Java, which includes files such as AndroidManifest.xml, assets/, META-INF/, lib/, classes.dex, res/, and resources.arsc. (The APK structure may differ for applications developed with cross-platform app development tools such as Unity, Xamarin, PhoneGap, Cordova, and Cocos2d.)

Referring to FIG. 2, the structure of classes.dex (hereinafter, the DEX file) consists broadly of the header; the arrays string_ids, type_ids, proto_ids, field_ids, method_ids, class_defs, and link_data, which store identifier- and class-related offsets; and the Data section, which is the area of the DEX file that holds the actual data and the executable code (instructions).

The Data section consists of code_item, which holds the bytecode and method information; string_data, which stores String values; Optional, which includes debugging-related information and the like; and the map list, which holds the sizes and offsets of all sections and components.

The sections other than the Data section hold offset and size information rather than data.

Referring to FIGS. 3 and 4, the class_defs section of the DEX file contains class_def_item entries, each representing a class. A class_def_item contains class_data_off, which points to a class_data_item.

The class_data_item contains the data of each class, and it resides in the data section of the DEX file.

The class_data_item in the data section contains direct method and virtual method members in the encoded method format. Each method is represented by an encoded method, and the encoded method format includes code_off. code_off is the offset of the code_item, and the code_item contains the instructions of the method. The insns_size and insns member fields of code_item represent the bytecode of the method, that is, its executable code (instructions).
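By way of non-limiting illustration, the chain class_defs, class_data_off, class_data_item, code_off, code_item described above may be followed in Python roughly as sketched below. The function names are hypothetical and not part of the disclosure. The byte offsets and the ULEB128 encoding follow the publicly documented Dalvik executable format; concatenating only the insns arrays is an assumption of this sketch, since the description does not state whether the whole code_item or only its insns field is used.

```python
import struct


def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def iter_code_offsets(dex):
    """Yield the code_off of every method that has a code_item."""
    # class_defs_size and class_defs_off sit at header offsets 96 and 100.
    class_defs_size, class_defs_off = struct.unpack_from("<II", dex, 96)
    for i in range(class_defs_size):
        # Each class_def_item is 32 bytes; class_data_off is its 7th uint32.
        (class_data_off,) = struct.unpack_from("<I", dex, class_defs_off + 32 * i + 24)
        if class_data_off == 0:
            continue  # the class has no class_data_item
        pos = class_data_off
        sizes = []
        for _ in range(4):
            value, pos = read_uleb128(dex, pos)
            sizes.append(value)
        static_fields, instance_fields, direct_methods, virtual_methods = sizes
        for _ in range(static_fields + instance_fields):
            _, pos = read_uleb128(dex, pos)  # field_idx_diff
            _, pos = read_uleb128(dex, pos)  # access_flags
        for _ in range(direct_methods + virtual_methods):
            _, pos = read_uleb128(dex, pos)  # method_idx_diff
            _, pos = read_uleb128(dex, pos)  # access_flags
            code_off, pos = read_uleb128(dex, pos)
            if code_off:  # 0 means abstract or native: no code_item
                yield code_off


def extract_insns(dex):
    """Concatenate the insns arrays (bytecode) of all code_items."""
    out = bytearray()
    for code_off in iter_code_offsets(dex):
        # insns_size (counted in 16-bit code units) is at code_item offset 12.
        (insns_size,) = struct.unpack_from("<I", dex, code_off + 12)
        start = code_off + 16
        out += dex[start:start + 2 * insns_size]
    return bytes(out)
```

The sketch performs no bounds or checksum validation, so it is suited only to well-formed DEX input.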

Based on research experience with Android applications and an in-depth analysis of the DEX file, it was found that the executable code in which malicious code operates is located in code_item.

Referring to FIG. 5, the Android malware detection and classification system 100 according to an embodiment of the present invention includes: an inspection area extraction unit 120 that extracts the data of the main part of the Android application that may contain malicious code; a conversion unit 140 that patterns the main part extracted by the inspection area extraction unit 120 to generate pattern data; and a malicious code inspection unit 160 that inputs the pattern data generated by the conversion unit 140 into a machine learning model trained on existing malicious code patterns to diagnose whether the pattern data contains malicious code.

The system may also include an input unit 110 that receives the Android application to be diagnosed.

The data of the main part includes the code_item of the Data section in the DEX file of the Android application.

The inspection area extraction unit 120 unzips the APK of the Android application to extract the DEX file, then parses the header of the DEX file to obtain the offset of the Data section. It then splits the file at that offset and defines the code_item containing the executable code as the main part.
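By way of non-limiting illustration, the first steps of this extraction (unzipping the APK, parsing the DEX header, and cutting out the Data section) may be sketched in Python as follows. The function names are hypothetical. The header positions of data_size (byte 104) and data_off (byte 108) follow the publicly documented Dalvik executable format, and reading only the single entry named classes.dex is an assumption of this sketch.

```python
import io
import struct
import zipfile

DEX_HEADER_SIZE = 0x70  # fixed length of the DEX header


def read_dex_from_apk(apk_bytes, name="classes.dex"):
    """Unzip the APK in memory and return the raw bytes of the DEX file."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.read(name)


def data_section_bounds(dex):
    """Return (data_off, data_size) as recorded in the DEX header."""
    if len(dex) < DEX_HEADER_SIZE or dex[:4] != b"dex\n":
        raise ValueError("not a DEX file")
    # data_size and data_off are consecutive little-endian uint32 fields.
    data_size, data_off = struct.unpack_from("<II", dex, 104)
    return data_off, data_size


def extract_data_section(dex):
    """Split the file at the Data section offset and return that section."""
    data_off, data_size = data_section_bounds(dex)
    return dex[data_off:data_off + data_size]
```

The returned bytes still contain string_data, the map list, and other items in addition to code_item, so a further pass over the class definitions would be needed to isolate the executable code itself.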

Conventional Android malware diagnosis techniques take the entire DEX file as the diagnosis target without any separate preprocessing, so the data volume for diagnosis is considerable. In contrast, this embodiment of the present invention finely separates the DEX file, extracts only code_item, and diagnoses malicious code on it, so the data volume to be analyzed is significantly reduced without affecting the malware detection rate.

Meanwhile, Android applications can be built in a variety of development environments. For example, applications can be written not only in Java but also in programming languages such as C#, Javascript, C++, and HTML. However, it was found that, depending on the programming language, additional areas arise in which malicious code can reside.

Accordingly, the inspection area extraction unit 120 of the Android malware detection and classification system 100 according to an embodiment of the present invention selects the additional area to be extracted together with code_item according to the programming language of the Android application.

Specifically, the inspection area extraction unit 120 always extracts code_item from applications written in Java. From applications written in C or C++, it extracts code_item and files with the so extension. From applications written in C#, it extracts code_item and dll files such as Assembly-CSharp.dll or App.dll. From applications built with .NET libraries, it extracts code_item and dll files such as System.dll and System.core.dll. From applications written in HTML, it extracts code_item and the index.html file. From applications written in Javascript, it extracts code_item and js files such as index.js.
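By way of non-limiting illustration, the language-dependent selection of additional areas may be sketched in Python as a lookup table applied to the entry names of the APK. The table `EXTRA_PATTERNS`, the function name, and the language keys are hypothetical. Matching every dll or js file by extension (rather than only the files named as examples) is an assumption of this sketch, and how the programming language of an application is determined is not covered here; it is passed in as a parameter.

```python
import fnmatch
import posixpath

# Hypothetical mapping from language to file-name patterns of the additional area.
EXTRA_PATTERNS = {
    "java": [],               # code_item only
    "c/c++": ["*.so"],
    "c#": ["*.dll"],          # e.g. Assembly-CSharp.dll, App.dll
    ".net": ["*.dll"],        # e.g. System.dll, System.core.dll
    "html": ["index.html"],
    "javascript": ["*.js"],   # e.g. index.js
}


def select_extra_entries(entry_names, language):
    """Return the APK entries to extract in addition to code_item."""
    patterns = EXTRA_PATTERNS[language.lower()]
    selected = []
    for name in entry_names:
        # Match on the lower-cased base name so directory prefixes are ignored.
        base = posixpath.basename(name).lower()
        if any(fnmatch.fnmatchcase(base, p) for p in patterns):
            selected.append(name)
    return selected
```

The entry names could come, for example, from `zipfile.ZipFile.namelist()` on the APK under inspection.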

In another embodiment, the inspection area extraction unit 120 may extract code_item, so files, dll files, js files, and html files all together, regardless of the programming language of the application.

Because this embodiment selectively extracts, in addition to code_item, the areas of the application that can be infected with malicious code, the data volume subject to malware inspection is smaller than in the prior art, which inspects the entire application. A smaller data volume shortens the inspection time and enables more precise inspection, among other improvements. Moreover, the prior art tends to inspect only the DEX file, so when an application is written in another programming language the area containing malicious code may go uninspected. This embodiment identifies the areas that may additionally be infected depending on the programming language and adds them to the inspection data, significantly reducing the probability that a malware scan fails.

Applications also differ by category in the classes, methods (APIs), code, components (activities, services, content providers, broadcast receivers), strings, and intents they use. A category is a group that classifies applications by function, such as games, finance, document editors, antivirus, and utilities. Google Play (the name as of 2018), which leads the distribution of Android applications, provides registered applications classified by category. In another embodiment, the inspection area extraction unit 120 selects the additional area to be extracted together with code_item according to the application's Google Play category, and the malicious code inspection unit 160 may perform a diagnosis corresponding to that category. For example, malicious APIs (APIs frequently used by malicious apps) and benign APIs (APIs frequently used by normal apps in each category) may be learned in advance by the machine learning of the malicious code inspection unit 160; the inspection area extraction unit 120 then extracts the APIs called by the application under inspection, so that the extracted APIs can be checked for similarity to the benign API pattern of the relevant category or to the malicious API pattern. In addition, the unique characteristics that applications exhibit by category may be defined in advance so that malware diagnosis is performed according to the category of the application under inspection.

FIG. 6 is a table showing examples of malware families. Malicious code can be classified into malware families according to its type. Types of malicious code include viruses, worms, Trojan horses, backdoors, logic bombs, bots, adware, spyware, and ransomware, and malware families serve as the basis for classifying malicious code by type.

In another embodiment, the inspection area extraction unit 120 may select the additional area to be extracted together with code_item according to the type of malicious code that the malicious code inspection unit 160 targets for diagnosis.

In theory it is desirable to diagnose every kind of malicious code in an application, but in practice companies that provide diagnostic software excel at only some types of malware family. For example, one company may lead in detecting viruses, worms, and Trojan horses yet be ineffective against adware, spyware, and ransomware. Another may lead in diagnosing and treating ransomware yet be ineffective against other malicious code. For this reason, the types of malicious code to be diagnosed may be limited to some of the malware families, in which case the inspection area extraction unit 120 selects the extraction area corresponding to the limited types of malicious code.

도 7을 참조하면, 본 발명의 제1실시예에 따른 안드로이드 멀웨어 탐지 분류 시스템(100)은 주요부분의 데이터 및 추가영역을 이미지 포맷의 패턴데이터로 변환한 후 악성코드를 진단한다. 제1실시예는 변환부(140)에서 생성된 패턴데이터가 이미지 포맷의 데이터이고, 악성코드 검사부(160)는 정적 데이터 분석에 강인한 머신러닝 알고리즘을 이용하여 패턴데이터의 악성코드 포함 여부를 진단한다.Referring to FIG. 7, the Android malware detection classification system 100 according to the first embodiment of the present invention diagnoses malicious code after converting the main part data and additional areas into image format pattern data. In the first embodiment, the pattern data generated by the conversion unit 140 is image format data, and the malicious code inspection unit 160 diagnoses whether the pattern data includes malicious code using a machine learning algorithm robust to static data analysis. .

Specifically, the conversion unit 140 loads the main part in the form of binary code and divides it into predetermined units. It then generates an image by converting each divided unit of binary code into a corresponding brightness or color. Referring to the drawing, this embodiment reads the data extracted by the inspection area extraction unit 120 as binary code, converts it into a vector of 8-bit values, and represents each 8-bit value as one pixel, producing a grayscale image in which each pixel has a value from 0 to 255.

The image consists of rows of a fixed width that depends on the volume of the data. The width is expressed as a number of pixels. In one embodiment, the width may be the number of pixels that makes the image square for the given data volume. For example, if the data volume is 524,288 bits (65,536 bytes), the image width may be 256 pixels, producing a 256×256 image.

Data volume range [KB]    Image width [pixel]
<10                       32
10-30                     64
30-60                     128
60-100                    256
100-200                   384
200-500                   512
500-1000                  768
>1000                     1024

If the binary code ends before completing the final 8 bits, so that a pixel cannot be generated, or if the binary code ends leaving the last region of the rectangular image incomplete, the conversion unit 140 inserts 0-padding into the missing region so that the rectangular image can be completed. In this way, the conversion unit 140 can generate an image of a linear nature.
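The byte-to-pixel conversion, the width table, and the 0-padding described above can be sketched as follows. This is a minimal illustration and not the implementation of the conversion unit 140: the function names are hypothetical, 1 KB is assumed to be 1024 bytes, the handling of values that fall exactly on a range boundary is an assumption, and padding is applied only to complete the last row.

```python
# Width table from the description: (upper bound in KB, image width in pixels).
# Assumption: a volume exactly on a boundary (10 KB, 30 KB, ...) goes to the larger width.
WIDTH_TABLE = [(10, 32), (30, 64), (60, 128), (100, 256),
               (200, 384), (500, 512), (1000, 768)]
MAX_WIDTH = 1024  # used for the largest data volumes


def image_width(num_bytes: int) -> int:
    """Pick the image width in pixels for a given data volume."""
    size_kb = num_bytes / 1024  # assumption: 1 KB = 1024 bytes
    for upper_kb, width in WIDTH_TABLE:
        if size_kb < upper_kb:
            return width
    return MAX_WIDTH


def bytes_to_grayscale(data: bytes) -> list[list[int]]:
    """Map each byte to one pixel (0-255) and zero-pad the last row."""
    width = image_width(len(data))
    pixels = list(data)
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))  # 0-padding of the missing region
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

With the 65,536-byte example given above, `bytes_to_grayscale` yields 256 rows of 256 pixels each, matching the 256×256 image in the text.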

The malicious code inspection unit 160 uses a machine learning algorithm that is robust for static data analysis to diagnose whether the image contains malicious code. Such algorithms include the convolutional neural network (CNN). This embodiment uses a CNN to analyze the imaged pattern data.

A CNN is an algorithm specialized for image analysis. Well-known examples that use CNNs include Google's AlphaGo and Facebook's face recognition algorithm. A CNN is very good at finding the optimal output for a given input and has the advantage of broad code coverage.

In the experiments, the state-of-the-art CNN models Inception-V3 and Inception-ResNet-V2 were used. Inception-V3 is an improved version of GoogLeNet and is applied in many static data analysis studies. Inception-ResNet-V2 combines Inception-V3 with the characteristics of ResNet. The optimization methods applied to each CNN model were RMSprop (Root Mean Square Propagation), Adam (Adaptive Moment Estimation), and SGD (Stochastic Gradient Descent). Adam is a popular algorithm used in many deep learning frameworks. SGD is the algorithm on which RMSprop and Adam are based, and when combined with Inception-V3 it detected malicious code with high performance.

For the malicious code inspection unit 160 to diagnose malicious code accurately, the machine learning algorithm is trained on image patterns converted from known malicious code. The learning unit 180 converts known malicious code into image patterns and inputs them to the machine learning algorithm of the malicious code inspection unit 160, so that the algorithm can learn the patterns of malicious code. The pattern data generated by the conversion unit 140 and the image-patterned malicious code generated by the learning unit 180 preferably have the same format.

Data input to the machine learning algorithm must conform to a preset format. For example, the machine learning algorithm may accept only images of 256×256 pixels. In this case, pattern data generated as a smaller image must be enlarged, and pattern data generated as a larger image must be compressed. The conversion unit 140 enlarges or compresses the imaged pattern data to match the input format of the machine learning algorithm.
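The enlargement or compression to a fixed input size can be illustrated with a small resizing routine. The description does not state which interpolation method is used, so nearest-neighbor sampling is assumed here purely for simplicity, and the function name is hypothetical.

```python
def resize_nearest(image: list[list[int]], out_h: int, out_w: int) -> list[list[int]]:
    """Resize a 2-D grayscale image by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        # Each output pixel copies the source pixel at the proportionally scaled position.
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Enlarging repeats pixels; shrinking discards them.
enlarged = resize_nearest([[1, 2], [3, 4]], 4, 4)
shrunk = resize_nearest([[1, 2, 3, 4], [5, 6, 7, 8],
                         [9, 10, 11, 12], [13, 14, 15, 16]], 2, 2)
```

In the shrinking case only 4 of the 16 source pixels survive, which illustrates why compressing a larger image loses more information.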

Here, the fact that the inspection area extraction unit 120 extracts only the data of the main part of the Android application becomes a major strength. If, as in the prior art, every area of the application or every DEX file is set as an inspection target and converted into imaged pattern data, the large data volume produces a correspondingly large image. The more a large image is compressed into a small one, the more information is lost. In this embodiment, the inspection area extraction unit 120 selectively extracts only the areas that may contain malicious code, so the image of the generated pattern data is much smaller than in the prior art even when the same application is diagnosed. Consequently, far less information is lost when the pattern data is compressed to a size that the machine learning algorithm accepts. Because the pattern of an image compressed with little loss is easier to compare with the malicious code patterns that the machine learning algorithm has already learned, the diagnosis rate of the malicious code inspection unit 160 can be significantly improved.

Meanwhile, referring to FIG. 8, the Android malware detection and classification system 100 according to the second embodiment of the present invention converts the data of the main part and the additional area into pattern data in a sound format and then diagnoses malicious code. In the second embodiment, the pattern data generated by the conversion unit 140 is sound format data, and the malicious code inspection unit 160 diagnoses whether the pattern data contains malicious code using a machine learning algorithm that is robust for dynamic data analysis.

Specifically, the conversion unit 140 converts the binary code of the data of the main part and the additional area into MIDI (Musical Instrument Digital Interface) format. It then converts the MIDI-format data again into wav format or MFCC (Mel-Frequency Cepstral Coefficients) format.

MIDI is a language for generating digital sound; it is not audio itself but records information about notes. MIDI data includes information such as the type of instrument, pitch, attenuation, the channel (which identifies an instrument among those played simultaneously), and Note ON/OFF.

One channel can be regarded as one instrument. When multiple channels are set, they can be regarded as several instruments playing together. Notes are set per channel, just as notes are written per instrument on a musical staff.

In one embodiment, the conversion unit 140 divides the binary code of the data of the main part and the additional area into 1-byte units and converts them into MIDI format.

Pitch is the frequency that determines how high or low a sound is, and it can be represented by a note.

In MIDI, note values range from 0 to 127, so a note can represent up to 7 bits of data. One byte, however, is 8 bits and can take values from 0 to 255, so a single MIDI note cannot represent one byte.

To solve this problem, the conversion unit 140 of the second embodiment assigns 2 of the 8 bits of each byte to a note on a first channel and the remaining 6 bits to a note on a second channel. The first channel is treated as an instrument that plays 4 notes, 0 to 3, and the second channel as an instrument that plays 64 notes, 0 to 63, with the two instruments sounding together in harmony. Splitting each byte in this way allows all 256 possible values of one byte to be carried at the moment a single pair of notes is played.

In addition, the conversion unit 140 of the second embodiment adds a weight to at least one of the channels before converting to MIDI format, to prevent the notes of the first and second channels from coinciding. That is, the range in which the first channel sets notes and the range in which the second channel sets notes do not overlap. For example, if the binary value of the first channel is 11 and the binary value of the second channel is 000011, the notes of the two channels fall on the same position, making the data difficult to distinguish. To solve this, this embodiment adds a weight of 24 to the binary value of the second channel before setting the note. The value 24 was chosen with the piano's maximum of 88 notes in mind: the 64 values expressible in 6 bits plus the offset of 24 give 88, so the second channel's notes occupy 24 to 87. Because of the added 24, the second channel never produces notes in the range 0 to 23. Therefore, any note appearing in the range 0 to 23 can be identified as a note of the first channel.

Furthermore, although the first channel expresses only four notes, its notes can be placed anywhere in the wide range 0 to 23, so a weight may be added to or multiplied with its binary value to make the notes easier to tell apart. In this embodiment, the binary value of the first channel is multiplied by 3. A binary value of 1 is set as note 3, 2 as note 6, and 3 as note 9, which makes the first channel's notes easier to identify.
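The two-channel split with the offset of 24 and the multiplier of 3 can be sketched as below. The description does not say which 2 of the 8 bits go to the first channel, so the two most significant bits are assumed here; the function and constant names are hypothetical, and the sketch produces note numbers only, not an actual MIDI file.

```python
CH2_OFFSET = 24  # weight added to the second channel (from the description)
CH1_FACTOR = 3   # multiplier applied to the first channel (from the description)


def byte_to_notes(value: int) -> tuple[int, int]:
    """Split one byte into a (first-channel note, second-channel note) pair."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    ch1_bits = value >> 6        # assumption: the 2 most significant bits
    ch2_bits = value & 0b111111  # the remaining 6 bits
    return ch1_bits * CH1_FACTOR, ch2_bits + CH2_OFFSET


def notes_to_byte(ch1_note: int, ch2_note: int) -> int:
    """Invert byte_to_notes to recover the original byte."""
    return ((ch1_note // CH1_FACTOR) << 6) | (ch2_note - CH2_OFFSET)
```

Under these assumptions the first channel only ever produces notes 0, 3, 6, and 9, the second channel produces notes 24 to 87, and every byte value can be recovered from its note pair, so the mapping loses no information.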

Referring to FIG. 8, the conversion unit 140 converts the data of the main part and the additional area, already converted to MIDI format, into wav format. MIDI contains note information but is not itself an audio file. Once converted to wav format, the data of the main part can be played back as audio.

A wav file converted at the common setting of 16 bits and 44100 Hz occupies about 87 KB per second. A higher sampling rate reproduces audio more precisely, but beyond a certain level the wav file becomes so large that audio analysis consumes excessive resources. Therefore, in this embodiment the sampling rate was set to 22050 Hz or less.
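The effect of the sampling rate on file size can be checked with simple arithmetic. The sketch below assumes uncompressed PCM audio with a single (mono) channel and ignores the file header; the channel count is not stated in the description, and the function name is hypothetical.

```python
def wav_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Uncompressed PCM data rate in bytes per second (header excluded)."""
    return sample_rate_hz * bits_per_sample // 8 * channels


rate_44k = wav_bytes_per_second(44100)  # 88,200 bytes/s, about 86 KiB
rate_22k = wav_bytes_per_second(22050)  # 44,100 bytes/s, half the volume
```

Under the mono assumption, 44100 Hz gives a figure close to the roughly 87 KB per second quoted in the text, and halving the sampling rate to 22050 Hz halves the data volume to be analyzed.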

In addition, the conversion unit 140 may convert the data of the main part and the additional area from MIDI format or wav format into MFCC (Mel-Frequency Cepstral Coefficients) format.

The malicious code inspection unit 160 uses a machine learning algorithm that is robust for dynamic data analysis to diagnose whether the sound-format pattern data contains malicious code. Because sound-format information produces different data along the time axis, a dynamic data analysis technique is preferable. Such algorithms include the recurrent neural network (RNN). This embodiment uses an RNN to analyze the sound-format pattern data.

A CNN cannot express temporal dependency, in which the current output is affected by past inputs. A recurrent neural network (RNN), by contrast, is suited to mathematically modeling time series patterns or sequence data. Time series data, measured at regular intervals and dependent on time, include stock prices, sales, price indices, exchange rates, and unemployment rates. Sequence data, in which order carries meaning, include text, speech, video, and the base pairs of DNA strands. An RNN is able to express temporal dependency, in which the current output is affected by past inputs. When an RNN is unrolled over time, its structure resembles that of a feed-forward network. RNNs are therefore widely used for analyzing speech, video, and language models.
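Temporal dependency can be illustrated with a one-unit recurrence of the standard Elman form, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). This is only a toy sketch: the weights are arbitrary illustrative values rather than trained parameters, the function name is hypothetical, and it is not the network used in the embodiment.

```python
import math


def rnn_final_state(inputs: list[float], w_x: float = 0.5,
                    w_h: float = 0.8, b: float = 0.0) -> float:
    """Run h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a sequence."""
    h = 0.0  # initial hidden state
    for x in inputs:
        # The previous state h feeds into the new state, so early inputs persist.
        h = math.tanh(w_x * x + w_h * h + b)
    return h


# The two sequences differ only in their first element.
with_early_input = rnn_final_state([1.0, 0.0, 0.0])
without_early_input = rnn_final_state([0.0, 0.0, 0.0])
```

The final states differ even though the last two inputs are identical, because the hidden state carries the effect of the first input forward in time.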

For the malicious code inspection unit 160 to diagnose malicious code accurately, the machine learning algorithm is trained on sound patterns converted from known malicious code. The learning unit 180 converts known malicious code into sound patterns and inputs them to the machine learning algorithm of the malicious code inspection unit 160, so that the algorithm can learn the patterns of malicious code.

The sound patterns of malicious code produced by the learning unit 180 preferably have the same format as the pattern data produced by the conversion unit 140.

The Android malware detection and classification system 100 was proposed based on the observation that malicious code has consistent patterns. Because it compares the pattern of the main area of the application under diagnosis with the patterns of malicious code, it can also detect concealed malicious code (hidden by obfuscation, packing, etc.) that conventional methods of examining the data code could not detect.

Next, an Android malicious app detection method according to an embodiment of the present invention is described with reference to FIG. 10.

The Android malicious app detection method according to an embodiment of the present invention includes: a step (S120) in which the inspection area extraction unit 120 extracts, from an Android application, data of a main part that may contain malicious code; a step (S140) in which the conversion unit 140 generates pattern data by patterning the data of the main part extracted by the inspection area extraction unit 120; and a step (S160) in which the malicious code inspection unit 160 diagnoses whether the pattern data contains malicious code by comparing the pattern data generated by the conversion unit 140 with existing malicious code patterns.

In addition, the Android malicious app detection method according to an embodiment of the present invention may be carried out with the functions described above for Android malicious app detection executed sequentially in correspondence with each step.
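The order of steps S120, S140, and S160 can be sketched as a simple pipeline. The function name, parameter names, and the stand-in stages in the usage lines are all hypothetical and serve only to show the data flow; real stages would be the extraction, conversion, and trained-model inference described above.

```python
from typing import Callable


def detect_malware(
    app_bytes: bytes,
    extract: Callable[[bytes], bytes],
    convert: Callable[[bytes], object],
    diagnose: Callable[[object], bool],
) -> bool:
    """Run steps S120, S140 and S160 in order and return the diagnosis."""
    main_part = extract(app_bytes)     # S120: extract the main part
    pattern_data = convert(main_part)  # S140: pattern the extracted data
    return diagnose(pattern_data)      # S160: diagnose the pattern data


# Stand-in stages for illustration only; they perform no real analysis.
result = detect_malware(
    b"\x00\x01\x02\x03",
    extract=lambda data: data[:2],
    convert=list,
    diagnose=lambda pattern: sum(pattern) > 0,
)
```

Passing the stages as callables keeps the image-based and sound-based embodiments interchangeable: only `convert` and `diagnose` change between them.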

Preferred embodiments of the present invention have been described above, but the present invention may employ various changes, modifications, and equivalents. It is clear that the present invention can be applied in the same manner by suitably modifying the above embodiments. Accordingly, the above description does not limit the scope of the present invention, which is defined by the limits of the following claims.

100: Android malware detection and classification system
110: input unit
120: inspection area extraction unit
140: conversion unit
160: malicious code inspection unit
180: learning unit

Claims

An Android malware detection and classification system comprising:
an inspection area extraction unit that selectively extracts, from an Android application, data of a main part that may contain malicious code;
a conversion unit that patterns the main part extracted by the inspection area extraction unit to generate pattern data; and
a malicious code inspection unit that diagnoses whether the pattern data contains malicious code by inputting the pattern data generated by the conversion unit into a machine learning algorithm that has learned existing malicious code patterns.

The system of claim 1,
wherein the pattern data generated by the conversion unit is image format data.

The system of claim 2,
wherein the conversion unit loads the main part in the form of binary code, divides it into predetermined units, and converts each divided unit of binary code into a corresponding brightness or color.

The system of claim 2,
wherein the conversion unit enlarges or compresses the image-format pattern data to match the data format input to the machine learning algorithm of the malicious code inspection unit.

The system of claim 1,
wherein the pattern data generated by the conversion unit is sound format data.

The system of claim 5,
wherein the conversion unit converts the binary code of the main part into MIDI format and then converts the MIDI-format data into wav format or MFCC (Mel-Frequency Cepstral Coefficients) format.

The system of claim 6,
wherein, when converting the main part into MIDI format, the conversion unit divides the main part into 1-byte units and splits each byte into a first channel composed of 2 bits and a second channel composed of 6 bits.

The system of claim 7,
wherein the conversion unit adds a weight to at least one of the channels so that the notes of the first channel and the second channel do not overlap, and then converts the result into MIDI format.

The system of claim 1,
wherein the main part includes a code-item, representing executable code, in the DEX file of the Android application.

The system of claim 9,
wherein the inspection area extraction unit extracts the executable code portion of the Data section and an additional area corresponding to the programming language of the Android application.

The system of claim 10, wherein the additional area is:
a file with the so extension if the programming language is C or C++;
a dll file including Assembly-CSharp.dll or App.dll if the programming language is C#;
a dll file including System.dll and System.core.dll if the programming language is .NET libraries;
an index.html file if the programming language is HTML; and
a js file including index.js if the programming language is JavaScript.

The system of claim 9,
wherein the inspection area extraction unit extracts the executable code portion of the Data section and an additional area corresponding to the Google Play category of the Android application, and
the malicious code inspection unit performs a diagnosis corresponding to the category of the application.

The system of claim 9,
wherein the inspection area extraction unit extracts the executable code portion of the Data section and an additional area corresponding to the malware family of the malicious code that the malicious code inspection unit is to diagnose.

Extracting, by an inspection area extraction unit, data of a main part of an Android application that may contain malicious code;
generating, by a conversion unit, pattern data by patterning the data of the main part extracted by the inspection area extraction unit; and
diagnosing, by a malicious code inspection unit, whether the pattern data generated by the conversion unit contains malicious code by comparing it with existing malicious code patterns.