WO2022097898A1 - Malware detection model training method and malware detection method - Google Patents

Malware detection model training method and malware detection method

Info

Publication number
WO2022097898A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
app
native
algorithm
malicious code
Application number
PCT/KR2021/012224
Other languages
French (fr)
Korean (ko)
Inventor
박우길
Original Assignee
영남대학교 산학협력단
Application filed by 영남대학교 산학협력단
Publication of WO2022097898A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present invention relates to a method for training a malicious code detection model and a method for detecting malicious code using a previously trained malicious code detection model.
  • The pattern matching technique detects malicious code that shows the same pattern with high accuracy and speed, but detection can be evaded when the code is modified or a concealment technique is used.
  • Methods of evading or bypassing malicious code detection include obfuscating the malicious code and using native code.
  • Code obfuscation removes recognizable code patterns, for example by reordering code, adding meaningless code, and abbreviating symbol information, while preserving the function of the Java bytecode.
  • When malicious code is obfuscated, it is difficult to detect with the conventional pattern matching technique.
  • Using native code means implementing the code that carries the malicious code as native code rather than Java bytecode, which makes it difficult to detect with a conventional malware scanner that mainly searches Java bytecode.
  • The main purpose is to provide a method for detecting malicious code, and a method for training a malicious code detection model, in which the code of an app is converted into native code, instruction code pairs are extracted from the native code, and features common to a plurality of feature extraction algorithms are extracted based on the instruction code pairs.
  • Converting Java bytecode included in the app into native code;
  • extracting instruction code pairs, each a pair of consecutive instruction codes, based on a code segment extracted from all native code of the app; obtaining, from the instruction code pairs, the features commonly extracted by two or more feature extraction algorithms; and obtaining a result of malicious code detection by using the obtained features as input data of the malicious code detection model.
  • A computer program stored in a computer-readable recording medium to execute each process of the above-described malicious code detection method.
  • Malicious code is detected by converting an app's code into native code, extracting instruction code pairs from the native code, and extracting common features from a plurality of feature extraction algorithms based on the instruction code pairs.
  • FIG. 1 is a flowchart illustrating a malicious code detection method and a learning method of a malicious code detection model according to an embodiment of the present disclosure.
  • FIG. 2 is a detailed flowchart of a learning method of a malicious code detection model according to an embodiment of the present disclosure.
  • FIGS. 3A and 3B are charts showing the performance of malicious code detection using a malicious code detection model according to an embodiment of the present disclosure.
  • FIG. 1 is a flowchart illustrating a method for detecting a malicious code and a method for learning a malicious code detection model according to an embodiment of the present disclosure.
  • The Java bytecode of an app (e.g., an Android app) is converted into native code (S100).
  • The conversion to native code may be performed on the Java bytecode using a conversion tool such as the Dalvik VM (Dalvik Virtual Machine) or an ART (Android Runtime) tool. As a result, the entire code of the app becomes native code, so malicious code written in Java bytecode as well as malicious code written in native code can be detected with the malicious code detection model of the present disclosure alone.
  • Pairs of two consecutive instruction codes are extracted from the code segment as features (S104).
  • Such an instruction code may include all or part of an operation code (OP-code), a mode, and an address field of an operand.
  • For example, simple arithmetic operation codes may be removed.
  • An example of extracting instruction code pairs is as follows. If instruction codes A, B, C, A, and D appear in that order in a code segment, and C is a commonly used instruction code that is to be removed, the instruction code pairs finally extracted as features are (A, B) and (A, D). Here, (A, B) and (B, A) are treated as the same instruction code pair.
  • The number of instruction codes extracted as one feature is not necessarily limited to two. However, the more instruction codes a feature contains, the more it is affected by code obfuscation, so the number of codes needs to be limited in order to minimize the effect of obfuscation while obtaining high detection performance.
  • Using a plurality of feature extraction algorithms, features are extracted from the instruction code pairs extracted in step S104, and the features commonly extracted by every algorithm are obtained as input data of the malicious code detection model (S106).
  • Conventionally, all the features extracted by each feature extraction algorithm were used to detect malicious code, but this increases the amount of computation excessively and lowers the computation speed.
  • Feature extraction algorithms include the Pearson correlation, mutual information, Kendall correlation, Spearman correlation, chi-squared, and Fisher score algorithms.
  • In step S106, two or more of these known algorithms may be used to obtain common features.
  • Preferably, the algorithms extract features in parallel.
  • Such feature acquisition may be performed by extracting a preset number of features having high importance from each algorithm, and selecting only the features common to all of the extracted sets.
  • Using the features obtained in step S106 as input data, malicious code detection is performed on the app with the pre-trained malicious code detection model (S108).
  • Steps S100 to S108 are preferably performed before a device capable of executing the app, such as a terminal, executes the app, for example at the same time as the app is downloaded to the device.
  • Alternatively, when the app belongs to a labeled dataset, step S108 trains the malicious code detection model by machine learning based on the features obtained in step S106.
  • This malicious code detection model receives the features extracted by preprocessing the app's code and classifies the app.
  • Any machine learning algorithm may be employed in the present disclosure as long as a person skilled in the art can readily adopt it based on classification accuracy and classification speed.
  • FIG. 2 is a flowchart illustrating a method for learning a malicious code detection model according to an embodiment of the present disclosure.
  • The Java bytecode of the apk files of the apps in the dataset, the dex files of the apk files, or the OAT (Optimized Ahead of Time) files of the dex files is converted into native code (S200).
  • Such conversion may be performed, for example, using the dex2oat tool, which converts the Java bytecode of dex files into native code.
  • When native code is found in the file before conversion (S210), the shared object files of the app are extracted and merged with the converted native code so that this native code is also used for feature extraction (S212).
  • After step S210 and/or step S212, the native code of the app is disassembled (S220), which yields the assembly code of the app.
  • Pairs of consecutive instruction codes are extracted from the code segment of the assembly code (S230).
  • Training data is extracted based on the extracted features to train a decision-tree-based malicious code detection model (S250).
  • Such training may be performed either as supervised learning, in which the extracted features are learned from a data set labeled with whether or not malicious code is included, or as unsupervised learning on a data set that does not include malicious code.
  • The malicious code detection model is verified by extracting test data based on the extracted features (S260), and its performance is evaluated (S270).
  • FIGS. 3A and 3B are charts showing the performance of malicious code detection using a malicious code detection model according to an embodiment of the present disclosure.
  • FIG. 3A (a) shows the data sets used to evaluate the malicious code detection performance of the present disclosure.
  • A data set that is not code-obfuscated (Un-obfuscated in FIG. 3A (a)).
  • A code-obfuscated data set (Obfuscated in FIG. 3A (a)).
  • A data set in which the app's code consists only of native code.
  • A data set (Mixed in FIG. 3A (a)) in which native code, obfuscated code, and un-obfuscated code are mixed.
  • Each data set consists of apps containing malicious code and apps without malicious code.
  • Malicious code detection tools based on algorithms including Adagio, MUDFLOW, and DroidNative were used together.
  • The Adagio algorithm supports neither detection of code-obfuscated malware nor detection of malware composed of native code.
  • DroidNative is an algorithm that supports only detection of native-code-based malware.
  • The DroidSieve algorithm supports detection of code-obfuscated malware and only partially supports detection of malware composed of native code.
  • FIG. 3A (b) shows the results of training and verification based on the un-obfuscated data set.
  • FIG. 3A (c) shows the results of training and verification based on the code-obfuscated data set.
  • FIG. 3B (d) shows the results of training and verification based on the data set consisting only of native code.
  • FIG. 3B (e) shows the results of training and verification based on the data set in which native code, obfuscated code, and un-obfuscated code are mixed.
  • The malicious code detection method of the present disclosure is shown as "Proposed method (all segments w/ feature selection)". The case in which the feature selection process of the present disclosure (step S106 of FIG. 1, step S240 of FIG. 2) was not applied, shown as "Proposed method (all segments w/o feature selection)" in (b) to (e) of FIGS. 3A and 3B, was also evaluated.
  • The detection rate is always excellent when the malicious code detection method of the present disclosure is used, compared with when other methods are used.
  • The average run time per sample is about 0.4 times longer than when the malicious code detection method without the feature selection process is applied, which is expected because an additional selection stage is carried out.
  • With the malicious code detection method of the present disclosure, the execution time is significantly reduced compared with the cases where the Adagio, DroidNative, and DroidSieve algorithms are applied.
  • The detection rate is an excellent 98.3% when the malicious code detection method of the present disclosure is applied.
  • This detection rate is 1% lower than that of the DroidSieve algorithm, which has the best detection rate, but the average execution time per sample of the malicious code detection method of the present disclosure is only about 13% of the execution time of the DroidSieve algorithm. The method therefore significantly reduces computation time while maintaining an excellent detection rate compared with existing algorithms.
  • Various implementations of the devices, units, processes, steps, etc., described herein include digital electronic circuits, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or a combination thereof. These various implementations may include being implemented in one or more computer programs executable on a programmable system.
  • The programmable system includes at least one programmable processor (which may be a special-purpose processor or a general-purpose processor) coupled to receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • Computer programs (also known as programs, software, software applications, or code)
  • The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. These computer-readable recording media may be non-volatile or non-transitory media such as ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, and storage device, and may further include a transitory medium such as a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems, and the computer-readable code may be stored and executed in a distributed manner.
  • the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, or other types of storage systems or combinations thereof), and at least one communication interface.
  • The programmable computer may be one of a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Stored Programmes (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides a malware detection model training method and a malware detection method. According to one aspect of the present disclosure, provided is a method for training a malware detection model and a method for detecting malware by converting code of an app into native code, extracting a pair of instruction codes from the native code, and extracting a common feature from a plurality of feature extraction algorithms on the basis of the pair of instruction codes.

Description

Malicious code detection model training method and malicious code detection method
The present invention relates to a method for training a malicious code detection model and a method for detecting malicious code using a previously trained malicious code detection model.
The content described in this section merely provides background information for the present embodiment and does not constitute prior art.
Since an Android app can be disassembled, malware can easily be embedded in it. Various malicious code detection methods have been proposed to solve this problem. One such method is the pattern matching technique. Pattern matching detects malicious code that shows the same pattern with high accuracy and fast search speed, but it has the disadvantage that detection can be evaded when the code is modified or a concealment technique is used.
Methods of evading or bypassing malicious code detection include obfuscating the malicious code (code obfuscation) and using native code. Code obfuscation removes recognizable code patterns, for example by reordering code, adding meaningless code, and abbreviating symbol information, while preserving the function of the Java bytecode. When malicious code is obfuscated, it is difficult to detect with the conventional pattern matching technique.
Using native code means implementing the code that carries the malicious code as native code rather than Java bytecode, which makes it difficult to detect with a conventional malware scanner that mainly searches Java bytecode.
Therefore, a malicious code detection scheme is needed that takes into account code obfuscation and the characteristics of Android apps, which contain both Java bytecode and native code.
According to one aspect of the present disclosure, the main purpose is to provide a method for detecting malicious code, and a method for training a malicious code detection model, in which the code of an app is converted into native code, instruction code pairs are extracted from the native code, and features common to a plurality of feature extraction algorithms are extracted based on the instruction code pairs.
According to one aspect of the present disclosure, there is provided a method of detecting malicious code in an app using a pre-trained malicious code detection model, the method comprising: converting Java bytecode included in the app into native code; extracting instruction code pairs, each a pair of consecutive instruction codes, based on a code segment extracted from all native code of the app; obtaining, from the instruction code pairs, the features commonly extracted by two or more feature extraction algorithms; and obtaining a result of malicious code detection by using the obtained features as input data of the malicious code detection model.
According to another aspect of the present disclosure, there is provided a computer program stored in a computer-readable recording medium to execute each process of the above-described malicious code detection method.
According to still another aspect of the present disclosure, there is provided a method of training a malware detection model for detecting malware in an app, the method comprising: converting Java bytecode included in each app of a dataset into native code; extracting instruction code pairs, each a pair of consecutive instruction codes, from a code segment extracted from all native code of the app; obtaining, from the instruction code pairs, the features commonly extracted by two or more feature extraction algorithms; and training the malicious code detection model by machine learning based on the obtained features.
According to one aspect of the present disclosure, by providing a method for detecting malicious code and a method for training a malicious code detection model, in which an app's code is converted into native code, instruction code pairs are extracted from the native code, and features common to a plurality of feature extraction algorithms are extracted based on the instruction code pairs, obfuscated malicious code and malicious code implemented as native code can be detected quickly.
FIG. 1 is a flowchart illustrating a malicious code detection method and a training method of a malicious code detection model according to an embodiment of the present disclosure.
FIG. 2 is a detailed flowchart of a training method of a malicious code detection model according to an embodiment of the present disclosure.
FIGS. 3A and 3B are charts showing the performance of malicious code detection using a malicious code detection model according to an embodiment of the present disclosure.
Hereinafter, some embodiments of the present disclosure are described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the present disclosure, a detailed description of a related known configuration or function is omitted when it is judged that it may obscure the gist of the present disclosure.
In describing the components of the present disclosure, terms such as first and second may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'have' a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as 'unit' and 'module' in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.
The detailed description set forth below in conjunction with the appended drawings is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure may be practiced.
FIG. 1 is a flowchart illustrating a malicious code detection method and a malicious code detection model training method according to an embodiment of the present disclosure.
The Java bytecode of an app (e.g., an Android app) is converted into native code (S100). The conversion to native code may be performed on the Java bytecode using a conversion tool such as the Dalvik VM (Dalvik Virtual Machine) or an ART (Android Runtime) tool. As a result, the entire code of the app becomes native code, so malicious code written in Java bytecode as well as malicious code written in native code can be detected with the malicious code detection model of the present disclosure alone.
So that features can be extracted without excessive data processing, only the code segment of the converted native code of the app is kept, and the remaining resources, data, and the like are removed (S102). For native code that existed in the app before the conversion, the native code of the app's shared library files may be extracted and merged with the converted native code before step S102 is performed.
Pairs of two consecutive instruction codes are extracted from the code segment as features (S104). Such an instruction code may include all or part of an operation code (OP-code), a mode, and an address field of an operand. Instruction codes that are common to both malicious code and app code and are generally used are preferably excluded from extraction, or removed before the final feature extraction. For example, simple arithmetic operation codes may be removed. Excluding these common, generally used instruction codes reduces the computation time of the later feature acquisition using feature extraction algorithms.
An example of extracting instruction code pairs is as follows. If instruction codes A, B, C, A, and D appear in that order in a code segment, and C is a commonly used instruction code that is to be removed, the instruction code pairs finally extracted as features are (A, B) and (A, D). Here, (A, B) and (B, A) are treated as the same instruction code pair.
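The pair extraction in this example can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_pairs` and the `excluded` parameter are hypothetical, and the sketch assumes that excluded codes are dropped before adjacent codes are paired (pairing first and then discarding pairs that contain C gives the same result for this example).

```python
from typing import Iterable, Set, Tuple


def extract_pairs(instructions: Iterable[str], excluded: Set[str]) -> Set[Tuple[str, ...]]:
    """Return the set of unordered pairs of consecutive instruction codes."""
    # Assumption: commonly used codes are removed before pairing.
    kept = [code for code in instructions if code not in excluded]
    pairs = set()
    for first, second in zip(kept, kept[1:]):
        # Sort each pair so that (A, B) and (B, A) become the same feature.
        pairs.add(tuple(sorted((first, second))))
    return pairs


pairs = extract_pairs(["A", "B", "C", "A", "D"], excluded={"C"})
print(sorted(pairs))  # [('A', 'B'), ('A', 'D')]
```

After C is dropped the sequence is A, B, A, D, so the adjacent pairs are (A, B), (B, A), and (A, D); the first two collapse into one unordered pair.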
The number of instruction codes extracted as one feature is not necessarily limited to two. However, the more instruction codes a feature contains, the more it is affected by code obfuscation, so the number of codes needs to be limited in order to minimize the effect of obfuscation while obtaining high detection performance.
Using a plurality of feature extraction algorithms, features are extracted from the instruction code pairs extracted in step S104, and the features commonly extracted by every algorithm are obtained as input data of the malicious code detection model (S106). Conventionally, all the features extracted by each feature extraction algorithm were used for malicious code detection, but this increases the amount of computation excessively and lowers the computation speed.
Feature extraction algorithms include the Pearson correlation, mutual information, Kendall correlation, Spearman correlation, chi-squared, and Fisher score algorithms. In step S106, two or more of these known algorithms may be used to obtain common features. The feature extraction by each algorithm is preferably performed in parallel.
Such feature acquisition may be performed by extracting a preset number of features having high importance from each algorithm, and selecting only the features common to all of the extracted sets.
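The selection described for step S106 (take the top-ranked features from each algorithm, then keep only those that every algorithm selected) can be sketched as below. The sketch assumes each algorithm has already produced an importance score per feature; the function names and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, List, Set


def top_k(scores: Dict[Hashable, float], k: int) -> Set[Hashable]:
    """Return the k features with the highest importance scores."""
    ranked = sorted(scores, key=lambda feature: scores[feature], reverse=True)
    return set(ranked[:k])


def common_features(score_tables: List[Dict[Hashable, float]], k: int) -> Set[Hashable]:
    """Keep only the features that appear in every algorithm's top-k set."""
    if not score_tables:
        return set()
    selected = [top_k(table, k) for table in score_tables]
    return set.intersection(*selected)


# Hypothetical importance scores from two scoring algorithms.
pearson = {("A", "B"): 0.9, ("A", "D"): 0.7, ("B", "D"): 0.2, ("D", "E"): 0.1}
chi2 = {("A", "B"): 12.0, ("A", "D"): 1.5, ("B", "D"): 8.0, ("D", "E"): 0.3}

print(common_features([pearson, chi2], k=2))  # {('A', 'B')}
```

Because each table is ranked independently, the per-algorithm ranking could run in parallel, as the description prefers; the intersection is the only step that needs all results.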
S106 단계에서 획득한 피처를 기초로 악성코드 탐지모델의 입력 데이터로 하여, 기 학습된 악성코드 탐지모델로부터 앱의 악성코드 탐지를 수행한다(S108). S100 단계 내지 S108 단계는, 앱을 실행할 수 있는 단말 등의 장치가 앱을 실행하기 전, 예컨대 장치에 앱이 다운로드됨과 동시에 수행됨이 바람직하다.Based on the features acquired in step S106 as input data of the malicious code detection model, the malicious code detection of the app is performed from the pre-learned malicious code detection model (S108). Steps S100 to S108 are preferably performed before the device, such as a terminal capable of executing the app, executes the app, for example, at the same time the app is downloaded to the device.
Alternatively, when the app of step S100 belongs to a dataset and is labeled with, for example, whether malicious code is present and where it is located so that the malicious code detection model can be trained, step S108 trains the model by machine learning on the features obtained in step S106. The resulting model receives the features extracted by preprocessing the app's code and classifies the app. Any machine learning algorithm that a person skilled in the art could readily adopt on the basis of classification accuracy and classification speed may be used as the machine learning algorithm of the present disclosure.
FIG. 2 is a flowchart detailing a method for training a malicious code detection model according to an embodiment of the present disclosure.
For every app in the dataset, the Java bytecode in the app's apk file, in the dex file of the apk file, or in the OAT (Optimized Ahead of Time) file of the dex file is converted into native code (S200). This conversion may be performed with, for example, dex2oat, a tool that converts the Java bytecode of a dex file.
If it is determined that the file already contained native code before conversion (S210), the app's shared object files are extracted and merged with the converted native code so that the existing native code is also used for feature extraction (S212).
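As a rough illustration of step S212, the sketch below pulls the bundled shared objects out of an apk and pools them with already converted native code. It relies on two assumptions that are not stated in the text: that the apk is read as an ordinary zip archive with its shared objects stored under `lib/<abi>/`, and that "merging" simply means collecting both kinds of binaries in one mapping for later disassembly. The function names are hypothetical.

```python
import io
import zipfile
from typing import Dict


def extract_shared_objects(apk_bytes: bytes) -> Dict[str, bytes]:
    """Return {archive path: contents} for every lib/<abi>/*.so entry of an apk."""
    found: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name.startswith("lib/") and name.endswith(".so"):
                found[name] = apk.read(name)
    return found


def merge_native_code(converted: Dict[str, bytes],
                      bundled: Dict[str, bytes]) -> Dict[str, bytes]:
    """Pool converted native code and bundled shared objects (assumed meaning of 'merge')."""
    merged = dict(converted)
    merged.update(bundled)
    return merged
```

An app with no `lib/` entries yields an empty mapping, which corresponds to the branch where step S212 is skipped.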
After step S210 and/or step S212, the app's native code is disassembled (S220), which yields the app's assembly code.
Pairs of consecutive instruction codes are extracted from the code segment of the assembly code (S230).
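Step S230 amounts to counting adjacent opcode pairs. The sketch below assumes the disassembler output has already been reduced to a flat list of opcode mnemonics, which is an assumption made for illustration. The optional `excluded` set mirrors the idea in claim 4 of leaving out preset instruction codes that benign apps and malicious code share; dropping every pair that contains such a code is only one possible reading of that claim.

```python
from collections import Counter
from typing import FrozenSet, Sequence, Tuple


def instruction_pairs(opcodes: Sequence[str],
                      excluded: FrozenSet[str] = frozenset()
                      ) -> "Counter[Tuple[str, str]]":
    """Count each pair of consecutive opcodes, skipping pairs with an excluded opcode."""
    pairs: "Counter[Tuple[str, str]]" = Counter()
    for first, second in zip(opcodes, opcodes[1:]):
        if first in excluded or second in excluded:
            continue
        pairs[(first, second)] += 1
    return pairs


# Hypothetical opcode stream from one code segment.
stream = ["push", "mov", "ldr", "mov", "ldr", "bl"]
counts = instruction_pairs(stream)
filtered = instruction_pairs(stream, excluded=frozenset({"push"}))
```

The resulting counts per app form the feature columns that the feature extraction algorithms then score.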
With the instruction code pairs as input, all or some of the Pearson correlation, mutual information, Kendall correlation, Spearman correlation, chi-squared, and Fisher score algorithms are run to extract the features on which the malicious code detection model will be trained (S240). These algorithms are only examples: feature extraction is not limited to the algorithms named in step S240, and any algorithm that a person skilled in the art could readily adopt for feature extraction may be used as a feature extraction algorithm of the present disclosure.
Training data is drawn from the extracted features and used to train a decision tree-based malicious code detection model (S250). The training may be supervised learning on a dataset in which the extracted features are labeled as to whether malicious code is included, or unsupervised learning on a dataset that carries no such labels.
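To make the supervised case concrete, the sketch below trains the smallest possible decision tree, a single split (a "stump") chosen by Gini impurity, on labeled feature vectors. It is a teaching aid under stated assumptions, not the model of the disclosure: a practical embodiment would grow a full tree with an established library, and the feature rows and labels here are hypothetical.

```python
from typing import Sequence, Tuple

Stump = Tuple[int, float, int, int]  # (feature index, threshold, left label, right label)


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels: Sequence[int]) -> int:
    """Most common binary label; ties resolve to 1 (malicious)."""
    return int(2 * sum(labels) >= len(labels))


def train_stump(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> Stump:
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold, majority(left), majority(right))
    if best is None:
        raise ValueError("no valid split: every feature is constant")
    return best[1:]


def predict(stump: Stump, row: Sequence[float]) -> int:
    """Classify one feature vector: 1 = malicious, 0 = benign."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


# Hypothetical training set: two selected features per app, 1 = malicious.
rows = [[5, 2], [6, 7], [0, 3], [1, 6]]
labels = [1, 1, 0, 0]
stump = train_stump(rows, labels)
```

On this toy data the first feature separates the classes perfectly at a threshold of 1, so the stump classifies any app whose first feature exceeds 1 as malicious.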
Test data is drawn from the extracted features and used to validate the malicious code detection model (S260) and evaluate its performance (S270).
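The evaluation in this document is reported as a detection rate and an average run time per sample. A minimal sketch of how such figures could be computed is given below. The text does not define the detection rate, so the sketch assumes it is the fraction of malicious apps that are flagged (the true positive rate); the function names and the classifier callable are hypothetical.

```python
import time
from typing import Callable, Dict, Sequence


def detection_metrics(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, float]:
    """Detection rate (assumed: true positive rate) and false positive rate; 1 = malicious."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return {
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def average_run_time(classify: Callable[[object], int], samples: Sequence[object]) -> float:
    """Mean wall-clock seconds spent classifying one sample."""
    start = time.perf_counter()
    for sample in samples:
        classify(sample)
    return (time.perf_counter() - start) / len(samples)
```

Measuring both quantities on the same held-out test set keeps the accuracy and speed figures comparable across detection methods.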
Although FIGS. 1 and 2 describe the steps as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present disclosure. In other words, a person of ordinary skill in the art to which an embodiment of the present disclosure pertains could, without departing from its essential characteristics, change the order shown in FIGS. 1 and 2 or execute one or more of the steps in parallel, so the method is not limited to the chronological order of FIGS. 1 and 2.
FIGS. 3A and 3B are tables showing the performance of malicious code detection using the malicious code detection model according to an embodiment of the present disclosure.
Part (a) of FIG. 3A is a table showing the datasets used to evaluate the malicious code detection performance of the present disclosure. The evaluation used four datasets: one that is not code-obfuscated ("Un-obfuscated" in (a) of FIG. 3A), one that is code-obfuscated ("Obfuscated"), one in which the app code consists only of native code ("Native"), and one in which native code, obfuscated code, and un-obfuscated code are mixed ("Mixed"). Each dataset consists of apps that contain malicious code and apps that do not.
To provide a baseline for the performance of the malicious code detection model of the present disclosure, existing malicious code detection tools based on the Adagio, MUDFLOW, and DroidNative algorithms were also used. Here, Adagio does not support detection of code-obfuscated malicious code or of malicious code composed of native code; DroidNative supports only detection of malicious code composed of native code; and DroidSieve supports detection of code-obfuscated malicious code but only partially supports detection of malicious code composed of native code.
Part (b) of FIG. 3A shows the results of training and validation on the un-obfuscated dataset, part (c) of FIG. 3A the results on the code-obfuscated dataset, part (d) of FIG. 3B the results on the dataset consisting only of native code, and part (e) of FIG. 3B the results on the dataset in which native code, obfuscated code, and un-obfuscated code are mixed.
In parts (b) to (e) of FIGS. 3A and 3B, the malicious code detection method of the present disclosure is labeled "Proposed method(all segments w/ feature selection)". The performance obtained without the feature selection process of the present disclosure (step S106 of FIG. 1, step S240 of FIG. 2) was also evaluated and is labeled "Proposed method(all segments w/o feature selection)".
Referring to (b) and (c) of FIG. 3A, the malicious code detection method of the present disclosure always achieves a higher detection rate than the other methods. In the evaluation on the code-obfuscated dataset, the average run time per sample is about 0.4 times longer than when the detection method is applied without the feature selection process, but this is a natural consequence of performing the additional feature selection step. Meanwhile, the method of the present disclosure shows a marked reduction in run time compared with the Adagio, DroidNative, and DroidSieve algorithms.
Referring to (d) of FIG. 3B, when performance is evaluated on the code-obfuscated dataset, the malicious code detection method of the present disclosure achieves the highest detection rate, and its run time is markedly lower than that of the Adagio, DroidNative, and DroidSieve algorithms.
Referring to (e) of FIG. 3B, when performance is evaluated on the mixed dataset, the malicious code detection method of the present disclosure achieves a detection rate of 98.3%. This is 1% lower than that of DroidSieve, which has the best detection rate, but given that the average run time per sample of the present method is only about 13% of DroidSieve's, the method markedly reduces computation time while maintaining a detection rate that compares well with existing algorithms.
The various implementations of the devices, units, processes, steps, and the like described herein may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation as one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special purpose processor or a general purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for the programmable processor and are stored on a "computer-readable recording medium".
The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Such a medium may be a non-volatile or non-transitory medium such as a ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, or storage device, and may further include a transitory medium such as a data transmission medium. In addition, the computer-readable recording medium may be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.
The various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another kind of storage system, or a combination of these), and at least one communication interface. For example, the programmable computer may be a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.
The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the embodiment pertains could make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent to them should be interpreted as falling within the scope of rights of the present embodiment.

Claims (10)

  1. A method of detecting malicious code in an app using a pre-trained malicious code detection model, the method comprising:
    converting Java bytecode included in the app into native code;
    extracting instruction code pairs, each being a pair of consecutive instruction codes, based on a code segment extracted from all native code of the app;
    obtaining, from the instruction code pairs and using two or more feature extraction algorithms, features commonly extracted by each of the feature extraction algorithms; and
    obtaining a result of performing malicious code detection with the obtained features as input data of the malicious code detection model.
  2. The method of claim 1, further comprising:
    when the app originally includes native code, merging, after the converting into native code, the converted native code with the native code included in a shared library of the app.
  3. The method of claim 1,
    wherein the code segment is a code segment of assembly code obtained by disassembling all native code of the app.
  4. The method of claim 1,
    wherein the instruction code pairs do not include preset instruction codes that are commonly used in both apps and malicious code.
  5. The method of claim 1,
    wherein the obtaining of the features comprises
    extracting a predetermined number of the features extracted by each of the feature extraction algorithms on the basis of importance, and obtaining the features that are common among the extracted features.
  6. The method of claim 1,
    wherein at least one of the two or more feature extraction algorithms is one of the Pearson correlation, mutual information, Kendall correlation, Spearman correlation, chi-squared, and Fisher score algorithms.
  7. A computer program stored in a computer-readable recording medium for executing each step of the malicious code detection method according to any one of claims 1 to 6.
  8. A method of training a malicious code detection model for detecting malicious code (malware) in an app, the method comprising:
    converting Java bytecode included in each app of a dataset into native code;
    extracting instruction code pairs, each being a pair of consecutive instruction codes, from a code segment extracted from all native code of the app;
    obtaining, from the instruction code pairs and using two or more feature extraction algorithms, features commonly extracted by each of the feature extraction algorithms; and
    training the malicious code detection model by machine learning based on the obtained features.
  9. The method of claim 8, further comprising:
    when the apps of the dataset include an app that originally includes native code, merging, after the converting into native code, the converted native code with the native code included in a shared library of the app that originally includes native code.
  10. The method of claim 8,
    wherein the obtaining of the features comprises
    extracting a predetermined number of the features extracted by each of the feature extraction algorithms on the basis of importance, and obtaining the features that are common among the extracted features.
PCT/KR2021/012224 2020-11-04 2021-09-08 Malware detection model training method and malware detection method WO2022097898A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020200145915A KR102434899B1 (en) 2020-11-04 2020-11-04 Method for Training Malware Detection Model And Method for Detecting Malware
KR10-2020-0145915 2020-11-04

Publications (1)

Publication Number Publication Date
WO2022097898A1 true WO2022097898A1 (en) 2022-05-12

Family

ID=81457915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/012224 WO2022097898A1 (en) 2020-11-04 2021-09-08 Malware detection model training method and malware detection method

Country Status (2)

Country Link
KR (1) KR102434899B1 (en)
WO (1) WO2022097898A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230160562A (en) 2022-05-17 2023-11-24 주식회사 엘지에너지솔루션 Electrode sheet drying device and electrode manufacturing system using same

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150205626A1 (en) * 2011-05-12 2015-07-23 Microsoft Technology Licensing, Llc Emulating mixed-code programs using a virtual machine instance
KR20170087007A (en) * 2016-01-19 2017-07-27 삼성전자주식회사 Electronic Apparatus for detecting Malware and Method thereof
KR20180001878A (en) * 2016-06-28 2018-01-05 삼성전자주식회사 Method for detecting the tampering of application code and electronic device supporting the same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAHID ALAM; ZHENGYANG QU; RYAN RILEY; YAN CHEN; VAIBHAV RASTOGI: "DroidNative: Semantic-Based Detection of Android Native Code Malware", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 February 2016 (2016-02-15), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080683703 *
SIMEN RUNE BRAGEN: "Malware detection through opcode sequence analysis using machine learning", MASTER'S THESIS, GJøVIK UNIVERSITY COLLEGE, 1 January 2015 (2015-01-01), Gjøvik University College, , XP055712359, Retrieved from the Internet <URL:https://pdfs.semanticscholar.org/62f7/96c19ffa2ee70fc5ee7aec0fe41fae26f191.pdf> [retrieved on 20200707] *

Also Published As

Publication number Publication date
KR102434899B1 (en) 2022-08-23
KR20220060203A (en) 2022-05-11

Similar Documents

Publication Publication Date Title
US10277617B2 (en) Method and device for feature extraction
EP2955658B1 (en) System and methods for detecting harmful files of different formats
US9349006B2 (en) Method and device for program identification based on machine learning
CN109359439B (en) software detection method, device, equipment and storage medium
RU2614557C2 (en) System and method for detecting malicious files on mobile devices
US20180365420A1 (en) System and method of detecting malicious files with the use of elements of static analysis
EP3159823A1 (en) Vulnerability detection device, vulnerability detection method, and vulnerability detection program
CN111639337B (en) Unknown malicious code detection method and system for massive Windows software
US20200380125A1 (en) Method for Detecting Libraries in Program Binaries
EP3051767A1 (en) Method and apparatus for automatically identifying signature of malicious traffic using latent dirichlet allocation
CN112041815A (en) Malware detection
WO2022114392A1 (en) Feature selection-based mobile malicious code classification method, and recording medium and device for performing same
CN104680065A (en) Virus detection method, virus detection device and virus detection equipment
CN113935033A (en) Feature-fused malicious code family classification method and device and storage medium
WO2022097898A1 (en) Malware detection model training method and malware detection method
CN112765428A (en) Malicious software family clustering and identifying method and system
CN112966713A (en) DGA domain name detection method and device based on deep learning and computer equipment
US20230161879A1 (en) Malicious code detection method and apparatus based on assembly language model
US11790085B2 (en) Apparatus for detecting unknown malware using variable opcode sequence and method using the same
WO2022107964A1 (en) Adjacent-matrix-based malicious code detection and classification apparatus and malicious code detection and classification method
CN110210216B (en) Virus detection method and related device
WO2023072002A1 (en) Security detection method and apparatus for open source component package
CN116361797A (en) Malicious code detection method and system based on multi-source collaboration and behavior analysis
WO2022107925A1 (en) Deep learning object detection processing device
CN114491528A (en) Malicious software detection method, device and equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21889363

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21889363

Country of ref document: EP

Kind code of ref document: A1