WO2022097898A1

WO2022097898A1 - Malware detection model training method and malware detection method

Info

Publication number: WO2022097898A1
Application number: PCT/KR2021/012224
Authority: WO
Inventors: 박우길
Original assignee: 영남대학교 산학협력단
Priority date: 2020-11-04
Filing date: 2021-09-08
Publication date: 2022-05-12
Also published as: KR102434899B1; KR20220060203A

Abstract

The present disclosure provides a malware detection model training method and a malware detection method. According to one aspect of the present disclosure, provided are a method for training a malware detection model and a method for detecting malware, in which the code of an app is converted into native code, instruction code pairs are extracted from the native code, and features commonly extracted by a plurality of feature extraction algorithms are obtained on the basis of the instruction code pairs.

Description

Malicious code detection model learning method and malicious code detection method

The present invention relates to a method for learning a malicious code detection model and a method for detecting a malicious code using a previously learned malicious code detection model.

The content described in this section merely provides background information for the present embodiment and does not constitute the prior art.

Because an Android app can be easily disassembled, malware can be easily embedded in it. To solve this problem, various malicious code detection methods have been proposed. One such method is the pattern matching technique. The pattern matching technique offers high accuracy and fast detection for malicious code that shows a known pattern, but has the disadvantage that malicious code can evade detection when the code is modified or a concealment technique is used.

Methods of evading or bypassing detection of malicious code include code obfuscation and the use of native code. Code obfuscation removes recognizable code patterns, for example by changing the code order, adding meaningless code, or abbreviating symbol information, while preserving the function of the Java bytecode. When malicious code is obfuscated, it is difficult to detect using the conventional pattern matching technique.

The use of native code means implementing the malicious code as native code rather than as Java bytecode, which makes it difficult to detect with a conventional malware scanner that mainly inspects Java bytecode.

Therefore, a malicious code detection method is needed that takes into account both code obfuscation and the fact that an Android app may include both Java bytecode and native code.

A main object of one aspect of the present disclosure is to provide a method for detecting malicious code and a method for training a malicious code detection model, in which the code of an app is converted into native code, instruction code pairs are extracted from the native code, and features commonly extracted by a plurality of feature extraction algorithms are obtained based on the instruction code pairs.

According to one aspect of the present disclosure, there is provided a method of detecting malicious code in an app using a pre-trained malicious code detection model, the method comprising: a process of converting Java bytecode included in the app into native code; a process of extracting instruction code pairs, each being a pair of consecutive instruction codes, from a code segment extracted from all native code of the app; a process of obtaining features commonly extracted by each of two or more feature extraction algorithms applied to the instruction code pairs; and a process of obtaining a malicious code detection result by using the obtained features as input data of the malicious code detection model.

According to another aspect of the present disclosure, there is provided a computer program stored in a computer-readable recording medium to execute each process of the above-described malicious code detection method.

According to another aspect of the present disclosure, there is provided a method for training a malicious code detection model for detecting malicious code in an app, the method comprising: a process of converting Java bytecode included in each app of a data set into native code; a process of extracting instruction code pairs, each being a pair of consecutive instruction codes, from a code segment extracted from all native code of the app; a process of obtaining features commonly extracted by each of two or more feature extraction algorithms applied to the instruction code pairs; and a process of training the malicious code detection model by machine learning based on the obtained features.

According to one aspect of the present disclosure, by providing a malicious code detection method and a malicious code detection model training method that convert an app's code into native code, extract instruction code pairs from the native code, and obtain features commonly extracted by a plurality of feature extraction algorithms based on the instruction code pairs, obfuscated malicious code and malicious code implemented as native code can be detected rapidly.

FIG. 1 is a flowchart illustrating a malicious code detection method and a learning method of a malicious code detection model according to an embodiment of the present disclosure.

FIG. 2 is a detailed flowchart of a learning method of a malicious code detection model according to an embodiment of the present disclosure.

FIGS. 3A and 3B are diagrams illustrating the performance of malicious code detection using a malicious code detection model according to an embodiment of the present disclosure.

Hereinafter, some embodiments of the present disclosure will be described in detail with reference to exemplary drawings. It should be noted that, in adding the reference numerals to the components of each drawing, the same components are to have the same reference numerals as much as possible even if they are displayed on different drawings. In addition, in describing the present disclosure, if it is determined that a detailed description of a related known configuration or function may obscure the gist of the present disclosure, the detailed description thereof will be omitted.

In addition, in describing the components of the present disclosure, terms such as first and second may be used. These terms are only for distinguishing one element from another, and the nature, sequence, or order of the elements is not limited by the terms. Throughout the specification, when a part 'includes' or 'comprises' an element, this means that other elements may be further included, rather than excluded, unless otherwise stated. In addition, terms such as 'unit' and 'module' mean a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

The detailed description set forth below in conjunction with the appended drawings is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure may be practiced.

FIG. 1 is a flowchart illustrating a method for detecting malicious code and a method for learning a malicious code detection model according to an embodiment of the present disclosure.

First, the Java bytecode of an app (e.g., an Android app) is converted into native code (S100). The conversion into native code may be performed using a conversion tool such as the Dalvik Virtual Machine (Dalvik VM) or the Android Runtime (ART) tool. As a result, the entire code of the app becomes native code, so that malicious code written in Java bytecode as well as malicious code written in native code can be detected with the malicious code detection model of the present disclosure alone.

To extract features without excessive data processing, only the code segment of the app's converted native code is kept, and the remaining resources and data are removed (S102). For native code that existed in the app before conversion, the native code of the app's shared library files may be extracted and merged with the converted native code before step S102 is performed.

Pairs of two consecutive instruction codes are extracted from the code segment as features (S104). Each instruction code may include all or part of an operation code (opcode), a mode, and an address field of an operand. It is preferable to exclude from extraction the instruction codes that are commonly and generally used in both malicious code and ordinary app code, or to remove them before the final feature extraction. For example, simple arithmetic operation codes may be removed. Excluding such common instruction codes reduces the computation time of the subsequent feature acquisition using the feature extraction algorithms.

An example of a method of extracting instruction code pairs is as follows. If instruction codes A, B, C, A, and D appear sequentially in a code segment, and C is a commonly used instruction code to be removed, the instruction code pairs finally extracted as features are (A, B) and (A, D). Here, (A, B) and (B, A) are treated as the same instruction code pair.
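
As a non-limiting illustration, the pairing in this example may be sketched in Python as follows. The function name `extract_pairs`, and the choice to remove the excluded codes before forming pairs, are assumptions made for this sketch and are not specified by the disclosure.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def extract_pairs(codes: Iterable[str], excluded: Set[str]) -> List[Pair]:
    """Return the distinct unordered pairs of consecutive instruction codes."""
    # Assumption: commonly used codes are dropped before pairs are formed.
    kept = [code for code in codes if code not in excluded]
    seen: Set[Pair] = set()
    pairs: List[Pair] = []
    for first, second in zip(kept, kept[1:]):
        # Sorting makes (A, B) and (B, A) map to the same pair.
        lo, hi = sorted((first, second))
        pair = (lo, hi)
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Codes A, B, C, A, D with C treated as a commonly used code to remove.
print(extract_pairs(["A", "B", "C", "A", "D"], excluded={"C"}))
# [('A', 'B'), ('A', 'D')]
```

After C is removed, the sequence A, B, A, D yields (A, B), (B, A), and (A, D); the second collapses into the first, which matches the result stated in the example.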

Although the number of instruction codes extracted as one feature is not necessarily limited to two, the feature becomes more susceptible to code obfuscation as the number of instruction codes increases. The number of instruction codes per feature therefore needs to be limited in order to maintain detection performance.

Using a plurality of feature extraction algorithms, features are extracted from the instruction code pairs extracted in step S104, and the features commonly extracted by each algorithm are obtained as input data of the malicious code detection model (S106). Conventionally, all of the features extracted by each feature extraction algorithm are used to detect malicious code, but this excessively increases the amount of computation and lowers the computation speed.

Examples of the feature extraction algorithm include the Pearson correlation algorithm, the mutual information algorithm, the Kendall correlation algorithm, the Spearman correlation algorithm, the chi-squared algorithm, and the Fisher score algorithm. In step S106, two or more of these known algorithms may be used to obtain the common features. Preferably, the algorithms extract features in parallel.
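
For illustration only, two of the listed algorithms may be sketched as scoring functions that rate how strongly one feature column relates to the malicious/benign label. The 0/1 encoding of features and labels, the function names, and the toy data are assumptions of this sketch; the disclosure does not fix how the algorithms are applied.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson_score(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation between a feature column and the labels."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot be correlated with anything
    return abs(cov / math.sqrt(var_x * var_y))


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Mutual information, in bits, between two discrete sequences."""
    n = len(x)
    joint: Dict[Tuple[int, int], int] = {}
    count_x: Dict[int, int] = {}
    count_y: Dict[int, int] = {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        count_x[a] = count_x.get(a, 0) + 1
        count_y[b] = count_y.get(b, 0) + 1
    return sum(
        (c / n) * math.log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in joint.items()
    )


labels = [1, 1, 0, 0]      # toy labels: 1 = malicious, 0 = benign
present = [1, 1, 0, 0]     # a pair that appears only in the malicious apps
unrelated = [1, 0, 1, 0]   # a pair whose presence is unrelated to the label

print(pearson_score(present, labels), mutual_information(present, labels))
print(pearson_score(unrelated, labels), mutual_information(unrelated, labels))
```

On this toy data the first column receives the maximum score from both functions and the second receives zero from both, so both algorithms would rank the first pair above the second.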

Such feature acquisition may be performed by extracting a preset number of features having high importance from each algorithm, and selecting only common features among the extracted features.
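
A minimal sketch of this selection, assuming each algorithm has already produced an importance score per instruction code pair, could look as follows. The names `top_k` and `common_features`, the pair names, and the score values are hypothetical and serve only to show the top-k-then-intersect step.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def top_k(scores: Dict[Pair, float], k: int) -> Set[Pair]:
    """Return the k features with the highest importance score."""
    ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)
    return set(ranked[:k])


def common_features(score_tables: List[Dict[Pair, float]], k: int) -> Set[Pair]:
    """Keep only the features that every algorithm places in its top k."""
    selected = [top_k(scores, k) for scores in score_tables]
    return set.intersection(*selected)


# Hypothetical scores from two feature extraction algorithms.
pearson = {("add", "ldr"): 0.9, ("bl", "mov"): 0.7, ("cmp", "str"): 0.2, ("ldr", "str"): 0.1}
mutual_info = {("add", "ldr"): 0.8, ("cmp", "str"): 0.6, ("bl", "mov"): 0.3, ("ldr", "str"): 0.05}

print(common_features([pearson, mutual_info], k=2))  # {('add', 'ldr')}
```

Here each algorithm proposes two features, but only one is proposed by both, so the model input shrinks from four candidate pairs to one.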

The features acquired in step S106 are used as input data of the pre-trained malicious code detection model to detect malicious code in the app (S108). Steps S100 to S108 are preferably performed before a device capable of executing the app, such as a terminal, executes the app, for example, at the time the app is downloaded to the device.

Alternatively, if the app of step S100 is an app included in a data set that is labeled with information such as the presence and location of malicious code in order to train the malicious code detection model, then in step S108 the malicious code detection model is trained by machine learning based on the features acquired in step S106. The malicious code detection model receives the features extracted by pre-processing the app's code and classifies the app. Any machine learning algorithm that a person skilled in the art can readily adopt in view of classification accuracy and classification speed may be employed as the machine learning algorithm in the present disclosure.
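
One way the acquired features could be presented to such a model is as a fixed-length vector per app. The sketch below assumes a simple 0/1 presence encoding over the selected instruction code pairs; this encoding, the function name `to_feature_vector`, and the sample pairs are assumptions for illustration and are not stated in the disclosure.

```python
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def to_feature_vector(app_pairs: Set[Pair], selected: Sequence[Pair]) -> List[int]:
    """Encode one app as 0/1 flags, one flag per selected instruction code pair."""
    return [1 if pair in app_pairs else 0 for pair in selected]


selected = [("add", "ldr"), ("bl", "mov")]    # hypothetical selected features
app_pairs = {("add", "ldr"), ("cmp", "str")}  # pairs found in one app

print(to_feature_vector(app_pairs, selected))  # [1, 0]
```

Because every app is encoded against the same ordered list of selected pairs, the vectors of different apps are directly comparable and can be used both for training and for detection.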

FIG. 2 is a flowchart illustrating a method for learning a malicious code detection model according to an embodiment of the present disclosure.

For the apk files of all apps in the data set, the dex files of the apk files, or the Optimized Ahead of Time (OAT) files of the dex files, the Java bytecode of the files is converted into native code (S200). Such conversion may be performed, for example, using the dex2oat tool, which compiles dex files into native code.

If it is determined that native code exists in the file before conversion (S210), the shared object files of the app are extracted and merged with the converted native code so that this native code is also used for feature extraction (S212).

After step S210 and/or step S212, the native code of the app is disassembled (S220). In this way, the assembly code of the app is obtained.
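
As a rough sketch of how instruction codes might be read out of the resulting assembly code, the Python below takes a text listing and returns the opcode mnemonics in order. It assumes a tab-separated listing in the style of GNU `objdump -d` output (address, raw bytes, instruction text); the sample listing is made up for illustration, and using only the mnemonic as the instruction code is a simplifying assumption.

```python
from typing import List


def mnemonics(listing: str) -> List[str]:
    """Return the opcode mnemonics of a disassembly listing, in order."""
    ops: List[str] = []
    for line in listing.splitlines():
        fields = line.split("\t")
        # Instruction lines have an "address:" field, a bytes field, and the
        # instruction text; symbol labels and blank lines do not.
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        text = fields[2].strip()
        if text:
            ops.append(text.split()[0].lower())
    return ops


# Made-up sample listing; addresses and bytes are placeholders.
LISTING = (
    "000010d4 <main>:\n"
    "    10d4:\te92d4800 \tpush\t{fp, lr}\n"
    "    10d8:\te28db004 \tadd\tfp, sp, #4\n"
    "    10dc:\teb000010 \tbl\t1124 <helper>\n"
    "    10e0:\te8bd8800 \tpop\t{fp, pc}\n"
)

print(mnemonics(LISTING))  # ['push', 'add', 'bl', 'pop']
```

Splitting on tabs rather than matching hexadecimal digits avoids confusing mnemonics such as `add`, which consist entirely of hexadecimal characters, with the raw byte column.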

A pair of consecutive instruction codes is extracted from the code segment of the assembly code (S230).

Using the instruction code pairs as input, all or some of the Pearson correlation algorithm, the mutual information algorithm, the Kendall correlation algorithm, the Spearman correlation algorithm, the chi-squared algorithm, and the Fisher score algorithm are executed to extract the features used to train the malicious code detection model (S240). These algorithms are examples of feature extraction algorithms; feature extraction is not limited to the algorithms listed for step S240, and any algorithm that a person skilled in the art can readily adopt for feature extraction may be employed as a feature extraction algorithm in the present disclosure.

Training data is extracted based on the extracted features, and a decision tree-based malicious code detection model is trained (S250). Such training may be performed either as supervised learning, based on a data set in which the extracted features are labeled as to whether malicious code is included, or as unsupervised learning, based on a data set that does not include malicious code.
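
The disclosure names a decision tree-based model without fixing its depth or split criterion. As a minimal sketch of the supervised case, the Python below trains a depth-1 decision tree (a decision stump) on 0/1 feature vectors using Gini impurity. The stump, the Gini criterion, the function names, and the toy data are all assumptions chosen to keep the example short; a practical model would typically use a deeper tree or a library implementation.

```python
from typing import Sequence, Tuple

Stump = Tuple[int, int, int]  # (feature index, label if absent, label if present)


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels: Sequence[int]) -> int:
    """Most frequent label; ties go to 1, and an empty branch defaults to 0."""
    return int(2 * sum(labels) >= len(labels)) if labels else 0


def train_stump(X: Sequence[Sequence[int]], y: Sequence[int]) -> Stump:
    """Pick the single feature whose split gives the lowest weighted impurity."""
    best_impurity, best = float("inf"), (0, 0, 0)
    for j in range(len(X[0])):
        absent = [label for row, label in zip(X, y) if row[j] == 0]
        present = [label for row, label in zip(X, y) if row[j] == 1]
        impurity = (len(absent) * gini(absent) + len(present) * gini(present)) / len(y)
        if impurity < best_impurity:
            best_impurity, best = impurity, (j, majority(absent), majority(present))
    return best


def predict(stump: Stump, row: Sequence[int]) -> int:
    j, label_absent, label_present = stump
    return label_present if row[j] == 1 else label_absent


# Toy training data: feature 0 separates malicious (1) from benign (0) apps.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
stump = train_stump(X, y)
print(stump, predict(stump, [1, 0]), predict(stump, [0, 1]))  # (0, 0, 1) 1 0
```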

The malicious code detection model is verified by extracting test data based on the extracted features (S260), and performance is evaluated (S270).
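
The disclosure does not define the metrics used in steps S260 and S270. As one hedged illustration, the sketch below computes a detection rate, taken here to mean the fraction of malicious test samples that are flagged, and a false positive rate from test labels and predictions. The function name `evaluate`, both metric definitions, and the toy values are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (detection rate, false positive rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate


# Toy test set: three malicious apps (1) and two benign apps (0).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
rate, fpr = evaluate(y_true, y_pred)
print(round(rate, 3), fpr)  # 0.667 0.5
```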

Although FIGS. 1 and 2 describe the processes as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present disclosure. In other words, a person of ordinary skill in the art to which an embodiment of the present disclosure pertains may, without departing from the essential characteristics of the embodiment, change the order described in FIGS. 1 and 2 or execute one or more of the processes in parallel. Because such modifications and variations are possible, the processes are not limited to the time-series order of FIGS. 1 and 2.

FIG. 3A(a) shows the data sets used to evaluate the malicious code detection performance of the present disclosure. Four data sets were used in this evaluation: a data set whose code is not obfuscated ("Un-obfuscated" in FIG. 3A(a)), a data set whose code is obfuscated ("Obfuscated" in FIG. 3A(a)), a data set in which the app code consists only of native code ("Native" in FIG. 3A(a)), and a data set in which native code, obfuscated code, and un-obfuscated code are mixed ("Mixed" in FIG. 3A(a)). Each data set consists of apps containing malicious code and apps without malicious code.

To compare the performance of the malicious code detection model of the present disclosure, existing malicious code detection tools, such as those based on the Adagio, MUDFLOW, and DroidNative algorithms, were evaluated alongside it. Here, Adagio is an algorithm that supports neither detection of code-obfuscated malware nor detection of malware composed of native code, DroidNative is an algorithm that supports only detection of native code-based malware, and DroidSieve is an algorithm that supports detection of code-obfuscated malware and only partially supports detection of malware composed of native code.

FIG. 3A(b) shows the results of learning and verification based on the un-obfuscated data set, FIG. 3A(c) shows the results based on the code-obfuscated data set, FIG. 3B(d) shows the results based on the data set consisting only of native code, and FIG. 3B(e) shows the results based on the data set in which native code, obfuscated code, and un-obfuscated code are mixed.

In (b) to (e) of FIGS. 3A and 3B, the malicious code detection method of the present disclosure is shown as “Proposed method (all segments w/ feature selection)”. A variant in which the feature selection process of the present disclosure (step S106 of FIG. 1, step S240 of FIG. 2) was not applied, shown as “Proposed method (all segments w/o feature selection)”, was also evaluated.

Referring to (b) and (c) of FIG. 3A, the detection rate is consistently better when the malicious code detection method of the present disclosure is used than when the other methods are used. In the performance evaluation based on the obfuscated data set, however, the average run time per sample is about 0.4 times longer than when the method without the feature selection process is applied. This is to be expected, because feature selection adds a further processing stage. On the other hand, the execution time of the malicious code detection method of the present disclosure is significantly shorter than that of the Adagio, DroidNative, and DroidSieve algorithms.

Referring to (d) of FIG. 3B, when performance is evaluated based on the data set consisting only of native code, the detection rate is the best when the malicious code detection method of the present disclosure is applied, and the execution time is significantly reduced compared with the Adagio, DroidNative, and DroidSieve algorithms.

Referring to (e) of FIG. 3B, in the performance evaluation based on the mixed data set, the malicious code detection method of the present disclosure achieves a high detection rate of 98.3%. This detection rate is 1% lower than that of the DroidSieve algorithm, which shows the best detection rate, but the average execution time per sample of the method of the present disclosure is only about 13% of that of the DroidSieve algorithm. The method therefore significantly reduces computation time while maintaining an excellent detection rate compared with existing algorithms.
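
As a rough reading of the execution-time figure quoted above, a per-sample time of about 13% of DroidSieve's corresponds to a speed-up of roughly 7.7 times. The symbols below are introduced here only for this arithmetic and do not appear in the disclosure.

```latex
\[
t_{\mathrm{proposed}} \approx 0.13\, t_{\mathrm{DroidSieve}}
\quad\Longrightarrow\quad
\frac{t_{\mathrm{DroidSieve}}}{t_{\mathrm{proposed}}} \approx \frac{1}{0.13} \approx 7.7
\]
```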

Various implementations of the devices, units, processes, steps, etc., described herein may be realized in digital electronic circuits, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose processor or a general-purpose processor) coupled to receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for a programmable processor and are stored on a "computer-readable recording medium".

The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. Such computer-readable recording media may be non-volatile or non-transitory media, such as ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, and storage devices, and may further include transitory media such as a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems, and the computer-readable code may be stored and executed in a distributed manner.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, or other types of storage systems or combinations thereof), and at least one communication interface. For example, the programmable computer may be one of a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

The above description is merely illustrative of the technical idea of this embodiment, and a person skilled in the art to which this embodiment belongs may make various modifications and variations without departing from the essential characteristics of the present embodiment. Accordingly, the present embodiments are intended to explain rather than limit the technical spirit of the present embodiment, and the scope of the technical spirit of the present embodiment is not limited by these embodiments. The protection scope of this embodiment should be interpreted by the following claims, and all technical ideas within the scope equivalent thereto should be interpreted as being included in the scope of the present embodiment.

Claims

A method of detecting malicious code in an app using a pre-trained malicious code detection model, the method comprising:

a process of converting Java bytecode included in the app into native code;

a process of extracting instruction code pairs, each being a pair of consecutive instruction codes, from a code segment extracted from all native code of the app;

a process of obtaining features commonly extracted by each of two or more feature extraction algorithms applied to the instruction code pairs; and

a process of obtaining a malicious code detection result by using the obtained features as input data of the malicious code detection model.
The method of claim 1, further comprising, when native code is originally included in the app, a process of merging the converted native code with the native code included in a shared library of the app after the conversion into native code.
The method of claim 1, wherein the code segment is a code segment of assembly code obtained by disassembling all native code of the app.
The method of claim 1, wherein the instruction code pair does not include a preset instruction code that is commonly used in both apps and malicious code.
The method of claim 1, wherein the process of obtaining the features comprises extracting a predetermined number of features from each of the feature extraction algorithms based on importance, and obtaining the features that are common among the extracted features.
The method of claim 1, wherein at least one of the two or more feature extraction algorithms is a Pearson correlation algorithm, a mutual information algorithm, a Kendall correlation algorithm, a Spearman correlation algorithm, a chi-squared algorithm, or a Fisher score algorithm.
A computer program stored in a computer-readable recording medium to execute each process of the malicious code detection method according to any one of claims 1 to 6.
A method for training a malicious code detection model for detecting malicious code in an app, the method comprising:

a process of converting Java bytecode included in each app of a data set into native code;

a process of extracting instruction code pairs, each being a pair of consecutive instruction codes, from a code segment extracted from all native code of the app;

a process of obtaining features commonly extracted by each of two or more feature extraction algorithms applied to the instruction code pairs; and

a process of training the malicious code detection model by machine learning based on the obtained features.
The method of claim 8, further comprising, when the apps of the data set include an app that originally contains native code, a process of merging the converted native code with the native code included in a shared library of that app after the process of converting into native code.
The method of claim 8, wherein the process of obtaining the features comprises extracting a predetermined number of features from each of the feature extraction algorithms based on importance, and obtaining the features that are common among the extracted features.