CN109271788B

CN109271788B - Android malicious software detection method based on deep learning

Info

Publication number: CN109271788B
Application number: CN201810963774.2A
Authority: CN
Inventors: 罗森林; 张寒青; 潘丽敏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2021-10-12
Anticipated expiration: 2038-08-23
Also published as: CN109271788A

Abstract

The invention relates to an Android malware detection method based on deep learning, and belongs to the technical field of computers and information science. The method first extracts features from Android application software: the Android application files are decompressed and decompiled to extract security-related features. The extracted features cover 3 aspects: file structure features, security experience features, and N-gram statistical features formed from the Dalvik instruction set. The extracted features are then converted to numerical form to construct feature vectors. Finally, a DNN (deep neural network) model is constructed from the extracted features, and new Android software is classified and identified by the constructed model. Because the method incorporates instruction-set analysis, it is resistant to malware obfuscation. Malware detection based on a deep model can strengthen feature learning, express the rich intrinsic information of large data sets, and adapt more easily to continuously evolving malware.

Description

Android malicious software detection method based on deep learning

Technical Field

The invention relates to an Android malicious software detection method based on deep learning, and belongs to the technical field of computers and information science.

Background

With the continuous development of the mobile internet, intelligent terminals have become an important part of everyone's life. Android is the most widely used mobile operating system, and its open and flexible ecosystem has led to a flood of malware. How to detect Android malware effectively is therefore a research topic of considerable value. Currently, mainstream Android malicious code detection methods are broadly divided into static detection methods and dynamic detection methods.

1. Dynamic detection method

Dynamic detection and analysis extracts features and performs detection and analysis after running the program under test. The dynamic detection method mainly runs an Android application file on an Android device and then analyzes the software by collecting data produced during execution, such as the API (application programming interface) call sequence and resource usage. Although dynamic analysis has the advantage of being unaffected by limiting factors such as code packing and obfuscation, in actual use it suffers from difficult data acquisition and extraction, high software running cost, low code coverage, and malware that evades analysis by detecting the running environment. Detecting malware by dynamic analysis is therefore less used in practice.

2. Static detection method

The static detection method mainly scans and analyzes the Android application file and extracts security-related sensitive information and features from it, such as sensitive permissions, system actions, and sensitive system calls. These extracted features are then analyzed and generalized to judge whether the application is malware. Compared with dynamic analysis, static analysis has higher code coverage and lower time overhead, and can generally achieve better detection accuracy. It is also the mainstream detection method used by antivirus software at present. In practice, however, Android application developers often obfuscate and encrypt their code to protect it; in such cases it is difficult to extract effective features through static analysis, which leads to misjudgment. Meanwhile, malware evolves rapidly every year, and conventional detection methods have difficulty adapting to the new malware that continually emerges.

In view of the above problems, the present invention proposes a method for classifying malware based on deep learning. On the one hand, static features common to malware are extracted by analyzing the Android application file. On the other hand, Smali source code is obtained by decompiling the Android application file, Dalvik opcodes are extracted from the Smali source code, the instruction set is abstracted, and N-gram sequence features are extracted. Finally, after the extracted features are normalized, abstract modeling is carried out with a deep learning algorithm to complete the identification of malware. A detection system based on instruction-set analysis is resistant to malware obfuscation. Malware detection based on a deep model can strengthen feature learning, express the rich intrinsic information of large data sets, and adapt more easily to continuously evolving malware.

Disclosure of Invention

The invention aims to solve the problems that conventional Android malware detection methods have low detection accuracy, a limited range of application, and difficulty adapting to newly emerging software, and provides a malware detection method based on deep learning.

The design principle of the invention is as follows. First, features are extracted from the Android application software: the Android application file is decompressed and decompiled to extract security-related features. The extracted features cover 3 aspects: file structure features, security experience features, and N-gram statistical features formed from the Dalvik instruction set. The extracted features are then converted to numerical form to construct feature vectors. Finally, a DNN (deep neural network) model is constructed from the extracted features, and new Android software is classified and identified by the constructed model.

The technical scheme of the invention is realized by the following steps:

Step 1, obtaining Android positive and negative sample files, and then preprocessing the files.

Step 1.1, 2452 malicious Android software libraries are obtained from http://amd.argussab.org/bheviors, and 21000 normal software libraries are obtained from the Android market.

Step 1.2, decompressing each application and extracting files such as AndroidManifest.xml.

Step 1.3, decompiling the classes.dex file.

Step 2, extracting the features of the Android application file.

Step 2.1, extracting features from the files obtained in step 1, the extracted features comprising file structure features, security experience features, and N-gram features abstracted from the Dalvik instruction set.

Step 2.2, digitizing the extracted features and normalizing them to obtain a feature vector for each application.

Step 3, constructing a neural network classifier and identifying and detecting software.

Step 3.1, dividing the database into a training set and a test set according to the feature vectors and training a neural network.

Step 3.2, preprocessing new software and extracting its features, then classifying it with the constructed neural network classifier and giving the classification result to complete the detection of the software.

Advantageous effects

Compared with traditional static analysis methods, the method extracts richer software features, including file structure features, empirical features, and N-gram statistical features based on the Dalvik instruction set. These features represent the characteristics of Android software more comprehensively and thereby achieve higher detection accuracy.

Compared with traditional machine learning classification algorithms, deep learning can strengthen feature learning, express the rich intrinsic information of large data sets, and adapt more easily to continuously evolving malware.

Drawings

FIG. 1 is a schematic diagram of an Android malware detection method based on deep learning.

Detailed Description

In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.

The specific process is as follows:

Step 1, obtaining Android positive and negative sample files, and then preprocessing the files.

Step 1.2, extracting files such as the AndroidManifest.xml file, res files, and classes.dex files from each application with the Androguard tool for subsequent analysis.
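Because an APK is a ZIP archive, the extraction in step 1.2 can be sketched with Python's standard `zipfile` module. This is only an illustration and not the procedure of the tool named in step 1.2; the function name `extract_apk_members` and the choice of returned members are assumptions made here. Note that the manifest inside an APK is stored in a binary XML encoding, so the bytes returned below still need a decoder before they can be read as text.

```python
import io
import zipfile


def extract_apk_members(apk_bytes):
    """Return the raw manifest, the classes*.dex files, and res/ and assets/ entry names.

    Hypothetical helper for illustration; it only unpacks the archive.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
        # The manifest is binary XML inside an APK; it is returned undecoded.
        manifest = apk.read("AndroidManifest.xml") if "AndroidManifest.xml" in names else None
        # An APK may carry classes.dex, classes2.dex, and so on.
        dex_files = {
            name: apk.read(name)
            for name in names
            if name.startswith("classes") and name.endswith(".dex")
        }
        # Resource and asset paths are kept for later feature counting.
        resources = [name for name in names if name.startswith(("res/", "assets/"))]
    return manifest, dex_files, resources
```

The returned resource paths are the kind of input from which counts such as the number of image files could later be derived.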

Step 1.3, decompiling the classes.dex file with the Androguard tool, and then extracting the Dalvik opcodes of each Smali file.

Step 2, extracting the features of the Android application file.
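Once a Smali file is available as text, collecting its opcode sequence as in step 1.3 can be sketched as follows. The function `extract_opcodes` is a hypothetical helper written for illustration, not part of any decompiler; it assumes standard Smali layout, in which directives begin with `.`, labels with `:`, and comments with `#`, and it keeps the first token of every remaining line inside a method body.

```python
def extract_opcodes(smali_text):
    """Collect Dalvik opcode mnemonics from the method bodies of one Smali file."""
    opcodes = []
    in_method = False
    in_annotation = False
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        if line.startswith(".method"):
            in_method = True
        elif line.startswith(".end method"):
            in_method = False
        elif line.startswith(".annotation"):
            in_annotation = True
        elif line.startswith(".end annotation"):
            in_annotation = False
        elif in_method and not in_annotation and line and line[0].isalpha():
            # Directives, labels, comments and data-table entries do not start
            # with a letter, so only instruction lines reach this branch.
            opcodes.append(line.split()[0])
    return opcodes
```

Applied to a method containing `const/4`, `invoke-virtual` and `return-void` instructions, the helper returns those three mnemonics in order.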

Step 2.1, performing feature extraction on the files obtained in step 1.

The first type is structural features, 63 dimensions in total, including the sensitive permissions requested by the APK, the system actions contained in the application, and the numbers of Activity, Service, Broadcast Receiver, and Content Provider components contained in the application.

The second type is empirical features, which summarize experience from long-term detection and analysis of malicious APKs: whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files among the resource files of the APK, the number of functions with more than 20 parameters, and the like. In general, when an APK carries an additional executable file in its installation package, malicious code is hidden in that executable file with high probability. Malware also tends to contain fewer image files, and functions with malicious tendencies tend to take more parameters in order to evade detection. The empirical features amount to 4 dimensions.

The third type is N-gram features abstracted from the Dalvik instruction set. Analysis of malware shows that the code realizing its malicious intent is concentrated in a malicious file, so a single file is taken as the unit when N-gram features are counted, and the counted N-gram features are then weighted to serve as the final feature vector. The Dalvik instructions are first divided into 10 classes according to their functional characteristics; the specific classification is shown in Table 1. Then, for each Smali file in the APK, the symbol sequence obtained after abstracting the Dalvik instructions is collected. The sequence is then processed into N-grams, with 3-grams selected here. For example, if an APK has n Smali files, each file has a 1000-dimensional statistical feature, denoted $F_n$, whose concrete form is shown in formula (1), where $f_{nk}$ represents the k-th feature statistic of the n-th file.

$$F_n = [f_{n0}, f_{n1}, f_{n2}, \ldots, f_{n999}] \tag{1}$$
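The per-file statistic of formula (1) can be sketched as a count over 3-grams of class symbols. With 10 instruction classes there are 10³ = 1000 possible 3-grams, which appears consistent with the stated 1000 dimensions. The symbol alphabet `SYMBOLS` and the mapping `OPCODE_CLASS` below are hypothetical stand-ins, since the actual classes are those defined in Table 1.

```python
from itertools import product

# Hypothetical 10-symbol alphabet; the real classes are defined in Table 1.
SYMBOLS = "ABCDEFGHIJ"

# Hypothetical opcode-to-class mapping covering only a few mnemonics.
OPCODE_CLASS = {
    "move": "A", "return-void": "B", "goto": "C", "if-eqz": "D",
    "invoke-virtual": "E", "iget": "F", "iput": "G", "add-int": "H",
    "const": "I", "new-instance": "J",
}

# Fixed position of every possible 3-gram in the 1000-dimensional vector.
TRIGRAM_INDEX = {"".join(t): i for i, t in enumerate(product(SYMBOLS, repeat=3))}


def trigram_vector(opcodes):
    """Count 3-grams of class symbols for the opcode sequence of one Smali file."""
    # Suffixes such as /4 or /range are dropped before the lookup. Opcodes
    # missing from this toy mapping are skipped; a full mapping would cover
    # the whole instruction set.
    symbols = [
        OPCODE_CLASS[op.split("/")[0]]
        for op in opcodes
        if op.split("/")[0] in OPCODE_CLASS
    ]
    vector = [0] * len(TRIGRAM_INDEX)
    for i in range(len(symbols) - 2):
        vector[TRIGRAM_INDEX["".join(symbols[i:i + 3])]] += 1
    return vector
```

Each call yields one vector of the form $F_n$ for a single file; files with fewer than three mapped opcodes yield the all-zero vector.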

Then the n 1000-dimensional features are weighted and normalized to obtain a new 1000-dimensional feature F, which serves as the final Dalvik bytecode N-gram statistical feature; its specific form is shown in formula (2).

$$F = [k_0, k_1, \ldots, k_m, \ldots, k_{999}] \tag{2}$$

In the formula, $k_m$ represents the m-th feature value in the new statistical feature.
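The step from the per-file vectors to the single vector F of formula (2) can be sketched as a weighted sum followed by normalization. The text does not state the weights or the normalization, so both choices below (uniform weights by default, division by the total so that the entries sum to 1) are assumptions made purely for illustration, as is the function name `aggregate_file_features`.

```python
def aggregate_file_features(file_vectors, weights=None):
    """Combine per-file N-gram vectors into one normalized vector.

    Assumed scheme: weighted sum over files, then division by the total.
    """
    if not file_vectors:
        raise ValueError("at least one per-file vector is required")
    n = len(file_vectors)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: every file counts equally
    if len(weights) != n:
        raise ValueError("one weight per file is required")
    combined = [0.0] * len(file_vectors[0])
    for weight, vector in zip(weights, file_vectors):
        for m, value in enumerate(vector):
            combined[m] += weight * value
    total = sum(combined)
    # An all-zero input is returned unchanged to avoid dividing by zero.
    return [k / total for k in combined] if total else combined
```

Passing explicit `weights` allows a different scheme, for example one that emphasizes files suspected of carrying the malicious code.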

TABLE 1 Definitions of the instruction symbols

Step 2.2, digitizing the features extracted in step 2.1 and normalizing them to obtain a feature vector for each application. Suppose there is a database D; it can be represented by the following matrix.
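The matrix itself is not reproduced in this text. One layout consistent with the description, with one row per sample, one column per attribute and a final target column, would be the following; the symbols $x_{ij}$ and $y_i$ are notation assumed here for illustration.

```latex
D =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2p} & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np} & y_n
\end{bmatrix}
```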

The database D has n samples, each sample has p attribute dimensions, and each sample has a target value Y. Here the target value is 0 or 1, where 1 denotes a positive sample and 0 denotes a negative sample. In this method the feature dimension of each sample is 1067, and n is 45120.
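The dimension 1067 matches the sum of the three feature groups, 63 + 4 + 1000. A minimal sketch of assembling one row of D and normalizing the columns follows. The function names are hypothetical, and min-max scaling is an assumption, since the text does not say which normalization is used.

```python
def build_sample(structural, empirical, ngram, label):
    """Concatenate the three feature groups into one 1067-dimensional row."""
    if (len(structural), len(empirical), len(ngram)) != (63, 4, 1000):
        raise ValueError("expected 63 structural, 4 empirical and 1000 N-gram values")
    if label not in (0, 1):
        raise ValueError("the target value must be 0 or 1")
    return list(structural) + list(empirical) + list(ngram), label


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0 (assumed scheme)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

In practice the column minima and maxima would be taken from the training data and reused when a new application is classified, so that new samples are scaled the same way.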

Step 3.1, dividing the database into a training set and a test set according to the feature vectors and training a neural network.
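A fully connected network of the kind trained in step 3.1 can be sketched in NumPy as below. This is a minimal illustration under stated assumptions and not the network of the invention: the text does not give the number or width of the hidden layers, the activation functions, the loss, or the optimizer, so ReLU hidden layers, a sigmoid output, binary cross-entropy and plain full-batch gradient descent are all choices made here.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """He-initialized weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(params, X):
    """Return the activations of every layer; the last entry is P(label = 1)."""
    activations = [X]
    for i, (W, b) in enumerate(params):
        z = activations[-1] @ W + b
        is_output = i == len(params) - 1
        activations.append(1.0 / (1.0 + np.exp(-z)) if is_output else np.maximum(z, 0.0))
    return activations


def bce_loss(params, X, y):
    """Mean binary cross-entropy of the network on (X, y)."""
    p = np.clip(forward(params, X)[-1].ravel(), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def train_step(params, X, y, lr=0.1):
    """One full-batch gradient-descent step using backpropagation."""
    activations = forward(params, X)
    # For sigmoid plus cross-entropy, dL/dz at the output is (p - y) / N.
    delta = (activations[-1] - y.reshape(-1, 1)) / len(X)
    updated = []
    for i in reversed(range(len(params))):
        W, b = params[i]
        grad_W = activations[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            # Propagate through W and the ReLU of the previous layer.
            delta = (delta @ W.T) * (activations[i] > 0)
        updated.append((W - lr * grad_W, b - lr * grad_b))
    return updated[::-1]
```

For the feature vectors described here, `layer_sizes` would begin with 1067 and end with 1, for example `init_mlp([1067, 256, 64, 1])`, where the hidden widths 256 and 64 are arbitrary placeholders.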

Test results: a total of 45120 positive and negative samples were used in the experiment (some files were damaged and could not be extracted), of which 23511 were malicious samples and 21609 were normal software samples. 5-fold cross-validation was then performed on the entire data set with the constructed neural network model. The average accuracy, average recall, and average F-measure for the positive and negative samples are shown in Table 2. The experimental data in the table show that the method achieves high detection accuracy and a good detection effect.
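The evaluation protocol can be sketched as follows: the samples are shuffled into five folds, each fold serves once as the test set, and the per-fold scores are averaged. The function names are hypothetical, and the exact definitions behind the reported accuracy, recall and F-measure are not given in the text, so the usual definitions are assumed.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Calling `binary_metrics` once with `positive=1` and once with `positive=0` on each fold, then averaging over the five folds, gives separate averages for the positive and the negative samples.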

TABLE 2 test results

The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A deep learning-based Android malware detection method, characterized by comprising the following steps:

step 1, obtaining Android positive and negative sample files and then preprocessing the files, comprising: selecting positive and negative Android application software, and decompressing each APK file to obtain all files in the APK file; decompiling the classes.dex file, and extracting the Dalvik opcodes in each Smali file in each APK;

step 2, extracting features from the files obtained in step 1 to obtain a feature vector for each piece of software, comprising: processing the files obtained in step 1 to obtain the empirical features and structural features of the APK file and the N-gram statistical features of the Dalvik instruction set, and digitizing and normalizing the features to obtain a feature vector for each piece of software, wherein the structural features comprise 63 dimensions, including the sensitive permissions requested by the APK and the numbers of Activity, Service, Broadcast Receiver and Content Provider components contained in the application; the empirical features mainly comprise features summarized from experience in long-term detection and analysis of malicious APKs, namely whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files among the resource files of the APK file, and the number of functions with more than 20 parameters, 4 dimensions in total; the N-gram statistical features of the Dalvik opcodes are 1000-dimensional, and, considering that the functional code by which malware realizes its malicious intent is concentrated in a malicious file, a single Smali file is taken as the unit when the N-gram features are counted, and the N-gram features counted for each file are then weighted and normalized to serve as the final feature vector;

step 3, constructing a classification model from the data extracted in step 2, evaluating the model on the data set by 5-fold cross-validation during construction, and finally constructing a neural network classifier and identifying and detecting software, wherein a fully connected deep neural network is adopted when constructing the malware classification model, so that the model is suited to high-dimensional input data; furthermore, deep learning can strengthen feature learning, the 1067-dimensional features extracted from the APK can be combined and transformed during model learning, and deep relations among the features are mined automatically, so as to adapt to continuously evolving malware and achieve higher malware detection accuracy.