CN109271788A

CN109271788A - An Android malware detection method based on deep learning

Info

Publication number: CN109271788A
Application number: CN201810963774.2A
Authority: CN
Inventors: 罗森林; 张寒青; 潘丽敏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2019-01-25
Anticipated expiration: 2038-08-23
Also published as: CN109271788B

Abstract

The invention relates to an Android malware detection method based on deep learning and belongs to the technical field of computer and information science. The invention first performs feature extraction on Android application software, extracting security-relevant features by decompressing and decompiling the Android application file. The extracted features cover three aspects: file structure features, security experience features, and N-gram statistical features built from the Dalvik instruction set. The extracted features are then converted to numerical values to construct feature vectors. Finally, a DNN (Deep Neural Network) model is built on the extracted features, and new Android software is classified and identified by the model. Because the method incorporates instruction-set analysis, it is resistant to malware obfuscation. In addition, malware detection based on a deep model strengthens feature learning, expresses the rich internal information of large data sets well, and adapts more easily to constantly evolving malware.

Description

An Android malware detection method based on deep learning

Technical field

The invention relates to an Android malware detection method based on deep learning and belongs to the technical field of computer and information science.

Background art

With the continuous development of the mobile Internet, intelligent terminals have become an important part of everyday life. Android is the most popular mobile operating system, and its open and flexible ecosystem has led to a proliferation of malware. How to detect Android malware effectively is therefore a research topic of significant value. Current mainstream Android malicious code detection methods fall roughly into static detection methods and dynamic detection methods.

1. Dynamic detection method

Dynamic detection and analysis refers to running the program under test and then extracting features for detection and analysis. Dynamic detection methods mainly run the Android application file on an Android device and analyze the software using data collected during execution, such as API call sequences and resource usage. Although dynamic analysis has the advantage of not being affected by limiting factors such as code packing and obfuscation, in practice it suffers from difficult data collection and extraction, high run-time cost, low code coverage, and vulnerability to counter-detection, in which the malware detects the running environment. As a result, dynamic analysis is used relatively rarely for malware detection in practice.

2. Static detection method

Static detection mainly scans and analyzes the Android application file and extracts security-related sensitive information and features from it, such as sensitive permissions, system actions and sensitive system calls. The extracted features are then analyzed and summarized to judge whether the software is malicious. Compared with dynamic analysis, static analysis has higher code coverage and lower time overhead, and usually achieves good detection accuracy. It is also the mainstream detection method of current antivirus software. In real environments, however, Android developers often protect their code with operations such as obfuscation and encryption, in which case static analysis has difficulty extracting effective features and may misjudge the software. Meanwhile, malware evolves rapidly every year, and conventional detection methods struggle to adapt to the new malware that continually emerges.

To address the above problems, this work proposes a malware classification method based on deep learning. On the one hand, common static features of malware are extracted by analyzing the Android application file. On the other hand, the Android application file is decompiled to obtain smali source code, the Dalvik opcodes are extracted from the smali source code, the instruction set is abstracted, and N-gram sequence features are extracted. Finally, the extracted features are normalized and then modeled by a deep learning algorithm to identify malware. A detection system based on instruction-set analysis is resistant to malware obfuscation. Malware detection based on a deep model strengthens feature learning, expresses the rich internal information of large data sets well, and adapts more easily to constantly evolving malware.

Summary of the invention

The object of the invention is to solve the problems that conventional Android malware detection methods have low detection accuracy, a limited scope of application, and difficulty adapting to newly emerging software, and to propose a malware detection method based on deep learning.

The design principle of the invention is as follows. Feature extraction is first performed on the Android application software: security-relevant features are extracted by decompressing and decompiling the Android application file. The extracted features cover three aspects: file structure features, security experience features, and N-gram statistical features built from the Dalvik instruction set. The extracted features are then converted to numerical values to construct feature vectors. Finally, a DNN (Deep Neural Network) model is built on the extracted features, and new Android software is classified and identified by the model.

The technical solution of the invention is achieved by the following steps:

Step 1, positive and negative Android sample files are obtained, and the files are then preprocessed.

Step 1.1, a total of 24552 malicious Android samples are obtained from the malware library at http://amd.arguslab.org/behaviors, and 21000 normal applications are obtained from Android application markets.

Step 1.2, each application is decompressed, and files such as AndroidManifest.xml, the res files and classes.dex are extracted from the Android application for subsequent analysis.

Step 1.3, the classes.dex file is decompiled with the Androguard tool, and the Dalvik opcodes are then extracted.
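The decompression in step 1.2 can be illustrated with a minimal sketch. It relies only on the fact that an APK is a ZIP archive and uses the Python standard library; the function name `extract_apk_members` is hypothetical, and the decompilation of step 1.3 is not shown.

```python
import io
import zipfile


def extract_apk_members(apk_bytes: bytes) -> dict:
    """Return the manifest, the dex files and the resource paths of an APK.

    Hypothetical helper for illustration. The manifest inside a real APK
    is binary XML and still needs decoding before it can be parsed.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
        manifest = None
        if "AndroidManifest.xml" in names:
            manifest = apk.read("AndroidManifest.xml")
        return {
            "manifest": manifest,
            # An APK may hold classes.dex, classes2.dex, ... (multidex).
            "dex": {
                n: apk.read(n)
                for n in names
                if n.startswith("classes") and n.endswith(".dex")
            },
            "res": [n for n in names if n.startswith("res/")],
        }
```

Reading the archive from bytes keeps the sketch independent of the file system; for files on disk, `zipfile.ZipFile(path)` can be used directly.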

Step 2, feature extraction is performed on the Android application files.

Step 2.1, feature extraction is performed on the files obtained in step 1. The extracted features include file structure features, security experience features, and N-gram features of the abstracted Dalvik instruction set.

Step 2.2, the extracted features are then converted to numerical values and normalized to obtain the feature vector of each application.
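Step 2.2 can be sketched as follows. The text does not name the normalization scheme, so min-max scaling per feature column is an assumption here, and both function names are hypothetical. The dimensions used in the usage note (63 structural, 4 empirical and 1000 N-gram features, 1067 in total) are the ones given in the specific embodiment.

```python
def build_vector(structural, empirical, ngram):
    """Concatenate the three feature groups into one numeric vector.

    Booleans (for example "contains an executable file") become 0.0 or 1.0.
    """
    return [float(v) for v in (*structural, *empirical, *ngram)]


def min_max_normalize(rows):
    """Scale every column of equal-length numeric rows into [0, 1].

    Assumption: min-max scaling; a constant column is mapped to 0.0.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

For example, `build_vector([0] * 63, [True, False, 3, 2], [0] * 1000)` yields a vector of length 1067, and `min_max_normalize` is then applied to the list of all sample vectors.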

Step 3, a neural network classifier is constructed and used to recognize and detect software.

Step 3.1, the feature vectors in the database are divided into a test set and a training set, and the neural network is trained.

Step 3.2, new software is preprocessed and its features are extracted; it is then classified by the constructed neural network classifier, and the classification result completes the detection of the software.

Beneficial effect

Compared with traditional static analysis methods, the invention extracts richer software features, including file structure features, empirical features, and N-gram statistical features based on the Dalvik instruction set. These features characterize Android software more comprehensively and thereby achieve higher detection accuracy.

Compared with traditional machine learning classification algorithms, deep learning strengthens feature learning, expresses the rich internal information of large data sets well, and adapts more easily to constantly evolving malware.

Description of the drawings

Fig. 1 is a schematic diagram of the deep-learning-based Android malware detection method of the invention.

Specific embodiment

To better illustrate the objects and advantages of the invention, the method of the invention is described in further detail below with reference to an embodiment.

The detailed process is as follows:

Step 1, positive and negative Android sample files are obtained, and the files are then preprocessed.

Step 1.2, for each application, files such as AndroidManifest.xml, the res files and classes.dex are extracted from the Android application with the Androguard tool for subsequent analysis.

Step 1.3, the classes.dex file is decompiled with the Androguard tool, and the Dalvik opcodes of each smali file are then extracted.

Step 2, feature extraction is performed on the Android application files.

Step 2.1, feature extraction is performed on the files obtained in step 1. The first category is structural features, including the sensitive permissions requested by the APK, the system actions the application contains, and the numbers of activity, service, Broadcast Receiver and Content Provider components the application contains, for a total of 63 dimensions.

The second category is empirical features, which summarize experience gained from long-term detection and analysis of malicious APKs. They include whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files in the resource files of the APK, and the number of functions with more than 20 parameters. An ordinary installation package rarely contains an additional executable file, so an APK that does contain one very likely hides malicious code in it. Malware also tends to contain fewer image files, and functions with malicious intent tend to take more parameters in order to evade detection. The empirical features have 4 dimensions in total.

The third category is the N-gram features of the abstracted Dalvik instruction set. Analysis of malware indicates that the code implementing malicious intent is usually concentrated in a single malicious file. N-gram features are therefore counted per file, and the per-file N-gram features are then weighted to form the final feature vector. First, the Dalvik instructions are divided into 10 classes according to their functional characteristics; the classification is shown in Table 1. Then, for each smali file in the APK, the symbol sequence of the abstracted Dalvik instructions is collected, and N-gram processing is applied to the sequence; 3-grams are chosen here. If an APK has n smali files, each file yields a 1000-dimensional statistical feature, denoted $F_n$ for the n-th file and shown in formula (1), where $f_{nk}$ denotes the count of the k-th feature in the n-th file.

$$F_n = [f_{n0}, f_{n1}, f_{n2}, \ldots, f_{n999}] \tag{1}$$
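The per-file statistic of formula (1) can be sketched as below. With 10 instruction classes there are 10^3 = 1000 possible 3-grams, which matches the 1000 dimensions. Table 1 is not reproduced in this text, so the class symbols `A` to `J` and the small opcode-to-class mapping are hypothetical placeholders, not the patent's actual classification.

```python
from itertools import product

# Hypothetical symbols for the 10 instruction classes.
SYMBOLS = "ABCDEFGHIJ"

# Hypothetical, deliberately incomplete opcode-to-class mapping.
OPCODE_CLASS = {
    "move": "A", "return": "B", "goto": "C", "if-eq": "D",
    "invoke-virtual": "E", "iget": "F", "iput": "G",
    "add-int": "H", "const": "I", "new-instance": "J",
}

# Fixed position of every possible 3-gram in the feature vector.
TRIGRAM_INDEX = {
    "".join(t): i for i, t in enumerate(product(SYMBOLS, repeat=3))
}


def trigram_vector(opcodes):
    """Count the 3-grams of the abstracted opcode sequence of one file.

    Opcodes missing from the mapping are skipped in this sketch.
    """
    symbols = [OPCODE_CLASS[op] for op in opcodes if op in OPCODE_CLASS]
    vec = [0] * len(TRIGRAM_INDEX)
    for i in range(len(symbols) - 2):
        vec[TRIGRAM_INDEX["".join(symbols[i:i + 3])]] += 1
    return vec
```

Fixing the index of each 3-gram in advance keeps the vectors of different files and different APKs directly comparable.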

The n 1000-dimensional features are then weighted and normalized to obtain a new 1000-dimensional feature F, which serves as the final Dalvik bytecode N-gram statistical feature, as shown in formula (2).

$$F = [k_0, k_1, \ldots, k_m, \ldots, k_{999}] \tag{2}$$

In the formula, $k_m$ denotes the m-th feature value of the new statistical feature.
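The step from the per-file features of formula (1) to the combined feature of formula (2) can be sketched as a weighted sum followed by normalization. The text does not state how the weights are chosen or which normalization is used, so the equal default weights and the scaling to unit sum are assumptions, and the function name is hypothetical.

```python
def combine_file_features(file_vectors, weights=None):
    """Weighted sum of per-file vectors, scaled so the result sums to 1.

    Assumptions: equal weights when none are given; unit-sum scaling.
    """
    if not file_vectors:
        return []
    if weights is None:
        weights = [1.0] * len(file_vectors)
    combined = [0.0] * len(file_vectors[0])
    for w, vec in zip(weights, file_vectors):
        for m, value in enumerate(vec):
            combined[m] += w * value
    total = sum(combined)
    return [k / total for k in combined] if total else combined
```

A weighting that emphasizes suspicious files would be passed in through `weights`; the sketch leaves that choice open because the source does not specify it.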

Table 1: Definitions of the abstracted instruction symbols

Step 2.2, the features extracted in step 2.1 are converted to numerical values and normalized to obtain the feature vector of each application. Let the database be D; it can then be represented by the following matrix.
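The matrix itself is not reproduced in this text. One form consistent with the surrounding description, in which the symbols $x_{ij}$ (the j-th feature of the i-th sample) and $y_i$ (its target value) are assumptions introduced here, would be:

```latex
D = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2p} & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np} & y_n
\end{bmatrix},
\qquad y_i \in \{0, 1\}
```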

Here, database D contains n samples, each sample has p attribute dimensions, and each sample has a target value Y. The target value is 0 or 1, where 1 denotes a positive sample and 0 denotes a negative sample. In this method, the feature dimension of each sample is 1067 and n is 45120.

Step 3.1, the feature vectors in the database are divided into a test set and a training set, and the neural network is trained.
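The text does not disclose the network architecture, so the following is only a minimal sketch of how a feed-forward classifier maps a 1067-dimensional feature vector to a malware probability. The layer sizes, the ReLU and sigmoid activations, and the random initialization are all assumptions, and training (for example by backpropagation) is not shown.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create randomly initialized (weights, biases) pairs, one per layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [
            [rng.gauss(0.0, scale) for _ in range(n_in)]
            for _ in range(n_out)
        ]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict_proba(layers, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    for depth, (weights, biases) in enumerate(layers):
        z = [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)
        ]
        if depth == len(layers) - 1:
            x = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            x = [max(0.0, v) for v in z]
    return x[0]
```

With the hypothetical sizes `[1067, 64, 32, 1]`, `predict_proba(init_layers([1067, 64, 32, 1]), features)` returns a value between 0 and 1; a threshold such as 0.5 would turn it into the 0/1 target value. In practice a deep learning framework would replace this hand-written forward pass.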

Test results: a total of 45120 positive and negative samples were used in the experiment (some files were corrupted and could not be processed), of which 23511 are malicious samples and 21609 are normal applications. 5-fold cross-validation was then performed on the entire data set with the constructed neural network model. The resulting average accuracy, average recall and average F value for the positive and negative samples are shown in Table 2. The experimental data in the table show that the method has high detection accuracy and achieves a good detection effect.

Table 2: Experimental test results
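The evaluation protocol can be sketched with the standard library alone. The fold assignment, the seed and the function names are hypothetical, and the metrics computed are precision, recall and F1, which is one common reading of the accuracy, recall and F value reported above (an assumption). The values of Table 2 are not reproduced here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For each fold, the classifier would be trained on the `train` indices and scored on the `test` indices, and the three metrics would then be averaged over the 5 folds.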

The specific description above further explains the objects, technical solution and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the invention and is not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. An Android malware detection method based on deep learning, characterized in that the method comprises the following steps:

Step 1, positive and negative Android sample files are obtained and the files are then preprocessed, comprising: selecting positive and negative Android applications and decompressing each APK file to obtain all of the files it contains; then decompiling the classes.dex file and extracting the Dalvik opcodes of every smali file in each APK;

Step 2, feature extraction is performed on the files obtained in step 1 to obtain the feature vector of each piece of software, comprising: processing the files obtained in step 1 to obtain the empirical features and structural features of the APK file and the N-gram statistical features of the Dalvik instruction set, and converting these features to numerical values and normalizing them to obtain the feature vector of each piece of software;

Step 3, a classification model is constructed from the data extracted in step 2, the model being evaluated on the data set by 5-fold cross-validation during construction, and software is finally recognized and detected with the constructed neural network classifier.

2. The Android malware detection method based on deep learning according to claim 1, characterized in that: when the features of an Android application are extracted in step 2, the extracted features include structural features, empirical features, and N-gram features of the abstracted Dalvik instruction set; the structural features include the sensitive permissions requested by the APK, the system actions the application contains, and the numbers of activity, service, Broadcast Receiver and Content Provider components the application contains, for a total of 63 dimensions; the empirical features summarize experience gained from long-term detection and analysis of malicious APKs and include whether the resource files contain an executable file, whether the assets folder contains an APK file, the number of image files in the resource files of the APK, and the number of functions with more than 20 parameters, for a total of 4 dimensions; the 3-gram statistical features of the abstracted Dalvik instruction set have 1000 dimensions.

3. The Android malware detection method based on deep learning according to claim 1, characterized in that: when the N-gram features are counted in step 2, it is taken into account that the functional code with which malware implements its malicious intent is usually concentrated in a single malicious file; the N-gram features are therefore counted per smali file, and the N-gram features counted for each file are then weighted and normalized to form the final feature vector.

4. The Android malware detection method based on deep learning according to claim 1, characterized in that: a neural network model is used when the malware classification model is constructed in step 3; on the one hand, a deep neural network model is suited to processing high-dimensional input data; on the other hand, deep learning strengthens feature learning, so that during model construction the 1067-dimensional features extracted from the APK are combined and transformed and the deeper connections between features are mined automatically, in order to adapt to constantly evolving malware and achieve higher malware detection accuracy.