CN110245493A

CN110245493A - Method for Android malware detection based on a deep belief network

Info

Publication number: CN110245493A
Application number: CN201910431019.4A
Authority: CN
Inventors: 芦天亮; 李国友; 杜彦辉; 欧阳立; 吴警; 张翼翔; 暴雨轩
Original assignee: State Cryptography Administration Commercial Code Testing Center; CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Current assignee: State Cryptography Administration Commercial Code Testing Center; CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Priority date: 2019-05-22
Filing date: 2019-05-22
Publication date: 2019-09-17

Abstract

The application proposes a method for Android malware detection based on a deep belief network. First, permission and sensitive-API features of an Android application are extracted. Second, a deep learning model is constructed using a deep belief network (DBN), and the extracted features are processed by the model to obtain samples that represent high-level abstract features. A classification algorithm then classifies the high-level abstract features output by the deep learning model, distinguishing malware from normal software. The DBN-based deep learning model of the invention characterizes the high-level abstract features of Android malware more effectively, and its detection performance is significantly better than that of traditional neural network models and machine learning models.

Description

Method for Android malware detection based on a deep belief network

Technical field

The present invention relates to the field of network security, and more particularly to a method for Android malware detection based on a deep belief network.

Background art

The Android operating system is a Linux-based operating system whose development is led by Google and the Open Handset Alliance. Compared with the operating systems of other smart terminals, Android is fully open source, and its application markets are complex and diverse, so the amount of Android malware has grown rapidly. Much Android malware induces users to install it, then downloads large numbers of new malicious applications, consumes mobile data traffic, and sends premium-rate SMS messages, posing a serious security threat. In addition, some otherwise normal Android applications request excessive and improper permissions to obtain information and thereby collect users' private data. Detecting Android malware is therefore increasingly important.

Currently, Android malware detection techniques fall mainly into static detection and dynamic detection. (1) Static detection determines whether an application contains malicious code without executing it, and is generally implemented by disassembling the Android application. Enck et al. disassembled Android applications and analyzed the recovered source code to find code vulnerabilities. Yang et al. proposed the static detection framework AppContext, which classifies applications according to the context in which security-sensitive behavior is triggered. Static detection extracts an application's static features quickly, by decompilation and similar means, and performs detection on them; its drawback is that the detection patterns extend poorly. (2) Dynamic detection monitors an application's behavior comprehensively while the Android application is executing. Dynamic detection techniques obtain the information to be examined by running the application in a sandbox or a real environment. DroidScope dynamically analyzes applications in a protected runtime environment. Dini et al. proposed the dynamic detection framework MADAM, which monitors applications at both the Android kernel level and the user level. Dynamic detection is more accurate, but it consumes more resources at run time and is less efficient.

Both static and dynamic detection usually require Android malware detection rules to be created and updated manually, so some newly emerging malware cannot be detected effectively. To identify unknown malware accurately, machine learning has begun to be applied to Android malware detection. DroidAPIMiner uses machine learning algorithms to analyze API-level features of Android applications. Zhao et al. proposed a feature selection algorithm based on feature frequency. Among traditional machine learning algorithms, the support vector machine (Support Vector Machine, SVM) algorithm is commonly used for Android malware detection based on feature selection. Because traditional machine learning algorithms usually have shallow architectures, they cannot effectively derive a high-level representation of Android software from correlated features.

The main problems of Android malware detection in the prior art are therefore as follows:

(1) Detection patterns based on static detection extend poorly, and more and more Android malware bypasses static detection by means such as repackaging.

(2) Android malware detection based on dynamic detection consumes many resources at run time, is inefficient, and has difficulty detecting Android applications that have never been seen before.

(3) In Android malware detection based on traditional machine learning algorithms, the learning structure is mostly shallow and cannot produce a high-level abstract feature representation of malware, so the detection performance is unsatisfactory.

The present invention therefore performs Android malware detection with a deep learning model in order to solve the above technical problems of the prior art.

Summary of the invention

The present invention provides a method for Android malware detection based on a deep belief network, in order to overcome the defects of the malware detection methods of the prior art.

The present invention provides a method for Android malware detection based on a deep belief network, characterized in that the method comprises the following steps:

extracting permission and sensitive-API features of an Android application;

constructing a deep learning model using a deep belief network (DBN), and processing the extracted features with the deep learning model to obtain samples that represent high-level abstract features;

using a classification algorithm to classify the samples output by the deep learning model, distinguishing malware from normal software.

In the method of the invention, extracting the permission and sensitive-API features of the Android application specifically means:

decompressing the installation file of the application to obtain the AndroidManifest.xml and classes.dex files, and parsing the files obtained by decompression to obtain the permission and sensitive-API features of the Android application.

In the method of the invention, constructing the deep learning model using the deep belief network DBN consists of two stages: an unsupervised pre-training stage and a supervised backpropagation stage.

The pre-training stage of the method of the invention proceeds as follows:

For the feature vectors x_n (0 ≤ n < N) in a training set of N samples:

1) Set n = 0.

2) Pass x_n to the visible layer V₀, and compute the hidden layer H₀ according to formula (1):

P(h_0j = 1 | V₀) = σ(W_j V₀) --- formula (1)

In the above formula, P is the probability distribution function, which is the core of weight training in the CD algorithm;

h_ij denotes the value of the j-th hidden unit in the i-th hidden layer;

V_i denotes the visible-layer vector of the i-th RBM layer, and H_i denotes the hidden-layer vector of the i-th RBM layer;

W_i denotes the weight vector of the mapping between the visible layer and the hidden layer of the i-th RBM layer;

σ is computed as follows:

σ(x) = 1 / (1 + exp(−x))

3) Compute the visible layer V₁ according to formula (2):

In formula (2), v_ij denotes the value of the j-th visible unit in the i-th visible layer, and the superscript T denotes matrix transposition;

4) Compute the hidden layer H₁, again according to formula (1):

P(h_1j = 1 | V₁) = σ(W_j V₁)

5) For all nodes j, update the weights:

λ is a constant that governs the convergence rate: the larger λ is, the faster the convergence.

If n = N − 1, terminate; otherwise set n = n + 1 and go to step 2).
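As a reading aid, steps 2) to 5) can be written out in the standard one-step contrastive divergence (CD-1) form. This is a sketch under assumptions rather than the patent's own wording: the expression for formula (2) and the weight update are the usual textbook forms, chosen to agree with formula (1) and the symbol definitions above. The index k for a visible unit and the notation (Wᵀ)_k for the k-th row of the transposed weight matrix are introduced here for illustration.

```latex
\begin{align}
P(h_{0j}=1 \mid V_0) &= \sigma(W_j V_0) \\
P(v_{1k}=1 \mid H_0) &= \sigma\big((W^{T})_k H_0\big) \\
P(h_{1j}=1 \mid V_1) &= \sigma(W_j V_1) \\
W_j &\leftarrow W_j + \lambda\left(P(h_{0j}=1 \mid V_0)\,V_0^{T} - P(h_{1j}=1 \mid V_1)\,V_1^{T}\right)
\end{align}
```

Under this reading, the first term of the update is driven by the training sample and the second by its one-step reconstruction, so the weights stop changing once the reconstruction statistics match the data statistics.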

The backpropagation stage of the method of the invention comprises the following process:

1) Randomly initialize the parameters of the BP network, read the weight matrix W of the RBM network, and initialize the number of training steps to N;

2) Compute the node value of every unit in each layer in the forward direction. The node value of unit j in layer l depends on all unit nodes in layer l−1. If neuron j is in the output layer (l = L), its node value is o_j(n) and the error is e_j(n) = d_j(n) − o_j(n), where d_j is the labeled result;

W_ij(n) is the weight associated with the j-th unit node of layer l;

For the j-th unit node of layer l: if it is an output-layer neuron, o_j(n) is the node value of unit j of the output layer, d_j(n) is the correct labeled result for the output layer, and e_j(n) = d_j(n) − o_j(n) is the error;

3) Compute δ and propagate it backward, fine-tuning the weights layer by layer from the top down; δ is a function of the error of the node values and is used to fine-tune the weights;

For output units:

4) For hidden units:

5) Fine-tune the weights:

where η is the learning rate, which governs the convergence rate: the higher the learning rate, the faster the convergence;

If n = N, terminate; otherwise set n = n + 1 and go to step 2).
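As a reading aid, the node value, the two δ terms and the weight update of steps 2) to 5) can be written in the standard backpropagation form for sigmoid units. This is a sketch under assumptions, not the patent's own formulas: the symbol y_j^{(l)}(n) for the node value of unit j in layer l is introduced here, and W_ij is read as the weight from unit i of layer l−1 to unit j of layer l.

```latex
\begin{align}
y_j^{(l)}(n) &= \sigma\Big(\sum_i W_{ij}^{(l)}(n)\, y_i^{(l-1)}(n)\Big), \qquad y_j^{(L)}(n) = o_j(n) \\
\delta_j^{(L)}(n) &= e_j(n)\, o_j(n)\,\big(1 - o_j(n)\big) \\
\delta_j^{(l)}(n) &= y_j^{(l)}(n)\,\big(1 - y_j^{(l)}(n)\big) \sum_k \delta_k^{(l+1)}(n)\, W_{jk}^{(l+1)}(n) \\
W_{ij}^{(l)}(n+1) &= W_{ij}^{(l)}(n) + \eta\, \delta_j^{(l)}(n)\, y_i^{(l-1)}(n)
\end{align}
```

The factor y(1 − y) is the derivative of the sigmoid function, which is why it appears in both δ expressions.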

The present invention analyzes the features of Android applications with a deep learning model based on a deep belief network. First, comprehensive application features are obtained from the Android application. Then the deep belief network mines high-level abstract features from the software features. Finally, a classification algorithm distinguishes normal software from malware on the basis of the high-level abstract features. Experimental results show that the deep learning model based on a deep belief network characterizes the features of Android malware more effectively, and its detection performance is significantly better than that of traditional neural network models and machine learning models.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below.

Fig. 1 is a diagram of the Android malware detection framework of the present invention.

Fig. 2 is a diagram of the DBN network structure used by the present invention.

Fig. 3 is a structural diagram of the deep learning model of the present invention.

Fig. 4 is a schematic diagram of the classification algorithm of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments.

The invention proposes a method for Android malware detection based on a deep belief network (Deep Belief Network, DBN). The Android malware detection framework of the invention is shown in Fig. 1, and its steps are as follows:

(1) First, for an Android application, 179 features such as the permissions and sensitive APIs of the application are extracted. The 179 features correspond to a 179-dimensional binary vector: if the Android application has a given feature, the corresponding dimension is 1; otherwise it is 0. The feature vector of the Android application serves as the raw input of the deep learning model. Next, a deep learning model is constructed using a DBN network, and the feature vector of the Android application is input into the deep learning model to detect the Android application.

(2) Deep learning model

Because traditional machine learning algorithms usually have shallow architectures, they cannot effectively derive a high-level representation of Android software from correlated features. The present invention therefore uses a deep learning model to mine high-level abstract features for detecting Android applications. The feature vector of an Android application is input into the deep learning model, whose main role is to produce a weighted representation of the feature vector: put simply, it assesses the weight of each dimension of the feature vector and performs no classification itself. The model lifts the input feature vector to a higher level of abstraction, and the subsequent classification algorithm is responsible for classifying the Android application, which makes classification more accurate and detection more effective.

The deep learning model is based on a deep belief network and is divided into three parts: initialization by a greedy algorithm, pre-training by the contrastive divergence algorithm, and fine-tuning by a backpropagation network. These three parts are described in detail later with reference to the drawings.

(3) Classification algorithm

The present invention uses the support vector machine (Support Vector Machine, SVM) algorithm as the classification algorithm of the model. The feature vector output by the deep learning model is input into the SVM classification module, which classifies the Android application and distinguishes malware from normal software.

In the present invention, permission features and sensitive-API features describe Android applications more comprehensively, and learning the deep structure of these features with a DBN network allows Android malware to be characterized and detected more effectively.

The basic framework of Android malware detection according to the invention has been described above with reference to Fig. 1. Each step of the detection method is described in further detail below with reference to Fig. 2, Fig. 3 and Fig. 4.

1. Feature extraction

The features of an Android application fall mainly into static features and dynamic features. Static features are extracted from the software under analysis without executing it, by means such as decompilation, and mainly comprise permission information and information on the sensitive APIs called. Dynamic features reflect application behavior and are obtained while the Android application is executing. Dynamic feature extraction is slow and resource-intensive, whereas static features require few system resources, can be extracted quickly, and are suited to large-scale feature extraction. Features are therefore extracted here by static analysis, and the feature set is built from static features.

To obtain the features of an Android application, its installation file (.apk file) is decompressed, yielding two important files: AndroidManifest.xml and classes.dex. AndroidManifest.xml is the system manifest file and defines information such as the permissions and components of the application. Parsing AndroidManifest.xml yields the permissions requested by the Android application; for example, android.permission.CAMERA is the permission an Android application requests in order to use the camera. Parsing AndroidManifest.xml files yielded 120 Android application permissions in total. The classes.dex file is decompiled and parsed with the baksmali tool to learn which API interfaces are called; for example, chmod is a sensitive API that changes user permissions. Parsing classes.dex files yielded 59 sensitive APIs in total. The extracted permission information and called sensitive-API information are shown in Table 1.

Table 1: Constructed features and their descriptions

Referring to Table 1, 179 features such as the permissions and sensitive APIs of an Android application are extracted. The 179 features correspond to a 179-dimensional binary vector: if the Android application has a given feature, the corresponding dimension is 1; otherwise it is 0. The feature vector of the Android application serves as the raw input of the deep learning model.
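The encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the manifest is assumed to have already been decoded to plain-text XML (inside an .apk it is stored in a binary XML format), the set of sensitive APIs is assumed to come from a separate scan of the baksmali output, and the four-entry `FEATURES` list is a hypothetical stand-in for the 179 features of Table 1.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in AndroidManifest.xml
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_permissions(manifest_xml: str) -> set[str]:
    """Return the permission names requested via <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    return {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
        if elem.get(ANDROID_NS + "name")
    }

def to_feature_vector(present: set[str], feature_list: list[str]) -> list[int]:
    """Binary vector: 1 if the app has the feature, else 0, in feature_list order."""
    return [1 if name in present else 0 for name in feature_list]

# Hypothetical 4-entry feature list (the patent uses 120 permissions + 59 APIs).
FEATURES = [
    "android.permission.CAMERA",
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "chmod",
]

# Hypothetical decoded manifest, for illustration only.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.CAMERA"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

permissions = extract_permissions(MANIFEST)
sensitive_apis = {"chmod"}  # assumed result of scanning the baksmali output
vector = to_feature_vector(permissions | sensitive_apis, FEATURES)
print(vector)  # [1, 0, 1, 1]
```

Keeping the feature list in a fixed order matters: every application must map the same feature to the same dimension so that the vectors are comparable across samples.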

2. Deep belief network

In current deep learning theory, the deep belief network (Deep Belief Network, DBN) is a widely used deep learning framework. A deep belief network has two parts, as shown in Fig. 2: the bottom part is a stack of restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) units, and the top part is a supervised backpropagation (Back Propagation, BP) network layer used to fine-tune the overall architecture. The present invention applies the DBN network to Android malware detection. Compared with traditional deep learning frameworks (recurrent neural networks, convolutional neural networks, etc.), the advantage of the invention is that it learns the feature vectors of Android applications faster and performs better; the invention therefore detects Android malware with a DBN-based deep learning framework.

As shown in Fig. 2, V denotes the node-value vector of a visible layer and H denotes the node-value vector of a hidden layer. In the stacked RBMs, except at the bottom and top layers, the hidden layer of each RBM is the visible layer of the RBM above it. W is a weight matrix that represents the mapping between a visible layer and a hidden layer.

The DBN-based deep learning model of the invention comprises three parts. First, the RBMs are initialized with a greedy algorithm, which initializes the parameters of the weight matrix W. The purpose of this initialization is to improve the efficiency of the subsequent contrastive divergence (Contrastive Divergence, CD) algorithm, because completely random weight-matrix parameters make the CD algorithm too inefficient and too costly to compute. Then, the feature vectors of the Android applications obtained in the first step, that is, the unlabeled app samples, are input as the initial feature vector V0 of the bottom RBM; this initial feature vector is the node-value vector of the bottom RBM's visible layer. Each RBM layer is trained by the contrastive divergence (CD) algorithm. During the upward unsupervised transformation, concrete feature vectors that are hard to classify are converted into abstract combined feature vectors that are easy to classify, and the weight matrix Wi of each layer is adjusted so that the feature-vector mapping of that layer reaches a local optimum. Finally, the BP network fine-tunes the DBN network in a supervised manner so that the parameters reach a global optimum, and the resulting easily classified feature vectors are passed to the classification module. The main task of the DBN network is to train the weighted representation of the feature vectors; it does not classify Android applications. A classification algorithm is therefore still needed, and the support vector machine (Support Vector Machine, SVM) algorithm is used here as the classification algorithm of the model. The structure is shown in Fig. 3.

As shown in Fig. 2, the construction of the deep learning model consists of two stages: an unsupervised pre-training stage and a supervised backpropagation stage. In the pre-training stage, several stacked RBM layers form the basic framework of the DBN network. A greedy algorithm is used between each pair of adjacent RBM layers to initialize the parameters of the weight matrix W, and each RBM layer is then trained by the contrastive divergence (Contrastive Divergence, CD) algorithm to further train the weight-matrix parameters of the RBM. In the backpropagation stage, the BP module fine-tunes the DBN network in a supervised manner using labeled samples. Finally, the feature vector output by the deep belief network enters the SVM classification module, which classifies the sample according to this feature vector. The structure is shown in Fig. 3.

The algorithms used in the deep belief network are described in detail below.

2.1 Contrastive divergence algorithm

Because the contrastive divergence (Contrastive Divergence, CD) algorithm is accurate and fast to compute, it is used as the learning algorithm to train the parameters of the weight matrix W so that the feature-vector mapping of each layer reaches a local optimum. The CD algorithm uses the "difference" between two probability distributions to update the weights iteratively until convergence.

The RBM network self-training process based on the CD algorithm is as follows:

For the feature vectors x_n (0 ≤ n < N) in a training set of N samples:

1) Set n = 0.

2) Pass x_n to the visible layer V₀, and compute the hidden layer H₀ according to formula (1):

P(h_0j = 1 | V₀) = σ(W_j V₀) --- formula (1)

h_ij denotes the value of the j-th hidden unit in the i-th hidden layer;

σ is computed as follows:

σ(x) = 1 / (1 + exp(−x))

3) Compute the visible layer V₁ according to formula (2):

4) Compute the hidden layer H₁, again according to formula (1):

P(h_1j = 1 | V₁) = σ(W_j V₁)

5) For all nodes j, update the weights:

If n = N − 1, terminate; otherwise set n = n + 1 and go to step 2).
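The loop above can be sketched in plain Python. This is a hedged illustration and not the patent's implementation: the reconstruction step and the weight update follow the standard one-step contrastive divergence (CD-1) rule, which is assumed here because formula (2) and the update formula are not reproduced in the text. Bias terms are omitted to match formula (1), and the names `hidden_probs`, `visible_probs` and `cd1_epoch` are hypothetical.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw binary unit states from their activation probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(W, v):
    """Formula (1): P(h_j = 1 | v) = sigma(W_j . v), with W[j] the j-th row."""
    return [sigmoid(sum(w * x for w, x in zip(row, v))) for row in W]

def visible_probs(W, h):
    """Assumed form of formula (2): P(v_k = 1 | h) = sigma((W^T)_k . h)."""
    return [
        sigmoid(sum(W[j][k] * h[j] for j in range(len(W))))
        for k in range(len(W[0]))
    ]

def cd1_epoch(W, samples, lam, rng):
    """One pass over the training set, updating W in place (steps 1 to 5)."""
    for v0 in samples:                                  # steps 1) and 2)
        p_h0 = hidden_probs(W, v0)
        h0 = sample(p_h0, rng)
        v1 = sample(visible_probs(W, h0), rng)          # step 3)
        p_h1 = hidden_probs(W, v1)                      # step 4)
        for j in range(len(W)):                         # step 5)
            for k in range(len(W[0])):
                W[j][k] += lam * (p_h0[j] * v0[k] - p_h1[j] * v1[k])
    return W

# Toy usage: 4 visible units, 3 hidden units, two binary feature vectors.
rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]
data = [[1, 0, 1, 1], [0, 1, 0, 0]]
for _ in range(10):
    cd1_epoch(W, data, lam=0.1, rng=rng)
```

In a stacked DBN the same routine would be applied layer by layer, with the hidden probabilities of one trained RBM serving as the visible input of the next.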

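2.2 Backpropagation network

As shown in Fig. 3, the backpropagation (Back Propagation, BP) network uses supervised learning: it compares its results against labeled applications (known to be malicious or normal software) and fine-tunes the entire DBN network. The BP network training method is used, and the Sigmoid function is chosen as the node activation function.

The BP network training process is as follows:

1) Randomly initialize the parameters of the BP network, read the weight matrix W of the RBM network, and initialize the number of training steps to N;

2) Compute the node value of every unit in each layer in the forward direction. The node value of unit j in layer l depends on all unit nodes in layer l−1. If neuron j is in the output layer (l = L), its node value is o_j(n) and the error is e_j(n) = d_j(n) − o_j(n), where d_j is the labeled result;

W_ij(n) is the weight associated with the j-th unit node of layer l;

3) Compute δ and propagate it backward, fine-tuning the weights layer by layer from the top down; δ is a function of the error of the node values and is used to fine-tune the weights. For final output units:

4) For hidden units:

5) Fine-tune the weights:

If n = N, terminate; otherwise set n = n + 1 and go to step 2).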
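A single fine-tuning step of this kind can be sketched as follows. The sketch assumes the standard delta rule for sigmoid units (output δ = e·o·(1 − o); hidden δ = y·(1 − y)·Σ δ·w), since the patent's own δ and update formulas are not reproduced in the text. Bias terms are omitted, `W[j][i]` is taken to be the weight from unit i of the lower layer to unit j of the upper layer, and the function names and toy weights are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x):
    """Step 2): return every layer's node values, starting with the input."""
    outputs = [x]
    for W in weights:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * p for w, p in zip(row, prev))) for row in W])
    return outputs

def backprop_step(weights, x, d, eta):
    """Steps 2) to 5) for one labeled sample; returns the squared error before the update."""
    outputs = forward(weights, x)
    o = outputs[-1]
    # Step 3), output units: delta_j = e_j * o_j * (1 - o_j), with e_j = d_j - o_j
    delta = [(dj - oj) * oj * (1.0 - oj) for dj, oj in zip(d, o)]
    for l in range(len(weights) - 1, -1, -1):
        W, y_prev = weights[l], outputs[l]
        if l > 0:
            # Step 4), hidden units: computed with the weights before they change
            next_delta = [
                y_prev[i] * (1.0 - y_prev[i])
                * sum(delta[j] * W[j][i] for j in range(len(W)))
                for i in range(len(y_prev))
            ]
        # Step 5): fine-tune the weights of this layer
        for j, row in enumerate(W):
            for i in range(len(row)):
                row[i] += eta * delta[j] * y_prev[i]
        if l > 0:
            delta = next_delta
    return sum((dj - oj) ** 2 for dj, oj in zip(d, o))

# Toy usage: 3 inputs -> 2 hidden units -> 1 output, one sample with label 1.
weights = [
    [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1]],  # would be read from the pre-trained RBM
    [[0.3, -0.3]],
]
errors = [backprop_step(weights, [1, 0, 1], [1], eta=0.5) for _ in range(200)]
```

In the toy run the squared error shrinks as the output moves toward the label, which is the behavior the fine-tuning stage relies on.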

3. Classification algorithm

Based on the high-level feature vector output by the deep belief network, a classification algorithm classifies the Android application; here the support vector machine (Support Vector Machine, SVM) algorithm is used as the classification algorithm of the model. The SVM algorithm has two stages: training and testing. Given the normal and malicious samples of the training stage, the SVM finds a hyperplane specified by a normal vector ω and a perpendicular distance b. This hyperplane separates the two classes with the maximum margin γ, where Positive denotes normal samples and Negative denotes malicious samples, as shown in Fig. 4.

In the test stage, the SVM prediction model divides the test set into two classes. The decision function f of the linear SVM is given by formula (3):

Here x denotes the feature vector output by the deep belief network. When f(x) > 0 the sample is judged to be a normal sample; otherwise it is judged to be a malicious sample.
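The decision rule can be sketched as below, assuming formula (3) has the usual linear-SVM form f(x) = ω·x + b. That form is an assumption based on the hyperplane description above, not a quotation from the patent. The values of `omega` and `b` are hypothetical; in practice they would come from SVM training on the DBN output vectors.

```python
def decision(omega, b, x):
    """Assumed form of formula (3): f(x) = omega . x + b."""
    return sum(w * xi for w, xi in zip(omega, x)) + b

def classify(omega, b, x):
    """f(x) > 0 means a normal sample; anything else is treated as malicious."""
    return "normal" if decision(omega, b, x) > 0 else "malicious"

# Hypothetical hyperplane parameters and two 2-dimensional feature vectors.
omega, b = [1.0, -1.0], -0.1
label_a = classify(omega, b, [0.9, 0.2])  # f = 0.9 - 0.2 - 0.1 > 0
label_b = classify(omega, b, [0.1, 0.8])  # f = 0.1 - 0.8 - 0.1 < 0
```

Note that a sample lying exactly on the hyperplane (f(x) = 0) falls into the "otherwise" branch and is treated as malicious, following the rule as stated.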

4. Effect of the invention

4.1 Data set

A total of 10,000 applications were downloaded from the Google Play Store as the normal sample set. The malicious sample set contains 3,938 samples from two sources: 1,260 from the Genome Project (http://www.malgenomeproject.org/) and 2,678 from VirusTotal (https://www.virustotal.com/). From the sample sets, 700 normal samples and 700 malicious samples are selected at random and thoroughly mixed to form one group of data; 5 groups of experimental data are formed in total. From the 5 groups, 2 groups are chosen as the training set and the test set, respectively. The experimental environment is shown in Table 2.
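The grouping procedure can be sketched as follows. The sketch rests on assumptions the text does not settle: each group is drawn independently from the full sample sets, so groups may overlap, and label 0 stands for normal while label 1 stands for malicious. The sample identifiers and the function name `make_groups` are hypothetical placeholders.

```python
import random

def make_groups(normal, malicious, per_class=700, n_groups=5, seed=0):
    """Build groups of per_class normal + per_class malicious samples, shuffled."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        group = [(s, 0) for s in rng.sample(normal, per_class)]      # 0 = normal
        group += [(s, 1) for s in rng.sample(malicious, per_class)]  # 1 = malicious
        rng.shuffle(group)  # "thoroughly mixed"
        groups.append(group)
    return groups

# Placeholder identifiers standing in for the 10,000 + 3,938 collected apps.
normal_set = [f"normal_{i}" for i in range(10000)]
malicious_set = [f"malware_{i}" for i in range(3938)]
groups = make_groups(normal_set, malicious_set)
train_group, test_group = groups[0], groups[1]
```

Fixing the seed makes the split reproducible, which helps when the same groups are reused to compare several classifiers.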

Table 2: Experimental environment

4.2 Comparison with other traditional machine learning algorithms

The detection results of the DBN+SVM deep learning model of the present invention are compared with those of traditional machine learning models; the experimental results are shown in Table 3. In the experiments, three metrics are used to evaluate the Android malware detection results: precision (Precision), recall (Recall), and accuracy (Accuracy).
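The three metrics can be computed from predicted and true labels as sketched below. The choice of which class counts as "positive" is left as a parameter, because the text does not state it for the evaluation; treating malware (label 1) as positive here is an assumption, and the label lists are made-up illustration data, not experimental results.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Made-up labels: 1 = malicious, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, accuracy = evaluate(y_true, y_pred)
print(precision, recall, accuracy)  # 0.75 0.75 0.75
```

Precision penalizes normal apps flagged as malware, recall penalizes missed malware, and accuracy counts all correct decisions, so the three numbers can diverge on unbalanced data even though they coincide in this toy example.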

Table 3: Detection results of different machine learning algorithms

For the traditional machine learning algorithms (Naive Bayes, Logistic Regression, KNN, SVM), several common kernel functions such as the sigmoid kernel and the linear kernel were tested, and the data with the best detection result were taken as the experimental result of each traditional algorithm. As Table 3 shows, on the same test set the accuracy of the DBN+SVM algorithm of the invention is 3.35% higher than that of SVM, 11.83% higher than that of Naive Bayes, 12.26% higher than that of KNN, and 14.38% higher than that of Logistic Regression. The DBN-based deep learning model of the invention is thus clearly better than traditional neural network models and machine learning models.

Claims

1. A method for Android malware detection based on a deep belief network, characterized in that the method comprises the following steps:

extracting permission and sensitive-API features of an Android application;

constructing a deep learning model using a deep belief network (DBN), and processing the extracted features with the deep learning model to obtain samples having high-level abstract features;

using a classification algorithm to classify the samples characterized by the deep learning model, distinguishing malware from normal software.

2. The method according to claim 1, characterized in that extracting the permission and sensitive-API features of the Android application specifically comprises the following step:

decompressing the installation file of the application to obtain the AndroidManifest.xml and classes.dex files, and parsing these files to obtain the permission and sensitive-API features of the Android application.

3. The method according to claim 1, wherein constructing the deep learning model using the deep belief network DBN consists of two stages: an unsupervised pre-training stage and a supervised backpropagation stage.

4. The method according to claim 3, characterized in that the pre-training stage proceeds as follows:

For the feature vectors x_n (0 ≤ n < N) in a training set of N samples:

1) Set n = 0.

2) Pass x_n to the visible layer V₀, and compute the hidden layer H₀ according to formula (1):

P(h_0j = 1 | V₀) = σ(W_j V₀) --- formula (1)

h_ij denotes the value of the j-th hidden unit in the i-th hidden layer;

V_i denotes the visible-layer vector of the i-th RBM layer, and H_i denotes the hidden-layer vector of the i-th RBM layer;

σ is computed as follows:

σ(x) = 1 / (1 + exp(−x))

3) Compute the visible layer V₁ according to formula (2):

4) Compute the hidden layer H₁, again according to formula (1):

P(h_1j = 1 | V₁) = σ(W_j V₁)

5) For all nodes j, update the weights:

If n = N − 1, terminate; otherwise set n = n + 1 and go to step 2).

5. The method according to claim 3, characterized in that the backpropagation (BP) stage comprises the following process:

W_ij(n) is the weight associated with the j-th unit node of layer l;

For the j-th unit node of layer l: if it is an output-layer neuron, o_j(n) is the node value of unit j of the output layer, d_j(n) is the correct labeled result for the output layer, and e_j(n) = d_j(n) − o_j(n) is the error;

3) Compute δ and propagate it backward, fine-tuning the weights layer by layer from the top down; δ is a function of the error of the node values and is used to fine-tune the weights;

For output units:

4) For hidden units:

5) Fine-tune the weights:

where η is the learning rate, which governs the convergence rate: the higher the learning rate, the faster the convergence. If n = N, terminate; otherwise set n = n + 1 and go to step 2).