CN105550583A

CN105550583A - Method for detecting malicious applications on the Android platform based on random forest classification

Info

Publication number: CN105550583A
Application number: CN201510969901.6A
Authority: CN
Inventors: 桂盛霖; 杨漫游; 王沐; 李多航
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2015-12-22
Filing date: 2015-12-22
Publication date: 2016-05-04
Anticipated expiration: 2035-12-22
Also published as: CN105550583B

Abstract

The invention discloses a method for detecting malicious applications on the Android platform based on random forest classification. The method comprises the following steps: obtaining APP samples, including malicious and benign APP samples; obtaining the list of all permissions an APP can request and the API information, to obtain a permission set and an API set; extracting static features of the APP samples, the static features including requested permissions and called APIs; constructing a sample library based on the static features of the APP samples, the permission set and the API set, the sample library comprising entries for APP identifiers, type identifiers distinguishing malicious from benign APP samples, request identifiers for the permissions in the permission set, and call identifiers for the APIs in the API set; constructing the decision trees of a random forest from the sample library to obtain a random forest classifier; and detecting to-be-detected APPs with the random forest classifier. The method detects malicious APPs efficiently and improves the security of the Android platform.

Description

Method for detecting malicious applications on the Android platform based on random forest classification

Technical field

The present invention relates to the technical field of mobile terminal software security, and in particular to a method that applies a classification algorithm from the field of machine learning to the detection of Android malicious applications.

Background art

In recent years, with the development of smart terminals, and smartphones in particular, daily life has become increasingly convenient. Today's smartphones can even perform many functions that previously required a PC, which has attracted still more users. However, the ever-growing smartphone user base has also drawn the attention of many malicious application developers. As the user base grows, the number of malicious applications keeps growing as well, and malicious applications have become one of the major threats to mobile phone security and user privacy. Under these circumstances, a method that can accurately detect malicious applications in batches is clearly needed.

Patent application publication CN104123500A describes a deep-learning-based scheme for detecting malicious applications on the Android platform. That scheme extracts features from the original installation file of an Android application and from the application at run time, and then builds a detection model by deep learning. Because detection must be performed at run time, its detection efficiency is relatively low and its results are unsatisfactory.

Summary of the invention

The object of the present invention is, in view of the above problems, to provide a method for detecting Android malicious applications that uses random forest classification to distinguish malicious applications from benign applications on the Android platform, thereby protecting the interests of users.

The Android platform malicious application detection method based on random forest classification according to the present invention comprises the following steps:

Obtaining Android application (hereinafter "APP") samples, including benign application samples and malicious samples;

Obtaining all permissions that an APP can request and all APIs that an APP can call, to obtain a permission set and an API set;

Extracting the static features of each APP sample, including the permissions each sample requests and the APIs it calls;

Building a sample library based on the static features of each APP sample, the permission set and the API set, where the entries of the sample library comprise: an APP identifier, a type identifier distinguishing benign from malicious, a request identifier for each permission in the permission set, and a call identifier for each API in the API set;

Building each decision tree of a random forest from the sample library to obtain a random forest classifier:

Sampling from the sample library to obtain different groups of training data sets; taking the APPs in one group of training data as the APPs under the root node of a decision tree; and splitting each node of the decision tree to obtain one decision tree:

Randomly selecting m static features available at the current node (where m is a preset value used when computing the best split, and is smaller than the total number of static features in the sample library), and computing the information gain of each selected static feature; taking the static feature with the largest information gain as the split attribute of the current node and splitting the node on it, that is, assigning the APPs that have the static feature corresponding to the split attribute to one child node and the APPs that do not have it to another child node, until the number of APPs under the current node is 1 or the split attributes are used up; the class of each leaf node is determined by the type (benign or malicious) of the APPs under it, and if it contains APPs of both types, by the type with the larger number of APPs;

Extracting the static features of an APP to be detected, classifying it with the constructed model, and thereby detecting whether the APP to be detected is a malicious application.

By adopting the above technical solution, the present invention has the following beneficial effect: by extracting the static features of an APP (permission features and API call features) and combining them with the random forest classification method from the field of machine learning, the invention detects malicious APPs efficiently and improves the security of the Android platform.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is an example, in the embodiment, of the file format used to store the feature information;

Fig. 3 is a schematic diagram of splitting a node in the embodiment;

Fig. 4 is a schematic diagram of one decision tree of the random forest in the embodiment.

Detailed description of the embodiment

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiment and the accompanying drawings.

To solve the problem of accurately determining, in batches, whether APPs are malicious or benign, the present invention provides an Android platform malicious application detection method based on random forest classification. Referring to Fig. 1, the method mainly comprises the following six steps:

S1: obtain APP samples, including malicious and benign APP samples;

S2: obtain the list of all permissions an APP can request and the API information, to obtain a permission set and an API set;

S3: extract the static features of the APP samples, including the permissions requested and the APIs called;

S4: build a sample library based on the static features of each APP sample, the permission set and the API set, where the entries of the sample library comprise: an APP identifier, a type identifier distinguishing benign from malicious, a request identifier for each permission in the permission set, and a call identifier for each API in the API set;

S5: build a model from the sample library using the random forest classification algorithm, that is, build each decision tree of the random forest from the sample library to obtain a random forest classifier;

S6: detect the APP to be detected with the random forest classifier built in S5.

The implementation of each step is described below for a better understanding of the present invention:

S1: Obtain APP samples: obtain APP sample installation files from various channels, divide the obtained samples into benign samples and malicious samples, and store them.

S2: Obtain the list of all permissions an APP can request and the API information: Google's Android developer website (http://developer.android.com/reference/android/Manifest.permission.html) lists the permissions that an Android APP can request. All of the listed permissions are saved locally as a list, which serves as the universal set, that is, the permission set.

To obtain the list of all APIs called by APPs, the operation of step S302 is performed on all samples to obtain all API information called by all samples, and duplicates are removed. Preferably, APIs that prior knowledge indicates are essentially irrelevant to distinguishing benign from malicious may first be filtered out, after which all remaining API information is saved locally as a list, which serves as the universal set, that is, the API set.

S3: Extract the static features of the APP samples:

S301: Extract the permission information of each APP sample:

This step obtains the permission information of an Android application mainly by using the open-source library androguard (https://github.com/androguard/androguard). The specific steps are as follows:

(1) Import the androguard library.

(2) Pass the path of the APP's APK (Android Package) installation file as a parameter to the APK class of the androguard library and call the class's get_permissions method to obtain all permissions requested by the application.

(3) Filter the obtained permission information against the permission set obtained in step S2 to obtain the list of Android permissions used by the application.
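Steps (1) to (3) can be sketched in Python roughly as follows. This is a minimal sketch, not the patent's own code: the function names `filter_permissions` and `extract_permissions` are hypothetical, and the two import paths for the `APK` class are assumptions about how different androguard releases lay out their modules. Only `APK(...)` and `get_permissions()` are named in the text above.

```python
def filter_permissions(requested, permission_set):
    """Step (3): keep only the requested permissions that are in the permission set."""
    allowed = set(permission_set)
    # Deduplicate and sort so that the result is deterministic.
    return sorted(p for p in set(requested) if p in allowed)


def extract_permissions(apk_path, permission_set):
    """Steps (1)-(2): read the permissions requested by an APK, then filter them.

    Requires the androguard package; the module path is an assumption and
    differs between releases, so both layouts are tried.
    """
    try:
        from androguard.core.apk import APK  # assumed layout of newer releases
    except ImportError:
        from androguard.core.bytecodes.apk import APK  # assumed layout of older releases
    return filter_permissions(APK(apk_path).get_permissions(), permission_set)
```

The filtering is kept in its own function so that it can be exercised without an APK file or an androguard installation.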

S302: Extract the API call information of each APP sample:

This step mainly decompiles the APP's APK installation file and then matches API call information with regular expressions. The specific steps are as follows:

(1) Use the unzip tool to decompress the application's APK installation file and obtain the "classes.dex" file it contains.

(2) Use the dex2jar tool to convert "classes.dex" into the jar file "classes.jar".

(3) Use the unzip tool to decompress "classes.jar" and obtain the corresponding class files.

(4) Use the javap tool to decompile the class files.

(5) Use regular expressions to match the decompiled output and extract the API call information.

(6) Filter the obtained API information against the API set provided by step S2 to obtain the Android API list of the current APP sample.
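Steps (5) and (6) might look like the following sketch, which assumes that the decompiled text comes from `javap -c` and that APIs in the API set are written as `package.Class.method`. The regular expression, the name `extract_api_calls` and the API naming format are illustrative assumptions; the patent does not give the expression it uses.

```python
import re

# Matches invoke instructions as printed by `javap -c`, for example:
#   12: invokevirtual #7   // Method android/telephony/SmsManager.sendTextMessage:(...)V
# Group 1 is the class (slash-separated), group 2 the method name.
INVOKE_RE = re.compile(
    r'invoke\w+\s+#\d+(?:,\s*\d+)?\s*//\s*(?:Interface)?Method\s+'
    r'([\w/$]+)\."?([\w$<>]+)"?:'
)


def extract_api_calls(javap_text, api_set):
    """Return the sorted APIs from api_set that are invoked in the javap output."""
    found = set()
    for cls, method in INVOKE_RE.findall(javap_text):
        # Convert android/telephony/SmsManager to android.telephony.SmsManager.
        found.add(cls.replace("/", ".") + "." + method)
    # Step (6): keep only the calls that belong to the API set.
    return sorted(found & set(api_set))
```

Calls to methods of the same class are printed by javap without a class prefix and are therefore not matched by this expression, which is acceptable here because only framework APIs in the API set are of interest.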

The above is only one example of how the static features of an APP sample may be extracted. The present invention is not limited to it, and those skilled in the art may use other conventional means to obtain the static features of each APP sample.

S4: Build the sample library based on the static features of each APP sample, the permission set and the API set:

The extracted static features of each APP sample are stored in the following format:

(1) Process the obtained permission information and save it as a file with the suffix .csv. Each row of the file represents one APP. The first column of each row is the type identifier distinguishing benign from malicious: 0 indicates that the APP of that row is benign, and 1 indicates that it is malicious. Each subsequent column represents one Android permission: a value of 0 means the APP does not request the permission, and a value of 1 means the APP requests (uses) the permission. The numbers in a row are separated by commas.

(2) Process the obtained API call information and save it as a file with the suffix .csv. Each row of the file represents one APP. The first column of each row is the type identifier distinguishing benign from malicious: 0 indicates that the APP of that row is benign, and 1 indicates that it is malicious. Each subsequent column represents one API: a value of 0 means the APP does not call the API, and a value of 1 means the APP calls it. The numbers in a row are separated by commas. Note that the APPs must be saved in the same order as in the permission file.

(3) Merge the two matrices into one. As noted above, the matrices are generated so that each row corresponds to the same APP in both, so the first column of one of the matrices can be removed and the two matrices spliced directly, row by row, into a new matrix. The resulting file content is shown in Fig. 2, where the first column is the APP type (0 for benign, 1 for malicious), the second and third columns are API call information corresponding to API1 and API2, and the fourth column is APP permission information corresponding to perm1.
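The merged layout described in (3) can be produced directly, as in the sketch below, which writes one row per APP with the type identifier first, then one 0/1 column per API, then one 0/1 column per permission. The function names `build_rows` and `write_csv` and the in-memory representation of a sample as a `(label, called_apis, requested_permissions)` tuple are assumptions made for illustration.

```python
import csv
import io


def build_rows(samples, api_set, permission_set):
    """Turn (label, called_apis, requested_permissions) tuples into 0/1 rows.

    Column order follows the Fig. 2 description: type identifier, APIs, permissions.
    """
    rows = []
    for label, apis, perms in samples:
        api_cols = [1 if a in apis else 0 for a in api_set]
        perm_cols = [1 if p in perms else 0 for p in permission_set]
        rows.append([label] + api_cols + perm_cols)
    return rows


def write_csv(rows, fileobj):
    """Write the rows as comma-separated values, one APP per line."""
    csv.writer(fileobj, lineterminator="\n").writerows(rows)


# Two hypothetical samples: a malicious APP calling API1 and API2,
# and a benign APP that only requests perm1.
samples = [(1, {"API1", "API2"}, set()), (0, set(), {"perm1"})]
rows = build_rows(samples, ["API1", "API2"], ["perm1"])
buffer = io.StringIO()
write_csv(rows, buffer)
```

Building both feature groups from the same sample list keeps the rows aligned by construction, which is the property that step (2) asks for when it requires the same APP order in both files.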

That is, based on the extracted static features of each APP sample and the obtained permission set and API set, steps (1) to (3) above build the sample library used for random forest modeling. The entries of the sample library comprise: an APP identifier, a type identifier distinguishing benign from malicious, a request identifier for each permission in the permission set, and a call identifier for each API in the API set.

S5: Build a model from the sample library using the random forest classification algorithm:

Each decision tree of the random forest is built as follows: from the N training cases (N being the number of training samples), sample with replacement N times to form one training data set D (bootstrap sampling). The training data set D is used to train the tree to be built.
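Bootstrap sampling as described here takes only a few lines. The sketch below uses the standard-library `random.Random.choices`, which draws with replacement; the function name `bootstrap_sample` and the seed are arbitrary choices for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement to form one training data set D."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(0)           # fixed seed so that the sketch is reproducible
library = list(range(15))        # stand-in for N = 15 sample-library rows
training_set = bootstrap_sample(library, rng)
```

Because the draws are with replacement, some rows typically appear more than once in D and others not at all, which is what makes the trees of the forest differ from one another.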

Preset a number m that determines how many variables may be used when making a decision at a node, where m should be smaller than M (the total number of features; in the present invention, the total number of static features).

At each node, randomly select m of the variables available at that node and compute the best split based on these m variables, until the node satisfies the split stopping condition, thereby obtaining each decision tree.

Information gain: the information gain $g(D, A)$ of feature A with respect to training data set D is defined as the difference between the empirical entropy $H(D)$ of D and the empirical conditional entropy $H(D \mid A)$ of D given feature A, that is: $g(D, A) = H(D) - H(D \mid A)$.

Information entropy: a measure of the uncertainty of a random variable. Let X be a random variable taking finitely many values, with probability distribution $P(X = x_i) = p_i$, $i = 1, 2, \ldots, n$, where n is the number of values X can take. The entropy of X is defined as:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

When the logarithm is taken to base 2, the unit of entropy is the bit.

Conditional entropy: let $(X, Y)$ be a pair of random variables with joint probability distribution $P(X = x_i, Y = y_j) = p_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, where n is the number of values the random variables can take. The conditional entropy $H(Y \mid X)$ expresses the uncertainty of Y when X is known. It is defined as the expectation, over X, of the entropy of the conditional probability distribution of Y given X:

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i)$$

where $p_i = P(X = x_i)$, $i = 1, 2, \ldots, n$.

When the probabilities in the entropy and conditional entropy are estimated from data (in particular by maximum likelihood estimation), the corresponding entropy and conditional entropy are called the empirical entropy and the empirical conditional entropy.
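The three quantities defined above translate directly into code when the probabilities are estimated by relative frequencies. The following is a minimal sketch using base-2 logarithms; the function names are illustrative, `feature` is assumed to be a list of 0/1 values for one static feature, and `labels` the matching list of type identifiers.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(D) in bits, with probabilities taken as frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A): weighted entropy of each feature value."""
    n = len(labels)
    total = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        total += len(subset) / n * entropy(subset)
    return total


def information_gain(feature, labels):
    """Information gain g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature, labels)
```

A feature that separates the two classes perfectly has a gain equal to H(D), and a feature that is independent of the class has a gain of zero.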

On this basis, the present invention builds a model with the random forest classification algorithm. The process of building each decision tree of the random forest from the sample library is as follows:

Let N denote the number of training samples and M the total number of static features. The example uses N = 15 and M = 3. The number m of variables that may be used when making a decision at a node is set; based on the sample library shown in Fig. 2, m is 3 in this embodiment.

From the N training cases, sample with replacement N times to form one training data set D. The APPs corresponding to training data set D are taken as the APPs under the root node of the decision tree. Randomly select m static features available at the current node and compute the information gain of each. Take the static feature with the largest information gain as the split attribute of the current node and split the node on it, that is, assign the APPs that have the static feature corresponding to the split attribute to one child node and the APPs that do not have it to another child node (see Fig. 3), until the number of APPs under the current node is 1 or the split attributes are used up. One such split is described below:

Calculate the empirical entropy:

$$H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} = 0.971$$

Calculate the information gain:

For the feature API1: when API1 is present (value 1), all 5 samples are malicious. When API1 is absent (value 0), 4 of the 10 samples are malicious.

The conditional entropy for API1 is therefore:

$$H(D \mid A_{api1}) = \frac{5}{15}\left(-\frac{5}{5}\log_2\frac{5}{5}\right) + \frac{10}{15}\left(-\frac{4}{10}\log_2\frac{4}{10} - \frac{6}{10}\log_2\frac{6}{10}\right) = 0.647$$

The information gain for API1 is thus: $g(D, A_{api1}) = H(D) - H(D \mid A_{api1}) = 0.324$.

For the feature API2: when API2 is present (value 1), all 6 samples are malicious. When API2 is absent (value 0), 3 of the 9 samples are malicious.

The conditional entropy for API2 is therefore:

$$H(D \mid A_{api2}) = \frac{6}{15}\left(-\frac{6}{6}\log_2\frac{6}{6}\right) + \frac{9}{15}\left(-\frac{3}{9}\log_2\frac{3}{9} - \frac{6}{9}\log_2\frac{6}{9}\right) = 0.551$$

The information gain for API2 is thus: $g(D, A_{api2}) = H(D) - H(D \mid A_{api2}) = 0.420$.

For the feature permission 1 (perm1): when permission 1 is present (value 1), 1 of the 4 samples is malicious. When permission 1 is absent (value 0), 8 of the 11 samples are malicious.

The conditional entropy $H(D \mid A_{perm1})$ for perm1 is therefore:

$$H(D \mid A_{perm1}) = \frac{4}{15}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right) + \frac{11}{15}\left(-\frac{3}{11}\log_2\frac{3}{11} - \frac{8}{11}\log_2\frac{8}{11}\right) = 0.836$$

The information gain for perm1 is thus: $g(D, A_{perm1}) = H(D) - H(D \mid A_{perm1}) = 0.135$.

The static feature with the largest information gain, API2, is therefore selected as the attribute for splitting the node. The splitting process is shown in Fig. 3.
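The arithmetic of this worked example can be checked from the class counts alone. The sketch below is only a verification aid, with counts written as (malicious, benign) pairs taken from the text; the helper names `h` and `gain` are arbitrary.

```python
import math


def h(counts):
    """Entropy in bits of a node, given its class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def gain(parent, branches):
    """Information gain of a split, given class counts of the parent and each branch."""
    n = sum(parent)
    return h(parent) - sum(sum(b) / n * h(b) for b in branches)


parent = (9, 6)  # 9 malicious and 6 benign APPs among the N = 15 samples
gains = {
    # branches: (feature present), (feature absent)
    "API1": gain(parent, [(5, 0), (4, 6)]),
    "API2": gain(parent, [(6, 0), (3, 6)]),
    "perm1": gain(parent, [(1, 3), (8, 3)]),
}
best = max(gains, key=gains.get)
```

Rounded to three decimals, the computed values agree with the figures above (0.971 for the empirical entropy and 0.324, 0.420 and 0.135 for the three gains), and `best` is API2.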

Starting from the root node, each node is split in the manner described above until the number of APPs under the current node is 1 or the split attributes are used up, finally yielding a complete decision tree. The class of each leaf node of the decision tree is determined by the type (benign or malicious) of the APPs under it, and if it contains APPs of both types, by the type with the larger number of APPs. Every tree is grown fully and is not pruned; that is, the structure of the tree is not adjusted manually.
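A compact recursive sketch of this tree-growing procedure is given below. It assumes that each row is a list of 0/1 feature values, that `features` lists the column indices still available, and that a tree is represented as nested dictionaries; these representations, the function names, and the handling of an empty branch (it becomes a leaf carrying the parent's majority class) are assumptions, not details from the patent.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, f):
    """Information gain of splitting on 0/1 feature column f."""
    n = len(labels)
    cond = 0.0
    for value in (0, 1):
        part = [y for r, y in zip(rows, labels) if r[f] == value]
        if part:
            cond += len(part) / n * entropy(part)
    return entropy(labels) - cond


def build_tree(rows, labels, features, m, rng):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when only one APP remains or the split attributes are used up.
    if len(rows) <= 1 or not features:
        return {"label": majority}
    # Randomly pick m candidate features and keep the one with the largest gain.
    candidates = rng.sample(features, min(m, len(features)))
    best = max(candidates, key=lambda f: info_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best}
    for value, key in ((1, "has"), (0, "lacks")):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        if idx:
            node[key] = build_tree([rows[i] for i in idx],
                                   [labels[i] for i in idx], remaining, m, rng)
        else:
            # Assumption: an empty branch becomes a leaf with the parent's majority class.
            node[key] = {"label": majority}
    return node


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while "label" not in tree:
        tree = tree["has"] if row[tree["feature"]] == 1 else tree["lacks"]
    return tree["label"]
```

No pruning step is included, in line with the statement that every tree is grown fully.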

The decision tree construction is repeated to obtain multiple decision trees, which form a random forest.

To further improve the detection results, with information gain used as the criterion for selecting the split attribute of decision tree nodes as described above, the classification accuracy is measured for different numbers of decision trees at a preset interval, and the scheme with the higher accuracy is selected as the final model. In this embodiment the preset interval is 20, that is, the number of decision trees in the random forests built varies in steps of 20.

S6: Detect the APP to be detected with the random forest classifier built in S5.

For an APP to be detected, extract its static features, that is, its permission information and API call information, as in step S3. To simplify classification, the extracted static features are saved as a file with the suffix .csv using the storage format shown in Fig. 2: the static features extracted from each APP to be detected are saved as one row, each column corresponds to one static feature, and the order of the static features is the same as in Fig. 2, that is, the same as in the sample library. The csv of an APP to be detected has no type identifier distinguishing benign from malicious; its first column is directly the first static feature in the corresponding order. The APP to be detected is then classified with the random forest classifier built in step S5, thereby detecting whether it is malicious or benign.

One decision tree of the random forest has the shape shown in Fig. 4; the tree assigns the training samples involved to its leaf nodes. Because in the csv files of this application a 0 in a column means that the static feature is absent and a 1 means that it is present, the csv file of the APP to be detected can be used directly by a single decision tree to classify the APP. An example follows:

Suppose the APP to be detected calls API2. According to the decision tree shown in Fig. 4, it falls to the left of the root node. The left node is a leaf node, and malicious APPs are the majority among the APPs under it (there are 10, all malicious), so the APP to be detected is classified as malicious.

Suppose the APP to be detected does not call API2. It then falls to node 2 on the right of the root node. This node is not a leaf node, and according to the figure the next check is whether API1 is called. If API1 is called, the APP falls to node 3 in the figure. At node 3, if permission 1 is not requested, the APP falls to the leaf node on the right. As shown in the figure, non-malicious APPs are the majority among the APPs under the right leaf of node 3 (there are only 2, both non-malicious), so the APP to be detected is classified as non-malicious.
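The two walks through the tree can be traced with a small sketch. The dictionary below encodes only the branches of Fig. 4 that the text describes; the two branches it does not describe are marked "unspecified" rather than guessed. The representation and the name `classify` are illustrative assumptions.

```python
FIG4_TREE = {
    "feature": "API2",
    "has": {"label": "malicious"},             # left leaf: 10 APPs, all malicious
    "lacks": {                                  # node 2
        "feature": "API1",
        "has": {                                # node 3
            "feature": "perm1",
            "has": {"label": "unspecified"},    # branch not described in the text
            "lacks": {"label": "benign"},       # right leaf: 2 APPs, both non-malicious
        },
        "lacks": {"label": "unspecified"},      # branch not described in the text
    },
}


def classify(tree, app):
    """Follow the 'has'/'lacks' branches until a leaf is reached."""
    while "label" not in tree:
        present = app.get(tree["feature"], 0) == 1
        tree = tree["has"] if present else tree["lacks"]
    return tree["label"]


first = classify(FIG4_TREE, {"API2": 1})
second = classify(FIG4_TREE, {"API2": 0, "API1": 1, "perm1": 0})
```

Here `first` corresponds to the APP that calls API2 and `second` to the APP that calls API1 but neither calls API2 nor requests permission 1.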

Every decision tree in the random forest classifies the APP to be detected, and the final class is the mode of their results, that is, the class chosen by the most trees. For example, if the random forest contains 50 decision trees and 30 of them judge the APP to be detected to be malicious, the APP is finally judged to be malicious.
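Taking the mode of the per-tree results is a one-line operation. The sketch below reproduces the 50-tree example; the function name `forest_vote` is hypothetical, and how ties are broken is not stated in the text (here `Counter.most_common` simply returns the class it encountered first).

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class predicted by the largest number of trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]


# 50 trees, 30 of which judge the APP to be malicious.
votes = ["malicious"] * 30 + ["benign"] * 20
verdict = forest_vote(votes)
```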

The above is only a specific embodiment of the present invention. Any feature disclosed in this specification may, unless specifically stated otherwise, be replaced by another equivalent feature or an alternative feature serving a similar purpose; all disclosed features, and all steps of the disclosed methods or processes, may be combined in any way, except for mutually exclusive features and/or steps.

Claims

1. A method for detecting malicious applications on the Android platform based on random forest classification, characterized by comprising the following steps:

Obtaining Android application samples, including benign application samples and malicious samples;

Obtaining all permissions that an Android application can request and all APIs that it can call, to obtain a permission set and an API set;

Extracting the static features of each Android application sample, including the permissions each application sample requests and the APIs it calls;

Building a sample library based on the static features of each Android application sample, the permission set and the API set, where the entries of the sample library comprise: an Android application identifier, a type identifier distinguishing benign from malicious, a request identifier for each permission in the permission set, and a call identifier for each API in the API set;

Sampling from the sample library to obtain different groups of training data sets; taking the Android applications in one group of training data as the Android applications under the root node of a decision tree; and splitting each node of the decision tree to obtain one decision tree:

Randomly selecting m static features available at the current node and computing the information gain of each selected static feature, where m is a preset value; taking the static feature with the largest information gain as the split attribute of the current node and splitting each node on the split attribute, that is, assigning the Android applications that have the static feature corresponding to the split attribute to one leaf node and the Android applications that do not have it to another leaf node, until the number of Android applications under the current node is 1 or the split attributes are used up; the class of each leaf node is determined by the type of the Android applications under it, and if it contains Android applications of both types, by the type with the larger number of Android applications;

Extracting the static features of an Android application to be detected, classifying the Android application to be detected with the random forest classifier, and detecting whether the Android application to be detected is a malicious application.

2. The method of claim 1, characterized in that the type identifier distinguishing benign from malicious is 0 or 1, where 0 indicates benign and 1 indicates malicious; the request identifier for each permission in the permission set is 0 or 1, where 0 indicates that the permission is not requested and 1 indicates that the permission is requested; and the call identifier for each API in the API set is 0 or 1, where 0 indicates that the API is not called and 1 indicates that it is called.

3. The method of claim 1 or 2, characterized in that multiple random forests are built and the random forest with the highest classification accuracy is selected as the random forest classifier.