CN109753800A

CN109753800A - Android malicious application detection method and system integrating frequent itemsets and random forest algorithm

Info

Publication number: CN109753800A
Application number: CN201910002795.2A
Authority: CN
Inventors: 景小荣; 王丹
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Shenzhen Hongyue Enterprise Management Consulting Co ltd
Priority date: 2019-01-02
Filing date: 2019-01-02
Publication date: 2019-05-14
Anticipated expiration: 2039-01-02
Also published as: CN109753800B

Abstract

The invention discloses an Android malicious application detection method that integrates the frequent itemset (Apriori) algorithm and the random forest algorithm, and relates to the technical field of information processing. Android application samples are decompiled, and static features, namely permissions and function calls, are extracted from each decompiled file. The frequent 3-itemsets of the malicious samples and the normal samples are mined with the Apriori algorithm to obtain the association relationships between permissions in the sample set, and are then combined with sensitive application programming interface (API) function calls to generate features. A random forest classifier learns from and classifies the features, thereby detecting Android malicious applications. Malicious-application detection performed with the invention consumes few system resources and achieves very high detection accuracy.

Description

Android malicious application detection method and system integrating frequent itemsets and the random forest algorithm

Technical field

The present invention relates to the fields of network security and information security detection, and in particular to an Android malicious application detection method.

Background art

Android is currently the most popular intelligent terminal system in the world, and its open, free platform has led to its worldwide adoption. As a result, many malicious code developers have targeted the Android platform. As technology advances, the cost of producing Android malicious programs keeps falling, so the amount of Android malware grows daily. According to data released by the 360 Internet Security Center, 7.573 million new malware samples were intercepted on the Android platform in 2017, an average of 31,000 new samples per day. Malware frequently launches attacks using new techniques such as cryptocurrency-mining Trojans and botnets, including theft of users' personal information and malicious fee deduction, causing heavy losses to users. In the face of such large-scale malicious attacks, how to detect Android malicious applications effectively has become the foremost security issue for the Android platform.

At present, malicious-application detection for Android is mainly divided into static detection and dynamic detection. Static detection analyzes an application without running it: reverse-engineering techniques such as decompilation are applied to its source program to extract features such as signatures and permissions, from which characteristic behavior is analyzed directly. Static detection techniques mainly extract features from the application description file (AndroidManifest.xml) and the smali code files. Guo et al. parsed the information tags in AndroidManifest.xml and the smali code files to extract an application's classes, permissions, components, signatures, the data it processes, and its startup information. Rashidi B et al. used permissions and application programming interface (API) function calls as the feature set and detected malicious applications with the support vector machine (SVM) and K-nearest neighbor (KNN) algorithms, but many misclassifications remained. Machine learning can automate the detection of Android application software and improve analysis efficiency, but it depends on the extracted application features.

Dynamic detection of Android malicious applications obtains an application's features while the software is running, using techniques such as injection and hooking (HOOK). Its drawback is that the software must be run, which consumes excessive system resources. In dynamic detection research, Mahindru et al. used the Strace tracer to collect application behavior data and sent it to an analysis server, trained a classifier on these behavior samples, and finally used the K-nearest neighbor algorithm to judge whether an application contains malicious behavior. Singh L et al. used API hook technology to hook sensitive APIs on the Android platform: whenever the system or an application calls a specific API, the call is intercepted and redirected to a proxy function, from which details and behavior information are obtained.

Summary of the invention

The technical problem to be solved by the present invention is to address the above shortcomings of the prior art by using machine learning algorithms to learn from and detect Android applications, reducing the complexity of Android malicious application detection and saving system resources, and, in solving the problems of high-dimensional features and automated classification and detection, further improving the detection accuracy for malware.

To solve the above technical problem, the present invention proposes an Android malicious application detection method that integrates the frequent itemset (Apriori) algorithm and the random forest algorithm, comprising the following steps: decompiling Android application software in batches and obtaining the static features of the application software, namely its permissions and sensitive API functions; mining the frequent itemsets of the permission features to reduce the dimensionality of the permission features and obtain frequent 3-itemsets of permissions, thereby obtaining the association relationships between permissions in the sample set; mining the frequent 3-itemsets of the malicious samples and the normal samples and using them, together with the sensitive API functions, as features to construct a feature set; screening and scoring the feature attributes in the feature set with the information gain algorithm, extracting the important features, and constructing the corresponding vector space; and learning from and classifying the vector space with the random forest algorithm, the vector spaces of the normal samples and the malicious samples being labeled with a normal or malicious attribute.

The present invention further comprises, before feature extraction, decompiling the application software with a static analysis tool to obtain files including the resource files (res), the third-party software development kit .so files (lib), the smali files, and AndroidManifest.xml; these files contain the application software's various resource files, source code, and other static code features.

The present invention further comprises extracting features with a Python script: the XML file AndroidManifest.xml is parsed to extract all permissions requested by the application, yielding the permission features; all smali files are traversed with Python's os.walk() function, and the sensitive API functions of each sample are extracted by regular-expression matching.

The present invention further comprises that mining the frequent 3-itemsets of the permission features specifically includes: extracting permissions from the malicious samples and the normal samples respectively to construct permission sets; mining the frequent 1-itemsets of a permission set by calculating the support S of each permission in the set and pruning the 1-itemsets that do not satisfy the minimum support min_s, which yields the candidate set L₁; joining the elements of L₁; taking the joined candidate set as the new sample set and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s, which forms the new candidate set L₂; and repeating these steps until the frequent 3-itemsets are obtained.

The present invention further comprises that using the information gain (IG) algorithm specifically includes: calculating the difference between the entropy and the conditional entropy of a feature to obtain the IG value of the feature, a larger IG value indicating greater relevance; retaining the important features according to their relevance; and matching the important features against each application in the system to construct the corresponding vector space for each. Constructing the vector space specifically includes: constructing a feature set X containing the different features (x₁,x₂,…,x_n) of the application software, and applying the mapping ν: s → {0,1}^|X| to construct the vector space ν from the features in X, where s denotes an application and each dimension of ν corresponds to one feature in X; if s contains a given feature, the flag value corresponding to that feature in ν is 1, and otherwise it is 0.

The present invention also proposes an Android malicious application detection system that integrates the Apriori algorithm and the random forest algorithm, comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module. The feature extraction module extracts features from Android application software that has been decompiled in batches, obtaining the static features of the application software, namely its permissions and sensitive API functions. The feature processing module mines the frequent itemsets of the permission features to reduce the dimensionality of the permission features and obtain frequent 3-itemsets of permissions, thereby obtaining the association relationships between permissions in the sample set; it mines the frequent 3-itemsets of the malicious samples and the normal samples, uses them together with the sensitive API functions as features to construct a feature set, screens and scores the feature attributes in the feature set with the information gain algorithm, extracts the important features, and constructs the corresponding vector space. The random forest classification algorithm module learns from and classifies the vector space, labeling the vector spaces of the normal samples and the malicious samples with a normal or malicious attribute.

The present invention extracts the data features of applications by static detection, applies the Apriori algorithm to these features to mine the frequent 3-itemsets of permissions in normal software and malware, merges them with the sensitive API function calls, and builds a random forest classifier to learn from and classify them. Further, the IG algorithm obtains the IG value of each feature as the difference between the entropy and the conditional entropy of the feature; the important features are retained, and a matching algorithm constructs the corresponding vector space for each application in the system. Because the high-dimensional permission features are reduced to their frequent 3-itemsets, the invention consumes fewer system resources.

Brief description of the drawings

Fig. 1 shows the Android malicious application detection model that integrates the Apriori algorithm and the random forest algorithm.

Detailed description of the embodiments

The specific implementation of the invention is described in detail below with reference to the accompanying drawing.

Fig. 1 is a schematic diagram of the detection system model used by the present invention. To detect malicious applications on the Android system, the present invention combines Apriori-based mining of frequent 3-itemsets with random forest classification and proposes an Android malicious application detection system comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module.

First, the collected sample set of normal software and malware is decompiled in batches. From the decompiled application description file AndroidManifest.xml and the smali files, the application's permissions (Android permissions) and sensitive API function calls are extracted. The frequent 3-itemsets of permissions are then mined from the permission features to find the combination relationships between permissions in the normal samples and in the malicious samples, and are combined with the sensitive API functions as learning features. The IG algorithm is used to optimize feature selection; the retained important features are embedded into feature vectors to form the vector space, which is finally trained on and classified with the random forest algorithm to detect Android malicious applications.

Each part is described below.

(1) Feature extraction module. A Python script decompiles the sample set in batches and extracts the features of the decompiled AndroidManifest.xml and smali files; the extracted features mainly comprise permission features and sensitive API functions. For permission features, the permissions requested by an application are extracted by parsing its AndroidManifest.xml file. Whenever a user uses a certain system function or accesses certain sensitive data, the application must request the corresponding permission; for example, the permission android.permission.READ_PHONE_STATE requested in an AndroidManifest.xml file indicates that the application requests permission to read the phone state. For sensitive API functions, each smali file represents one Java class and contains the various system API functions that the application calls. Python's os.walk() function is used to traverse all smali files; within each file, lines beginning with a call instruction (invoke) are located, the API functions that occur are found by string matching, and the sensitive API functions of each sample are extracted from among all functions. The sensitive API functions extracted in this way reveal the potentially malicious behavior of the application software. Smali files represent the bytecode of the Android virtual machine (Dalvik), and malware must call the corresponding API functions in order to produce malicious behavior. Therefore, all sensitive API functions called in the sample set are used as learning features for the random forest algorithm, which, once trained, detects malicious applications.
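As a non-limiting illustration, the extraction described above might be sketched in Python as follows. The function names, the regular expression, and the SENSITIVE_APIS list are assumptions made for this example; the patent does not enumerate the sensitive APIs or give the matching pattern.

```python
import os
import re
import xml.etree.ElementTree as ET

# XML namespace of the android:name attribute in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Matches smali call instructions such as
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
INVOKE_RE = re.compile(r"^\s*invoke-[\w/-]+ \{[^}]*\}, (L[^;]+;)->([\w$<>]+)\(")

# Hypothetical sensitive-API list, chosen only for illustration.
SENSITIVE_APIS = {
    "Landroid/telephony/TelephonyManager;->getDeviceId",
    "Landroid/telephony/SmsManager;->sendTextMessage",
}

def extract_permissions(manifest_path):
    """Return the sorted permissions requested in a decoded AndroidManifest.xml."""
    root = ET.parse(manifest_path).getroot()
    names = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    return sorted(n for n in names if n)

def extract_sensitive_apis(decompiled_dir, sensitive=SENSITIVE_APIS):
    """Walk all smali files and collect the sensitive APIs that are invoked."""
    found = set()
    for dirpath, _dirs, files in os.walk(decompiled_dir):
        for name in files:
            if not name.endswith(".smali"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    match = INVOKE_RE.match(line)
                    if match:
                        api = match.group(1) + "->" + match.group(2)
                        if api in sensitive:
                            found.add(api)
    return sorted(found)
```

Each function returns a sorted list so that the features of a sample are reproducible from run to run.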

Before features are extracted from each sample, the sample set must be decompiled. The decompilation tool Apktool can be used to decompile files with the .apk suffix, yielding files including the resource files (res), the third-party SDK .so files (lib), the smali files, and AndroidManifest.xml; these files contain the various resource files, source code, and other static features.
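By way of example only, batch decompilation with Apktool could be driven from Python as sketched below. The sketch assumes that the apktool executable is on the PATH and uses its decode command (d) with the -o output option and the -f overwrite flag; the helper names and directory layout are hypothetical.

```python
import pathlib
import subprocess

def apktool_command(apk_path, out_dir):
    """Build the Apktool decode command; -f overwrites an existing output directory."""
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]

def decompile_all(apk_dir, out_root, run=subprocess.run):
    """Decode every .apk in apk_dir into out_root/<apk name without suffix>."""
    out_root = pathlib.Path(out_root)
    decoded = []
    for apk in sorted(pathlib.Path(apk_dir).glob("*.apk")):
        target = out_root / apk.stem
        run(apktool_command(apk, target), check=True)  # raises if Apktool fails
        decoded.append(target)
    return decoded
```

The run parameter exists only so that the command construction can be exercised without Apktool installed.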

In terms of permissions, malware usually requests certain dangerous permission combinations before producing malicious behavior; the permissions in such a combination depend on one another to produce that behavior. Malware therefore requests not just a single dangerous permission but a combination of dangerous permissions. For example, malicious samples commonly request the combination READ_SMS (read SMS messages), READ_PHONE_STATE (read phone state), and WRITE_SMS (edit SMS messages), which together allow malicious operations such as reading the user's private data and sending it elsewhere. Normal software rarely requests this combination. Because the different dangerous permissions in a combination cooperate to enable potentially malicious behavior, the combination can be used to identify malware.
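As a minimal sketch of this idea, checking whether an application requests every permission in a given combination reduces to a subset test. The constant and function names are assumptions made for the example; the combination itself is the one named above.

```python
# The three-permission combination named in the text above.
DANGEROUS_COMBO = frozenset({
    "android.permission.READ_SMS",
    "android.permission.READ_PHONE_STATE",
    "android.permission.WRITE_SMS",
})

def requests_combo(permissions, combo=DANGEROUS_COMBO):
    """Return True if the app requests every permission in the combination."""
    return combo <= set(permissions)
```

An application that requests only some of the three permissions does not match, which reflects the point that the permissions matter as a combination rather than individually.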

The Apriori algorithm, proposed by Agrawal et al., mines frequent itemsets for Boolean association rules; here it is used to mine the frequent 3-itemsets of permissions. Feature extraction yields a large number of permission features and sensitive API functions. However, the dimensionality of the permission features obtained is usually very large and the computational complexity is high. Apriori-based mining of the frequent itemsets of the permission features is therefore used to reduce the dimensionality of the permission features and obtain the frequent 3-itemsets of permissions, which capture the association relationships between permissions in the sample set. The specific steps are described below.

The frequent 3-itemsets of permissions are mined with the Apriori algorithm. Specifically, the permission set P requested by the normal software samples and the permission set M requested by the malicious samples are extracted from all samples, where P={p₁,p₂,…,p_n} denotes the n permissions requested across all normal software samples and M={m₁,m₂,…,m_x} denotes the x permissions requested across all malicious samples. Frequent 3-itemsets are mined separately from the permission sets of the normal samples and of the malicious samples, specifically as follows:

Mine the frequent 1-itemsets of the sample permission set: calculate the support S of each permission in the set, that is, the probability that the permission occurs across all samples, and prune the 1-itemsets that do not satisfy the minimum support min_s; the set that satisfies the condition is taken as the candidate set L₁. The elements of L₁ are then joined, and the joined candidate set, which now contains all candidate 2-itemsets, is taken as the new sample set. Frequent 2-itemsets are mined from this new set by pruning the 2-itemsets that do not satisfy the minimum support min_s, forming the new candidate set L₂. These steps are repeated until the frequent 3-itemsets of the sample permission set are obtained.

Join: within a set of frequent n-itemsets, starting from the first itemset (say, the i-th), search downward for itemsets (say, the j-th) whose first n-1 items are identical to those of the i-th, and then join all elements of the i-th itemset with the n-th element of the j-th to form an (n+1)-itemset.
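The join step can be illustrated with the short Python sketch below. It assumes each itemset is kept as a sorted tuple so that "the first n-1 items" is well defined; the function name is hypothetical.

```python
def join(frequent):
    """Join frequent n-itemsets that share their first n-1 items into (n+1)-itemsets."""
    # Normalize to sorted tuples and drop duplicates.
    itemsets = sorted({tuple(sorted(s)) for s in frequent})
    candidates = []
    for i, a in enumerate(itemsets):
        for b in itemsets[i + 1:]:
            if a[:-1] == b[:-1]:              # identical first n-1 items
                candidates.append(a + (b[-1],))  # append the n-th item of b
    return candidates
```

For 1-itemsets the shared prefix is empty, so every pair is joined; for 2-itemsets only pairs that share their first item are joined.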

For the normal software sample permission set P, the frequency of occurrence of each of p₁,p₂,…,p_n is calculated as the support S of that element. The minimum support is the minimum frequency of occurrence required of each element in P and lies between 0 and 1. After the frequent 1-itemsets are mined, pruning and joining are performed according to the minimum support, finally yielding the frequent 3-itemsets of the normal samples.

For the malicious sample permission set M, the frequency of occurrence of each of m₁,m₂,…,m_x is calculated as the support S of that element. After the frequent 1-itemsets are mined, pruning and joining are performed according to the minimum support, finally yielding the frequent 3-itemsets of the malicious samples.
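The level-wise mining described above might be sketched as follows. This is an illustrative sketch, not the patented implementation: candidates are generated here by set union followed by the Apriori subset check, which yields the same frequent itemsets as the prefix-based join; the function names and the min_s value used in any call are assumptions.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of samples whose permission set contains every item of itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_k_itemsets(samples, min_s, k=3):
    """Return {sorted k-itemset tuple: support} for itemsets with support >= min_s."""
    transactions = [frozenset(s) for s in samples]
    items = sorted({p for t in transactions for p in t})
    # Frequent 1-itemsets: prune permissions below the minimum support.
    level = [frozenset([p]) for p in items
             if support(frozenset([p]), transactions) >= min_s]
    for size in range(2, k + 1):
        prev = set(level)
        # Join: combine two frequent (size-1)-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Apriori property: every (size-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, size - 1))}
        # Prune candidates below the minimum support.
        level = [c for c in candidates if support(c, transactions) >= min_s]
    return {tuple(sorted(c)): support(c, transactions) for c in level}
```

The function would be called once on the permission lists of the malicious samples and once on those of the normal samples, giving the two collections of frequent 3-itemsets that are later merged with the sensitive API functions.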

(2) Feature processing

After the frequent 3-itemsets of the malicious samples and the normal samples have been mined with the Apriori algorithm, they are used together with the sensitive API functions as features, and the feature attributes are screened and scored with the information gain (IG) algorithm. The IG algorithm obtains the IG value of a feature as the difference between the information entropy and the conditional entropy of the feature; a larger value indicates greater relevance. Entropy: from the probabilities P(C_i) with which normal software and malware respectively occur in the sample set, the information entropy H(C) of the sample set is calculated as

$$H(C) = -\sum_{i} P(C_i)\,\log_2 P(C_i).$$

Conditional entropy: the conditional entropy H(Y|X_i) of the i-th feature, where Y denotes the class variable taking the values C_j, is calculated as

$$H(Y \mid X_i) = -\sum_{x} P(X_i = x) \sum_{j} P(C_j \mid X_i = x)\,\log_2 P(C_j \mid X_i = x).$$

The IG value of the i-th feature is then IG_i = H(C) - H(Y|X_i). To select, from the many features, those that help classify software as normal or malicious, so that the reduction in uncertainty is maximized, features whose IG value is 0 are rejected and the remaining features, whose IG value is not 0, are retained as important features.
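A minimal Python sketch of this screening step is given below. It assumes base-2 logarithms and binary feature columns, and it treats values below a small tolerance eps as 0 to absorb floating-point error; the function names and the eps parameter are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(C) - H(Y|X) for one feature column and the class labels."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def select_features(columns, labels, eps=1e-12):
    """Keep the names of the features whose information gain is greater than 0."""
    return [name for name, values in columns.items()
            if information_gain(values, labels) > eps]
```

A feature that is present in exactly the malicious samples has an IG equal to H(C), whereas a feature that is equally common in both classes has an IG of 0 and is rejected.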

Define the set X as the retained feature set of the application software, containing the different features (x₁,x₂,…,x_n), where n is the number of important features. According to the mapping ν: s → {0,1}^|X|, the vector space ν is constructed from the features in X, where s denotes an application and each dimension of ν corresponds to one feature in X. If s contains a feature, the flag value corresponding to that feature in ν is 1, and otherwise it is 0; the flag value thus indicates whether the application contains the feature.
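The mapping ν can be illustrated with the following sketch. It rests on an interpretation that the text does not spell out: a permission 3-itemset (represented here as a tuple) counts as contained when the application requests all three permissions, and an API feature (a string) counts as contained when the application calls it. The function names are hypothetical.

```python
def contains(feature, permissions, apis):
    """Return True if the application exhibits the feature (see assumptions above)."""
    if isinstance(feature, tuple):            # frequent 3-itemset of permissions
        return set(feature) <= set(permissions)
    return feature in apis                    # sensitive API function

def vectorize(feature_set, permissions, apis):
    """Map one application to a 0/1 vector with one dimension per feature in X."""
    return [1 if contains(f, permissions, apis) else 0 for f in feature_set]
```

Because the order of feature_set fixes the meaning of each dimension, the same ordered feature set has to be used for the training samples and for the software under test.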

Following the above method, a matching algorithm constructs the corresponding vector space ν for each application in the system. After feature screening, a feature set containing n features is built, a distinct vector ν is generated for each sample and stored in a MySQL database, and these vectors serve as the input to the random forest classification module.

(3) Random forest classification

Once the feature vectors are obtained, detection essentially becomes a classification problem. Because the detection result is one of two classes, normal or malicious, detection is a binary classification problem, for which the random forest algorithm is well suited. The vector space ν obtained above is classified with the random forest classification algorithm.

Specifically, supervised classification can be used as follows: for each application in the collected sets of known normal and malicious samples, a normal or malicious attribute label is appended to the vector space corresponding to that application, according to whether the application is normal software or malware, as described by the following formula.

Here V(S) denotes the set of all application software, normal indicates that the application is normal software, and malware indicates that the application is malware.

After the vector space of the training sample set has been obtained, it is used to train a random forest classifier. The software under test is passed through feature extraction and feature processing to obtain its vector space ν, which at this point carries no normal or malware label; its value is replaced with a blank or '?'. The random forest classifier trained on the training samples then classifies the vector space of the software under test, and the result, the string normal or malware, indicates whether the software under test is normal software or malware, thereby detecting malware.
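For illustration only, a small from-scratch random forest over 0/1 feature vectors is sketched below. The text does not specify the random forest implementation, so the tree construction (Gini impurity, bootstrap sampling, a random subset of about sqrt(n) features per split) and all names and default parameters are assumptions; an established machine-learning library would normally be used instead.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, rng, n_try, depth=0, max_depth=8):
    """Grow one tree; a leaf is a label string, a node is (feature, if_0, if_1)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):   # random feature subset
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        if not left or not right:
            continue                                   # feature does not split the node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority(labels)
    f = best[1]
    children = []
    for bit in (0, 1):
        idx = [i for i, r in enumerate(rows) if r[f] == bit]
        children.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                   rng, n_try, depth + 1, max_depth))
    return (f, children[0], children[1])

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 rng, n_try))
    return forest

def predict(forest, row):
    """Classify one unlabeled vector by majority vote over the trees."""
    votes = []
    for node in forest:
        while isinstance(node, tuple):
            node = node[2] if row[node[0]] == 1 else node[1]
        votes.append(node)
    return majority(votes)
```

Training takes the labeled vectors of the known samples, and predict returns the string normal or malware for the unlabeled vector of the software under test, matching the output format described above.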

The present invention uses decompilation to decompile the application software sample set in batches and extracts the permissions and API functions from the resulting files. The high-dimensional permission features are reduced with the Apriori algorithm to obtain the frequent 3-itemsets of permissions, which are combined with the sensitive API functions and then screened by information gain to obtain the important features. The important features are mapped to a vector space represented by 0s and 1s, normal and malicious applications are labeled, and a labeled vector space is finally obtained. The sample set is learned and classified with the random forest algorithm.

Claims

1. An Android malicious application detection method integrating a frequent itemset algorithm and a random forest algorithm, characterized by comprising the following steps: decompiling Android application software in batches to obtain a sample set, and obtaining the static features of the application software, namely its permissions and sensitive application programming interface (API) functions; mining the frequent itemsets of the permission features to reduce the dimensionality of the permission features and obtain frequent 3-itemsets of permissions, thereby obtaining the association relationships between permissions in the sample set; mining the frequent 3-itemsets of the malicious samples and the normal samples, and using these frequent 3-itemsets together with the sensitive API functions as features to construct a feature set; screening and scoring the feature attributes in the feature set with the information gain algorithm, extracting the important features, and constructing the corresponding vector space; and learning from and classifying the vector space with a random forest classifier, the vector spaces of the normal samples and the malicious samples being labeled with a normal or malicious attribute.

2. The method according to claim 1, characterized in that, before feature extraction, the application software is decompiled with a static analysis tool to obtain files including the resource files res, the third-party software development kit .so files lib, the smali files, and the application description file AndroidManifest.xml, these files containing the application software's various resource files, source code, and other static code features.

3. The method according to claim 1, characterized in that features are extracted with a Python script: the AndroidManifest.xml file is parsed to extract all permissions requested by the application, yielding the permission features; and all smali files are traversed with Python's os.walk() function and the sensitive API functions of all samples in the sample set are extracted by regular-expression matching.

4. The method according to claim 1, wherein mining the frequent 3-itemsets of the permission features specifically comprises: extracting permissions from the malicious samples or the normal samples respectively to construct permission sets; mining the frequent 1-itemsets of a permission set by calculating the support S of each permission in the set and pruning the 1-itemsets that do not satisfy the minimum support min_s to obtain the candidate set L₁; joining the elements of L₁; taking the joined candidate set as the new set of 2-itemsets and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s to form the new candidate set L₂; and repeating until the frequent 3-itemsets are obtained.

5. The method according to claim 1, characterized in that using the information gain (IG) algorithm specifically comprises: from the probabilities P(C_i) with which normal software and malware respectively occur in the sample set, calculating the information entropy H(C) of the sample set according to the formula $H(C) = -\sum_{i} P(C_i)\log_2 P(C_i)$; calculating the conditional entropy H(Y|X_i) of the i-th feature according to the formula $H(Y \mid X_i) = -\sum_{x} P(X_i = x) \sum_{j} P(C_j \mid X_i = x)\log_2 P(C_j \mid X_i = x)$; and calculating the IG value of the i-th feature according to the formula IG_i = H(C) - H(Y|X_i), a larger IG value indicating greater relevance of the frequent 3-itemsets of the malicious samples and the normal samples; retaining the important features according to their relevance; and matching the important features against each application in the system to construct the corresponding vector space for each.

6. The method according to claim 5, characterized in that constructing the vector space specifically comprises: rejecting the features whose IG value is 0 and retaining the remaining features, whose IG value is not 0, as important features; constructing a feature set X containing the different features (x₁, x₂, …, x_n) of the application software samples; and applying the mapping ν: s→{0,1}^|X| to construct the vector space ν from the features in X, where s denotes an application and each dimension of ν corresponds to one feature in X; if s contains a given feature, the flag value corresponding to that feature in ν is 1, and otherwise it is 0.

7. An Android malicious application detection system integrating a frequent itemset algorithm and a random forest algorithm, comprising a feature extraction module, a feature processing module, and a random forest classification algorithm module, characterized in that: the feature extraction module extracts features from Android application software that has been decompiled in batches, obtaining the static features of the application software, namely its permissions and sensitive API functions; the feature processing module mines the frequent itemsets of the permission features to reduce the dimensionality of the permission features and obtain frequent 3-itemsets of permissions, thereby obtaining the association relationships between permissions in the sample set, mines the frequent 3-itemsets of the malicious samples and the normal samples, and uses them together with the sensitive API functions as features to construct a feature set; and the random forest classification algorithm module learns from and classifies the vector space, the random forest classifier labeling the vector spaces of the normal samples and the malicious samples with a normal or malicious attribute.

8. The detection system according to claim 7, characterized in that the application software is decompiled with a static analysis tool to obtain files including res, lib, smali, and AndroidManifest.xml, these files containing the application software's various resource files, source code, and other static code features.

9. The detection system according to claim 7, characterized in that features are extracted with a Python script: the AndroidManifest.xml file is parsed to extract all permissions requested by the application, yielding the permission features; and all smali files are traversed with the os.walk() function and the sensitive API functions of each sample are extracted by regular-expression matching.

10. The detection system according to claim 7, wherein mining the frequent 3-itemsets of the permission features specifically comprises: extracting permissions from the malicious samples or the normal samples respectively to construct permission sets; mining the frequent 1-itemsets of a permission set by calculating the support S of each permission in the set and pruning the 1-itemsets that do not satisfy the minimum support min_s to obtain the candidate set L₁; joining the elements of L₁; taking the joined candidate set as the new sample set and mining the frequent 2-itemsets by pruning the 2-itemsets that do not satisfy the minimum support min_s to form the new candidate set L₂; and repeating until the frequent 3-itemsets are obtained.

11. The detection system according to claim 7, characterized in that using the IG algorithm specifically comprises calculating the difference between the entropy and the conditional entropy of a feature to obtain the IG value of the feature: from the probabilities P(C_i) with which normal software and malware respectively occur in the sample set, the information entropy H(C) of the sample set is calculated according to the formula $H(C) = -\sum_{i} P(C_i)\log_2 P(C_i)$; the conditional entropy H(Y|X_i) of the i-th feature is calculated according to the formula $H(Y \mid X_i) = -\sum_{x} P(X_i = x) \sum_{j} P(C_j \mid X_i = x)\log_2 P(C_j \mid X_i = x)$; and the IG value of the i-th feature is calculated according to the formula IG_i = H(C) - H(Y|X_i), a larger IG value indicating greater relevance of the frequent 3-itemsets of the malicious samples and the normal samples; the important features are retained according to their relevance and matched against each application in the system, and the corresponding vector space is constructed for each.

12. The detection system according to claim 11, wherein a larger IG value indicates greater relevance, the important features are retained according to their relevance and matched against each application in the system, and the corresponding vector space is constructed for each; constructing the vector space comprises: rejecting the features whose IG value is 0 and retaining the remaining features, whose IG value is not 0, as important features; constructing a feature set X containing the different features (x₁, x₂, …, x_n); and applying the mapping ν: s→{0,1}^|X| to construct the vector space ν from the features in X, where s denotes an application and each dimension of ν corresponds to one feature in X; if s contains a given feature, the flag value corresponding to that feature in ν is 1, and otherwise it is 0.