Background art
Android is currently the most popular smart-terminal operating system in the world, and its open, free platform has led to wide adoption. As a result, many malicious-code authors have made the Android platform their target of attack. As technology advances, the cost of producing Android malware keeps falling, so the number of Android malware samples grows day by day. According to data published by the 360 Internet Security Center, 7.573 million new malware samples were intercepted on the Android platform in 2017, an average of about 21,000 new samples per day. Malware frequently launches attacks using new techniques such as cryptocurrency-mining Trojans and botnets, and malicious behaviors such as stealing users' personal information and fraudulent fee deduction cause heavy losses for users. Faced with attacks on this scale, how to detect Android malicious applications effectively has become the foremost security problem for the Android platform.
At present, malware detection for Android applications falls broadly into static detection and dynamic detection. Static detection does not run the application; instead it uses reverse-engineering techniques such as decompilation to analyze the program's source and extract features such as signatures and permissions, from which characteristic behavior is analyzed directly. Static detection techniques mainly extract features from the application manifest file (AndroidManifest.xml) and the smali code files. Guo et al. parse the information tags in AndroidManifest.xml and the smali code files to extract an application's classes, permissions, components, signatures, the various kinds of data it processes, and its startup information. Rashidi B. et al. take permissions and application programming interface (API) function calls as the feature set and detect malicious applications with support vector machine (SVM) and K-nearest neighbor (KNN) algorithms, but with many misclassifications. Machine learning can automate detection that would otherwise be done manually and improves the efficiency of analysis, but it depends on the features extracted from the applications.
Dynamic detection of Android malicious applications obtains an application's features while it is running, using techniques such as injection and hooking (HOOK). Its drawback is that the software must be run, which consumes excessive system resources. In dynamic detection research, Mahindru et al. use the tracer strace to collect application behavior data and send it to an analysis server, where the behavior samples are used to train a classifier; a K-nearest neighbor algorithm finally decides whether an application contains malicious behavior. Singh L. et al. use API hooking to hook sensitive APIs on the Android platform: whenever the system or an application calls a specific API, the call is intercepted and redirected to a proxy function that records the call details, from which behavior information is obtained.
Summary of the invention
The technical problem to be solved by the present invention, in view of the above shortcomings of the prior art, is to learn from and detect Android applications by means of machine learning algorithms, reducing the complexity of Android malicious application detection and saving system resources, and, while addressing high-dimensional features and automated classification and detection, further improving the detection accuracy for malware.
The technical solution by which the present invention solves the above technical problem is an Android malicious application detection method that fuses the frequent-itemset (Apriori) algorithm with the random forest algorithm, comprising the following steps: batch-decompiling Android application software to obtain static features, namely the permissions and sensitive API functions of each application; mining the frequent itemsets of the permission features as a dimensionality reduction of those features, obtaining the frequent 3-itemsets of permissions and thereby the association relationships among permissions in the sample set; mining the frequent 3-itemsets of the malicious samples and of the normal samples and combining them with the sensitive API functions as features to build a feature set; screening and scoring the feature attributes in the feature set with the information gain algorithm, extracting the important features, and constructing the corresponding vector spaces; and using the random forest algorithm to learn from and classify the vector spaces, the vector spaces of the normal and malicious samples being labeled with the attribute normal or malicious.
The present invention further comprises, before feature extraction, decompiling the application software with a static analysis tool to obtain a folder containing the resource files (res), the .so files of third-party software development kits (lib), the smali files, and AndroidManifest.xml; the folder holds the application's resource files, source code, and other static code features.
The present invention further comprises extracting features with Python scripts: extensible markup language files such as AndroidManifest.xml are parsed to extract all permissions of the application, giving the permission features, and the Python function os.walk() is used to traverse all smali files, from which the sensitive API functions of each sample are extracted by regular-expression matching.
The present invention further comprises mining the frequent 3-itemsets of the permission features as follows: permissions are extracted from the malicious samples and from the normal samples to build the respective permission sets; frequent 1-itemsets of a permission set are mined by computing the support S of each permission in the set and pruning the 1-itemsets that do not meet the minimum support min_s, which gives the candidate set $L_1$, whose elements are then joined; the joined candidate set is taken as the new sample set from which frequent 2-itemsets are mined, the 2-itemsets that do not meet min_s being pruned to form the new candidate set $L_2$; and the procedure is repeated until the frequent 3-itemsets are obtained.
The present invention further comprises applying the information gain (IG) algorithm as follows: the difference between the information entropy and the conditional entropy given a feature is computed to obtain that feature's IG value, a larger IG value indicating a stronger correlation; important features are retained according to this correlation and matched against each application in the system to construct the corresponding vector space for each. Constructing the vector space specifically comprises building a feature set $X$ containing the distinct features $(x_1, x_2, \ldots, x_n)$ of the applications and applying the mapping $\nu: s \to \{0,1\}^{|X|}$ to construct the vector space $\nu$ from the features in $X$, where $s$ denotes an application and each dimension of $\nu$ corresponds to one feature in $X$: if $s$ contains a feature, the corresponding flag value in $\nu$ is 1; otherwise it is 0.
The present invention also provides an Android malicious application detection system that fuses the Apriori algorithm with the random forest algorithm, comprising a feature extraction module, a feature processing module, and a random forest classification module. The feature extraction module extracts features from the batch-decompiled Android application software, obtaining the static features, namely the permissions and sensitive API functions of each application. The feature processing module mines the frequent itemsets of the permission features as a dimensionality reduction, obtaining the frequent 3-itemsets of permissions and thereby the association relationships among permissions in the sample set; it mines the frequent 3-itemsets of the malicious and normal samples, combines them with the sensitive API functions as features to build a feature set, screens and scores the feature attributes with the information gain algorithm, extracts the important features, and constructs the corresponding vector spaces. The random forest classification module learns from and classifies the vector spaces, the vector spaces of the normal and malicious samples being labeled with the attribute normal or malicious.
The present invention extracts data features by static detection, then applies the Apriori algorithm to mine the frequent 3-itemsets of permissions in normal software and in malware, merges them with the sensitive API function calls, and builds a random forest classifier to learn from and classify them. Further, the IG algorithm obtains each feature's IG value as the difference between the information entropy and the conditional entropy given the feature; important features are retained, and a matching algorithm constructs the corresponding vector space for each application in the system. Because the high-dimensional permission features are reduced to their frequent 3-itemsets, the invention consumes fewer system resources.
Detailed description of the embodiments
The specific implementation of the present invention is described in detail below with reference to the accompanying drawing.
Fig. 1 is a schematic diagram of the detection system model used by the present invention. To detect malicious applications on the Android system, the present invention combines mining of frequent 3-itemsets by the Apriori algorithm with classification by the random forest algorithm and proposes an Android malicious application detection system comprising a feature extraction module, a feature processing module, and a random forest classification module.
First, the collected sample set of normal software and malware is decompiled in batches. From the decompiled application manifest file AndroidManifest.xml and the smali files, the application's permissions (Android permissions) and its sensitive API function calls are extracted. The frequent 3-itemsets of permissions are then mined from the permission features to find the combination relationships among permissions in the normal and malicious samples, and they are combined with the sensitive API functions as learning features. Feature selection is optimized with the IG algorithm; the retained important features are then embedded into feature vectors to form the vector space, which is finally trained on and classified with the random forest algorithm to detect Android malicious applications.
Each part is described below.
(1) Feature extraction module. Python scripts are used to decompile the sample set in batches and to extract features from the decompiled AndroidManifest.xml and smali files; the extracted features are mainly permission features and sensitive API functions.
For permission feature extraction, the permission features are taken from the permissions the application requests, for example by parsing AndroidManifest.xml and extracting all permissions declared there. When a user uses a certain function of the system or accesses certain sensitive data, the application must have requested the corresponding permission; for example, the permission android.permission.READ_PHONE_STATE declared in AndroidManifest.xml indicates that the application requests permission to read the phone state. For sensitive API functions, smali is the bytecode representation used by the Android virtual machine (Dalvik); each smali file represents one Java class and contains the system API functions that the application calls. The Python function os.walk() is used to traverse all smali files; within each file, the lines beginning with a call instruction (invoke) are matched by string matching to enumerate the API functions that occur, and the sensitive API functions of each sample are extracted from among them. The sensitive API functions extracted in this way reveal the potentially malicious behavior of the application, because malware must call the corresponding API functions to carry out malicious behavior. Therefore, all sensitive API functions called in the sample set are used as learning features for the random forest algorithm, which after training is used to detect malicious applications.
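The extraction described above can be sketched in Python as follows. This is a minimal illustration, not the invention's actual script: the `SENSITIVE_APIS` watch list and the `INVOKE_RE` pattern are assumptions introduced here, since the text does not enumerate its sensitive APIs or give its regular expression.

```python
import os
import re
import xml.etree.ElementTree as ET

# Namespace that prefixes android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical watch list; the source does not list its sensitive APIs.
SENSITIVE_APIS = {
    "Landroid/telephony/TelephonyManager;->getDeviceId",
    "Landroid/telephony/SmsManager;->sendTextMessage",
}

# Assumed pattern for smali call lines such as
#   invoke-virtual {v0}, Lpkg/Class;->method()V
# It captures "Lpkg/Class;->method".
INVOKE_RE = re.compile(
    r"^\s*invoke-[\w/-]+\s+\{[^}]*\},\s+(\S+?;->[^(\s]+)\(", re.MULTILINE
)


def extract_permissions(manifest_path):
    """Return the permission names declared with <uses-permission>."""
    root = ET.parse(manifest_path).getroot()
    return {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
        if elem.get(ANDROID_NS + "name")
    }


def extract_sensitive_apis(decompiled_dir, watch_list=SENSITIVE_APIS):
    """Walk every .smali file and return the watched APIs that are invoked."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(decompiled_dir):
        for name in filenames:
            if not name.endswith(".smali"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for api in INVOKE_RE.findall(fh.read()):
                    if api in watch_list:
                        found.add(api)
    return found
```

Both functions return sets, so a sample's features can be compared with set operations in later steps.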
Before features are extracted from each sample, the sample set must be decompiled. The decompilation tool Apktool can be used to decompile files with the .apk suffix, producing a folder that contains the resource files (res), the .so files of third-party SDKs (lib), the smali files, and AndroidManifest.xml; these files hold the resource files, source code, and other static features.
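A batch decompilation step along these lines could be scripted as below. It is a sketch that assumes the `apktool` command is installed and on the PATH; the function names and directory layout are illustrative and not taken from the source.

```python
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir):
    """Build the Apktool command line that decodes one APK into out_dir."""
    # "d" decodes the APK, "-f" overwrites an existing output directory,
    # "-o" names the output directory.
    return ["apktool", "d", "-f", "-o", str(out_dir), str(apk_path)]


def decompile_all(apk_dir, out_root):
    """Decode every .apk in apk_dir; return the output folders produced."""
    produced = []
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        out_dir = Path(out_root) / apk.stem
        result = subprocess.run(
            apktool_command(apk, out_dir), capture_output=True, text=True
        )
        if result.returncode == 0:
            produced.append(out_dir)
        else:
            # Keep going so that one broken sample does not stop the batch.
            print(f"apktool failed for {apk.name}: {result.stderr.strip()}")
    return produced
```

Each returned folder is expected to hold the res, lib, and smali directories and AndroidManifest.xml described above.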
In terms of permissions, malware usually requests certain dangerous permission combinations before carrying out malicious behavior, and the permissions in such a combination depend on one another to produce that behavior. Malware therefore does not request a single dangerous permission but a combination of dangerous permissions. For example, malicious samples often request the combination READ_SMS (read SMS messages), READ_PHONE_STATE (read phone state), and WRITE_SMS (edit SMS messages), which together allow malicious operations such as reading the user's private data and sending it elsewhere, whereas this combination is rare in normal software. Because the dangerous permissions in such a combination cooperate to enable potentially malicious behavior, the presence of the combination can be used to judge that an application is malware.
The Apriori algorithm, proposed by Agrawal et al., mines frequent itemsets for Boolean association rules. Here it is used to mine the frequent 3-itemsets of permissions. Feature extraction yields a large number of permission features and sensitive API functions; however, the dimensionality of the permission features is usually very large and the computational complexity is high. The frequent itemsets of the permission features are therefore mined with the Apriori algorithm as a dimensionality reduction of the permission features, giving the frequent 3-itemsets of permissions and thereby the association relationships among permissions in the sample set. The specific steps are described below.
To mine the frequent 3-itemsets of permissions with the Apriori algorithm, the requested permissions are extracted from all samples to form the normal-sample permission set $P$ and the malicious-sample permission set $M$, where $P = \{p_1, p_2, \ldots, p_n\}$ denotes the $n$ permissions requested across all normal samples and $M = \{m_1, m_2, \ldots, m_x\}$ denotes the $x$ permissions requested across all malicious samples. Frequent 3-itemsets are mined separately for the permission sets of the normal samples and of the malicious samples. Specifically, the following method can be used:
Mining frequent 1-itemsets from a sample permission set: the support S of each permission in the permission set is computed, which is the probability that the permission occurs across the sample set; 1-itemsets that do not meet the minimum support min_s are pruned, and the remaining set is taken as the candidate set $L_1$, whose elements are then joined. The joined candidate set, which now contains all 2-itemsets, is taken as the new sample set, from which frequent 2-itemsets are mined; 2-itemsets that do not meet min_s are pruned to form the new candidate set $L_2$. These steps are repeated until the frequent 3-itemsets of the sample permission set are obtained.
Joining: within a set of frequent n-itemsets, starting from an itemset (say the i-th), search downward for each itemset (say the j-th) whose first n-1 items are the same; all elements of the i-th itemset and the n-th element of the j-th itemset are then joined into an (n+1)-itemset.
From the normal-sample permission set $P$, the frequency of occurrence of each of $p_1, p_2, \ldots, p_n$ is computed as that element's support S; the minimum support is the lowest frequency of occurrence required of an element of $P$ and lies between 0 and 1. After the frequent 1-itemsets are mined, pruning and joining are carried out according to the minimum support, finally giving the frequent 3-itemsets of the normal samples.
From the malicious-sample permission set $M$, the frequency of occurrence of each of $m_1, m_2, \ldots, m_x$ is computed as that element's support S. After the frequent 1-itemsets are mined, pruning and joining are carried out according to the minimum support, finally giving the frequent 3-itemsets of the malicious samples.
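The mining procedure can be illustrated with the short Python sketch below. It is a simplified version under stated assumptions: each sample is represented as a set of permission names, and the join is written as a set union of two frequent (k-1)-itemsets rather than the prefix-matching join described above. After pruning, both forms yield the same frequent itemsets. The function name and data layout are illustrative.

```python
from itertools import combinations


def frequent_itemsets(samples, min_s, max_k=3):
    """Return {k: {itemset: support}} for k = 1..max_k using Apriori.

    samples: non-empty list of sets of permission names, one per application.
    min_s:   minimum support, a fraction between 0 and 1.
    """
    n = len(samples)

    def support(itemset):
        # Fraction of samples that contain every permission in the itemset.
        return sum(itemset <= s for s in samples) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: sup for c, sup in scored.items() if sup >= min_s}

    # Frequent 1-itemsets: prune single permissions below min_s.
    singles = {frozenset([p]) for s in samples for p in s}
    result = {1: keep_frequent(singles)}

    for k in range(2, max_k + 1):
        prev = result[k - 1]
        # Join: unite two frequent (k-1)-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        result[k] = keep_frequent(candidates)
    return result
```

Calling `frequent_itemsets(malicious_samples, min_s)[3]` and `frequent_itemsets(normal_samples, min_s)[3]` separately would give the two groups of frequent 3-itemsets that are later used as features.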
(2) Feature processing
After the frequent 3-itemsets of the malicious and normal samples have been mined with the Apriori algorithm, they are combined with the sensitive API functions as features, and the feature attributes are screened and scored with the information gain (IG) algorithm. The IG algorithm obtains a feature's IG value as the difference between the information entropy and the conditional entropy given the feature; a larger value indicates a stronger correlation. Entropy: from the probabilities $P(C_i)$ with which normal software and malware occur in the sample set, the information entropy $H(C)$ of the sample set is computed as

$$H(C) = -\sum_{i} P(C_i) \log_2 P(C_i).$$

Conditional entropy: the conditional entropy $H(Y \mid X_i)$ of the $i$-th feature, where $Y$ denotes the class (normal or malicious), is computed as

$$H(Y \mid X_i) = \sum_{x} P(X_i = x)\, H(Y \mid X_i = x).$$

The IG value of the $i$-th feature is then $IG_i = H(C) - H(Y \mid X_i)$. To pick out, from the many features, those that help classify software as normal or malicious, that is, those that reduce uncertainty the most, features whose IG value is 0 are discarded and the remaining features, whose IG value is not 0, are retained as important features.
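As an illustration of the screening step, the sketch below computes the entropy, the conditional entropy, and the IG value for features stored as columns of 0/1 values, then keeps the features with non-zero IG. The function names and the small tolerance used in place of an exact comparison with 0 are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their conditional entropy given the feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def select_features(vectors, labels, names, eps=1e-12):
    """Score every feature column and keep those with non-zero IG, best first."""
    scored = [
        (information_gain([v[i] for v in vectors], labels), name)
        for i, name in enumerate(names)
    ]
    # eps absorbs floating-point noise around an IG value of exactly 0.
    return [(name, ig) for ig, name in sorted(scored, reverse=True) if ig > eps]
```

A feature that takes the same value in every sample has a conditional entropy equal to the entropy of the labels, so its IG is 0 and it is dropped.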
Let the set $X$ be the feature set retained for the applications, containing the distinct features $(x_1, x_2, \ldots, x_n)$, where $n$ is the number of important features. The vector space $\nu$ is constructed from the features in $X$ by the mapping $\nu: s \to \{0,1\}^{|X|}$, where $s$ denotes an application and each dimension of $\nu$ corresponds to one feature in $X$. If $s$ contains the feature, the corresponding flag value in $\nu$ is 1; otherwise it is 0. The flag value thus records whether the application contains the feature.
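The mapping can be sketched as follows. The representation is an assumption made for illustration: a frequent 3-itemset feature is stored as a `frozenset` of permission names and counts as present when the application requests all three permissions, while a sensitive API feature is stored as a string and counts as present when the application calls it.

```python
def to_vector(permissions, apis, feature_set):
    """Map one application to a 0/1 vector with one dimension per feature.

    permissions: set of permission names requested by the application.
    apis:        set of sensitive API names called by the application.
    feature_set: ordered list of retained features (frozensets or strings).
    """
    vector = []
    for feature in feature_set:
        if isinstance(feature, frozenset):
            # Permission-combination feature: all members must be requested.
            vector.append(1 if feature <= permissions else 0)
        else:
            # Sensitive API feature: the call must appear in the smali code.
            vector.append(1 if feature in apis else 0)
    return vector
```

Because `feature_set` is an ordered list, every application is mapped onto the same dimensions, which is what allows the vectors to be compared by a classifier.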
Following the above method, a matching algorithm constructs the corresponding vector space $\nu$ for each application in the system. After feature screening, a feature set of $n$ features has been built, and each sample yields its own vector space $\nu$, which is stored in a MySQL database as the input to the random forest classification module.
(3) Random forest classification
Once the feature vectors have been obtained, detection becomes a classification problem. Because the detection result falls into two classes, normal and malicious, detection is a binary classification problem, for which the random forest algorithm is well suited. Classification is carried out with the random forest classification algorithm on the vector spaces $\nu$ obtained above.
Specifically, the following supervised classification method can be used: for each application in the collected sample set of known normal and malicious software, the application is labeled with the attribute normal or malicious, appended after its corresponding vector space, according to whether it is normal software or malware, as described by the following formula.
Here $V(S)$ denotes the set of all applications, normal indicates that an application is normal software, and malware indicates that it is malware.
Once the vector spaces of the training sample set have been obtained, they are used to train a random forest classifier. The software under test is passed through feature extraction and feature processing to obtain its vector space $\nu$, which at this point carries no normal or malware label; a blank or '?' stands in for the label value. The random forest classifier trained on the training samples then classifies the vector space of the software under test, and the result reports the software as normal software or malware by the string normal or malware. Malware detection is achieved in this way.
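The training and detection flow can be illustrated with the compact pure-Python random forest below, which works on 0/1 feature vectors labeled "normal" or "malware". It is a teaching sketch under simplifying assumptions (Gini impurity as the split criterion, a square-root number of candidate features per split, fully grown trees) and is not the classifier used by the invention; in practice a library implementation such as scikit-learn's `RandomForestClassifier` would normally be preferred.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, n_try, rng):
    """Grow one tree; a node is a label or (feature, zero_branch, one_branch)."""
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    # Consider only a random subset of features at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        zero = [y for x, y in zip(rows, labels) if x[f] == 0]
        one = [y for x, y in zip(rows, labels) if x[f] == 1]
        if zero and one:
            score = (len(zero) * gini(zero) + len(one) * gini(one)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f)
    if best is None:
        # No sampled feature separates the rows: fall back to a majority leaf.
        return Counter(labels).most_common(1)[0][0]
    f = best[1]
    split = {0: ([], []), 1: ([], [])}
    for x, y in zip(rows, labels):
        split[x[f]][0].append(x)
        split[x[f]][1].append(y)
    return (f, build_tree(*split[0], n_try, rng), build_tree(*split[1], n_try, rng))


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(
            build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_try, rng)
        )
    return forest


def classify(forest, vector):
    """Return the majority vote of all trees for one unlabeled vector."""
    votes = []
    for node in forest:
        while isinstance(node, tuple):
            node = node[2] if vector[node[0]] == 1 else node[1]
        votes.append(node)
    return Counter(votes).most_common(1)[0][0]
```

The vector of the software under test is passed to `classify` without a label, and the returned string, normal or malware, is the detection result.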
The present invention uses decompilation to batch-decompile the application sample set and extracts the permissions and API functions from the resulting files. The high-dimensional permission features are reduced with the Apriori algorithm to the frequent 3-itemsets of permissions, which are combined with the sensitive API functions and screened by information gain to obtain the important features. The important features are mapped to a vector space of 0s and 1s, and normal and malicious applications are labeled, giving labeled vector spaces. The sample set is then learned from and classified with the random forest algorithm.