CN109753800B

CN109753800B - Android malicious application detection method and system fusing frequent item set and random forest algorithm

Info

Publication number: CN109753800B
Application number: CN201910002795.2A
Authority: CN
Inventors: 景小荣; 王丹
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-01-02
Filing date: 2019-01-02
Publication date: 2023-04-07
Anticipated expiration: 2039-01-02
Also published as: CN109753800A

Abstract

The invention discloses an Android malicious application detection method fusing a frequent item set (Apriori) algorithm and a random forest algorithm, and relates to the technical field of information processing. Android application samples are decompiled, and the association relations among the permissions in the sample set are obtained from the static features, namely permissions and function calls, extracted from each decompiled file. Frequent 3-item sets of the malicious samples and the normal samples are mined with the Apriori algorithm and combined with sensitive application programming interface (API) function calls to generate features. A random forest classifier then learns and classifies the features, thereby detecting malicious Android applications. The method is used for malicious-application detection of Android software, consumes few system resources, and achieves high detection accuracy.

Description

Android malicious application detection method and system fusing frequent item set and random forest algorithm

Technical Field

The invention relates to the field of network security and information security detection, in particular to an Android malicious application detection method.

Background

Android is currently the most popular smart-terminal operating system in the world and is widely used thanks to characteristics such as platform openness and being free of charge. As a result, many malicious-code authors target the Android platform. As technology advances, the cost of producing Android malware keeps falling, so the amount of Android malware grows day by day. According to data published by the 360 Internet Security Center, 757.3 thousand new Android malware samples were intercepted in 2017, an average of 3.1 thousand new samples per day. Malware frequently launches attacks using new techniques such as mining trojans and botnets, and engages in rogue behavior such as stealing users' personal information and malicious charging, which causes users heavy losses. In the face of such widespread attacks, how to detect malicious Android applications effectively has become the primary security problem of the Android platform.

At present, malicious-application detection for Android is mainly divided into static detection and dynamic detection. Static detection analyzes the program without running it, using reverse-engineering means such as decompilation to extract features such as signatures and permissions and to analyze characteristic behaviors directly. Static detection techniques mainly extract features from the application description file (AndroidManifest.xml) and the smali code files. Guo et al. analyze the information tags in AndroidManifest.xml and the smali code files and extract the classes, permissions, components, signatures, various processed data, startup information and so on of the application. Rashidi B et al. use permissions and application programming interface (API) function calls as the feature set and detect malicious applications with a Support Vector Machine (SVM) and the K-Nearest Neighbor (KNN) algorithm, but produce many false positives. Detecting Android applications with machine learning removes the need for manual analysis and improves analysis efficiency, but machine learning depends on the application features that are extracted.

Dynamic detection of malicious Android applications obtains the features of an application while it is running, through techniques such as injection and hooking. Its drawbacks are that the software must be run and that it consumes excessive system resources. In dynamic detection research, Mahinderu et al. use the tracing tool strace to collect behavior data of applications and transmit it to an analysis server, train a classifier on the behavior samples, and finally use the K-nearest neighbor algorithm to judge whether an application contains malicious behavior. Singh L et al. use API hooking to hook sensitive APIs on the Android platform; once the system or an application calls a particular API, the call can be intercepted and redirected to a proxy function to obtain detailed information, that is, behavior information.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the above defects of the prior art: to learn from and detect Android applications with machine learning algorithms, to reduce the complexity of detecting malicious Android applications and save system resources, and to further improve malware detection accuracy while addressing the problems of high-dimensional features and automatic classification.

The technical scheme for solving the above technical problem is an Android malicious application detection method fusing a frequent item set (Apriori) algorithm and a random forest algorithm, comprising the following steps: decompiling Android application software in batches to obtain the static features of the application software, namely permissions and sensitive API functions; mining frequent item sets of the permission features to reduce the dimensionality of the permission features and obtain frequent 3-item sets of permissions, so as to obtain the association relations between the permissions in the sample set; mining the frequent 3-item sets of the malicious samples and the normal samples, using the frequent 3-item sets together with the sensitive API functions as features to construct a feature set, screening and scoring the feature attributes in the feature set with an information gain algorithm, extracting important features, and constructing the vector spaces corresponding to the important features; and learning and classifying the vector spaces with a random forest algorithm, the vector spaces of the normal samples and the malicious samples being labeled with normal or malicious attributes.

The method further comprises decompiling the application software with a static analysis tool before feature extraction, to obtain files comprising the resource files (res), the .so files (lib) of third-party software development kits, the smali files, and AndroidManifest.xml.

The method further comprises extracting features with a Python script, which parses all permissions of an application from extensible markup language (XML) files such as AndroidManifest.xml.

The invention further provides that mining the frequent 3-item sets of the permission features specifically comprises: extracting permissions from the malicious samples and the normal samples respectively to construct permission sets; mining the frequent 1-item sets of a permission set: calculating the support S of each permission in the permission set and pruning the 1-item sets that do not meet the minimum support min_S to obtain a candidate set L₁, and then joining the elements of L₁; taking the joined candidate set as a new sample set and mining the frequent 2-item sets: pruning the 2-item sets that do not meet the minimum support min_S to form a new candidate set L₂; and repeating in this way until the frequent 3-item sets are obtained.

The invention further provides that the information gain (IG) algorithm specifically calculates the difference between the entropy of a feature and its conditional entropy to obtain the IG value of the feature. The larger the IG value, the greater the correlation; important features are retained according to the correlation, matched against each application in the system, and the corresponding vector spaces are constructed. Constructing the vector space specifically comprises: building a feature set X containing the different features (x₁, x₂, …, xₙ) of the application software, and applying the formula v: s → {0,1}^|X| to construct a vector space v from the features in X, where s denotes a particular application and each dimension of v corresponds to one feature in X. If s contains the feature, the corresponding value in v is 1; otherwise it is 0.

The invention also provides an Android malicious application detection system fusing the Apriori algorithm and the random forest algorithm, comprising a feature extraction module, a feature processing module and a random forest classification module. The feature extraction module extracts features from the batch-decompiled Android application software to obtain its static features, namely permissions and sensitive API functions. The feature processing module mines frequent item sets of the permission features and reduces the dimensionality of the permission features to obtain frequent 3-item sets of permissions, so as to obtain the association relations between the permissions in the sample set; it mines the frequent 3-item sets of the malicious samples and of the normal samples, uses the frequent 3-item sets together with the sensitive API functions as features to construct a feature set, screens and scores the feature attributes in the feature set with an information gain algorithm, extracts important features, and constructs the vector spaces corresponding to the important features. The random forest classification module learns from and classifies the vector spaces, the vector spaces of the normal samples and the malicious samples being labeled with normal or malicious attributes.

The method uses static detection to extract the data features of applications, uses the Apriori algorithm to mine the frequent 3-item sets of permissions in normal software and malware, fuses them with sensitive API function calls, and builds a random forest classifier to learn and classify the data features. Further, the IG algorithm obtains the IG value of each feature as the difference between the entropy of the feature and its conditional entropy, important features are retained, and a matching algorithm constructs the vector space corresponding to each application in the system. Because the invention mines frequent 3-item sets from the high-dimensional permission features, it consumes few system resources.

Drawings

FIG. 1 shows an Android malicious application detection model fusing an Apriori algorithm and a random forest algorithm.

Detailed Description

The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.

FIG. 1 is a schematic diagram of the application detection system model of the present invention. To detect malicious applications on the Android system, an Android malicious application detection system is provided that fuses Apriori mining of frequent 3-item sets with random forest classification.

First, the collected normal software and malware samples are decompiled in batches, and the permissions requested by each application (Android permissions) and its sensitive application programming interface (API) function calls are extracted from the decompiled files, such as the application description file AndroidManifest.xml.

The following description will specifically explain each part.

(1) The feature extraction module uses Python scripts to decompile the sample set in batches and to extract features from the decompiled files (AndroidManifest.xml and the smali files). For permission features, the permissions requested by an application are extracted, that is, all requested permissions are read from AndroidManifest.xml. When a user uses a certain function of the system or accesses certain sensitive data, the application must hold the corresponding permission, which is declared in AndroidManifest.xml. For sensitive API functions, all smali files are traversed with the os.walk() method in Python, and the sensitive API functions of all samples are extracted from them, from which the potential malicious behavior of the application software can be inferred. Because smali is the bytecode representation used by the Android virtual machine (Dalvik), each smali file represents one Java class and contains the various system API functions called by the application, and malware must call the corresponding API functions to carry out its malicious behavior. Therefore, all sensitive API functions called in the sample set are used as learning features for the random forest algorithm, which is trained to detect malicious applications.
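The extraction described above can be sketched in Python as follows. This is a minimal illustration, not the patented script: the function names, the regular expression for smali invoke instructions, and the format of the `sensitive_apis` list are all assumptions, and the manifest is assumed to have already been decoded to plain XML by the decompilation tool.

```python
import os
import re
import xml.etree.ElementTree as ET

# XML namespace that prefixes android:name attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Assumed pattern for smali call sites, e.g.
#   invoke-virtual {v0}, Landroid/telephony/SmsManager;->sendTextMessage(...)V
INVOKE_RE = re.compile(r"invoke-\S+\s+\{[^}]*\},\s*(L[^;]+;)->([^\s(]+)\(")


def extract_permissions(manifest_path):
    """Return the set of permissions requested in a decoded AndroidManifest.xml."""
    root = ET.parse(manifest_path).getroot()
    return {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
        if elem.get(ANDROID_NS + "name")
    }


def extract_api_calls(smali_root, sensitive_apis):
    """Walk every .smali file under smali_root and collect the sensitive APIs invoked."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(smali_root):
        for name in filenames:
            if not name.endswith(".smali"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for cls, method in INVOKE_RE.findall(fh.read()):
                    api = f"{cls}->{method}"
                    if api in sensitive_apis:  # regular-expression match, then filter
                        found.add(api)
    return found
```

Which API functions count as sensitive is not listed in the text, so `sensitive_apis` has to be supplied by the caller.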

The sample set must be decompiled before features can be extracted from each sample. Files with the .apk suffix can be decompiled with the decompilation tool apktool, yielding a directory containing the resource files (res), the .so files of third-party SDKs (lib), the smali files, and AndroidManifest.xml.
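A batch decompilation step along these lines might look like the sketch below. It assumes that the `apktool` executable is on the PATH; the helper names and the one-output-directory-per-APK layout are illustrative choices, not taken from the patent.

```python
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir):
    """Build the apktool call that decodes one APK into out_dir (-f overwrites)."""
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def decompile_all(apk_dir, out_root):
    """Decode every .apk in apk_dir; return the list of output directories."""
    outputs = []
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        out_dir = Path(out_root) / apk.stem
        # check=True raises CalledProcessError if apktool fails on a sample.
        subprocess.run(apktool_command(apk, out_dir), check=True)
        outputs.append(out_dir)
    return outputs
```

Each output directory then holds the res, lib and smali folders and the decoded AndroidManifest.xml for one sample.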

Generally, before it can carry out malicious behavior, malware requests certain combinations of dangerous permissions, and the permissions in such a combination depend on each other to produce the malicious behavior. Malware therefore requests not just a single dangerous permission but a combination of them. For example, malicious samples commonly request the combination READ_SMS (read SMS messages), READ_PHONE_STATE (read the phone state) and WRITE_SMS (edit SMS messages), which allows malicious operations such as reading the user's private data and sending it elsewhere. Normal software rarely requests this combination, so software can be judged to be malware from the potential malicious behavior that arises when the different dangerous permissions in a combination work together.

The Apriori algorithm, proposed by Agrawal et al., mines frequent item sets for Boolean association rules. Feature extraction yields a large number of permission features and sensitive API functions. However, the dimensionality of the permission features is very high and the computational complexity is correspondingly large, so the Apriori algorithm is used to mine frequent item sets of the permission features and thereby reduce their dimensionality, giving the frequent 3-item sets of permissions. Mining the frequent 3-item sets of the permission features also reveals the association relations between the permissions in the sample set. The specific steps are as follows.

Frequent 3-item sets of permissions are mined with the Apriori algorithm. Specifically, a normal-sample permission set P and a malicious-sample permission set M are extracted from all samples, where P = {p₁, p₂, …, pₙ} is the permission set of the normal software samples and represents the n permissions requested by all normal samples, and M = {m₁, m₂, …, mₓ} is the permission set of the malicious samples and represents the x permissions requested by all malicious samples. Frequent 3-item sets are mined separately from the permission sets of the normal samples and of the malicious samples. Specifically, the following method can be adopted:

Mining the frequent 1-item sets of a sample permission set: the support S of each permission in the sample permission set is calculated, representing the probability with which the permission appears across the sample set. The 1-item sets that do not meet the minimum support min_S are pruned, and the set that remains is taken as the candidate set L₁; the elements of L₁ are then joined. The joined candidate set, which now contains all 2-item sets, is used as a new sample set from which the frequent 2-item sets are mined: the 2-item sets that do not meet the minimum support min_S are pruned to form a new candidate set L₂. These steps are repeated until the frequent 3-item sets of the sample permission set are obtained.

Joining: among the frequent n-item sets, start from the first item set (say the i-th), search downward for an item set (say the j-th) whose first n-1 elements are the same as those of the i-th, and then join all elements of the i-th item set with the n-th element of the j-th item set to form an (n+1)-item set.

For the normal-sample permission set P, the frequency of occurrence of each of p₁, p₂, …, pₙ is calculated and used as the support S of that element. The minimum support is the lowest acceptable frequency of occurrence for an element of P and lies between 0 and 1. After the frequent 1-item sets are mined, pruning and joining are carried out according to the minimum support, finally giving the frequent 3-item sets of the normal samples.

For the malicious-sample permission set M, the frequency of occurrence of each of m₁, m₂, …, mₓ is calculated and used as the support S of that element. After the frequent 1-item sets are mined, pruning and joining are carried out according to the minimum support, finally giving the frequent 3-item sets of the malicious samples.
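The prune-and-join loop can be illustrated with a short Python sketch. It is a simplified reading of the steps above rather than the patented implementation: function names are hypothetical, support is taken as the fraction of samples containing the item set, and the sample data in the usage note is invented for illustration.

```python
from itertools import combinations


def support(itemset, samples):
    """Fraction of samples whose permission set contains every item in itemset."""
    return sum(1 for s in samples if itemset <= s) / len(samples)


def join(frequent, k):
    """Join frequent (k-1)-item sets that share their first k-2 items into k-item candidates."""
    ordered = sorted(tuple(sorted(f)) for f in frequent)
    candidates = set()
    for a, b in combinations(ordered, 2):
        if a[:k - 2] == b[:k - 2]:
            candidates.add(frozenset(a) | frozenset(b))
    return candidates


def frequent_3_itemsets(samples, min_s):
    """Mine the frequent 3-item sets of permissions with minimum support min_s."""
    samples = [frozenset(s) for s in samples]
    permissions = {p for s in samples for p in s}
    # Frequent 1-item sets: prune permissions below the minimum support.
    level = {frozenset([p]) for p in permissions
             if support(frozenset([p]), samples) >= min_s}
    # Join, then prune, to obtain the frequent 2-item and 3-item sets.
    for k in (2, 3):
        level = {c for c in join(level, k) if support(c, samples) >= min_s}
    return level
```

For instance, with four invented malicious samples of which three request READ_SMS, READ_PHONE_STATE and WRITE_SMS, calling `frequent_3_itemsets(samples, 0.6)` returns that single combination. The same function would be run once on the normal samples and once on the malicious samples.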

(2) Feature processing

After the Apriori algorithm has mined the frequent 3-item sets of the malicious samples and of the normal samples, these item sets are used as features together with the sensitive API functions, and the feature attributes are screened and scored with the information gain (IG) algorithm. The IG algorithm obtains the IG value of a feature as the difference between the information entropy and the conditional entropy of the feature; the larger the value, the greater the correlation.

Entropy calculation: the information entropy H(C) of the sample set is calculated from the probabilities P(C_i) with which normal software and malware respectively appear in the sample set.

Conditional entropy calculation: the conditional entropy H(Y|X_i) is calculated for the i-th feature. The IG value of the i-th feature is then calculated according to the formula IG_i = H(C) - H(Y|X_i). The larger the IG value, the more the feature reduces the uncertainty of classifying a sample as normal software or malware. Features with an IG value of 0 are eliminated, and the remaining features, whose values are not 0, are retained as important features.
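As a rough illustration of the screening step, the sketch below computes IG_i = H(C) - H(Y|X_i) for binary feature columns. The entropy and conditional-entropy formulas are not reproduced in this text, so the standard Shannon definitions with base-2 logarithms are assumed here; the function names and the small tolerance used to decide that a value is "0" are also assumptions.

```python
from math import log2


def entropy(labels):
    """Shannon entropy H(C) = -sum_i P(C_i) * log2 P(C_i) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def information_gain(feature_values, labels):
    """IG_i = H(C) - H(Y | X_i) for one feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        # Labels of the samples in which the feature takes this value.
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def select_features(columns, labels, eps=1e-12):
    """Keep the names of the features whose information gain is not 0."""
    return [name for name, col in columns.items()
            if information_gain(col, labels) > eps]
```

A feature that is present in every malicious sample and absent from every normal one gets the maximum gain H(C), while a feature that is equally common in both classes gets a gain of 0 and is dropped.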

Define X as the set of features retained for the application software, containing the different features (x₁, x₂, …, xₙ), where n is the number of important features. According to the formula v: s → {0,1}^|X|, a vector space v is constructed from the features in X, where s denotes a particular application and each dimension of v corresponds to one feature in X. If s contains the feature, the corresponding value in v is 1; otherwise it is 0, so each value indicates whether the feature is present.

In this method, a matching algorithm is used to construct the vector space v corresponding to each application in the system: after feature screening, a feature set of n features is built, a vector space v is generated for each sample, and the vectors are stored in a MySQL database as the input to the random forest classification module.
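The mapping v: s → {0,1}^|X| can be sketched as below. The matching rule is an assumption made for illustration: a frequent 3-item set counts as contained in an application when the application requests all three permissions, and an API feature counts as contained when the application calls it. The names and sample values are hypothetical, and database storage is left out.

```python
def contains(app_permissions, app_apis, feature):
    """Assumed matching rule: item-set features need all their permissions present."""
    if isinstance(feature, frozenset):
        return feature <= app_permissions
    return feature in app_apis


def build_vector(app_permissions, app_apis, feature_set):
    """Map one application s to v(s): 1 where s contains the feature, else 0."""
    return [1 if contains(app_permissions, app_apis, f) else 0
            for f in feature_set]


# Ordered feature set X: one frequent 3-item set and two API functions (illustrative).
X = [
    frozenset({"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS"}),
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/telephony/TelephonyManager;->getDeviceId",
]
v = build_vector(
    {"READ_SMS", "READ_PHONE_STATE", "WRITE_SMS", "INTERNET"},
    {"Landroid/telephony/TelephonyManager;->getDeviceId"},
    X,
)
```

Keeping X as an ordered list means every sample's vector uses the same dimension order, which the classifier relies on.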

(3) Random forest algorithm classification

After the feature vectors are obtained, detection essentially becomes a classification problem. Because there are only two possible detection results, normal and malicious, it is a binary classification problem, for which the random forest algorithm is well suited. The vector spaces v obtained above are therefore classified with the random forest classification algorithm.

Supervised classification can be adopted as follows: for each application in the collected sets of known normal and malicious samples, a normal or malware attribute is appended to the vector space corresponding to that application, according to whether the application is normal software or malware, as described by the following formula.

Here V(S) denotes the set of all application software, normal indicates that an application is normal software, and malware indicates that it is malware.

After the vector spaces of the training sample set are obtained, they are used to train the random forest classifier. After feature extraction and feature processing are performed on the software under test, its vector space v is obtained; this v carries no normal or malware label, and the label is left blank or replaced by '?'. The random forest classifier trained on the training samples then classifies the vector space of the software under test, and the string normal or malware in the result indicates whether the software under test is normal software or malware, thereby detecting malware.
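To make the training and detection flow concrete, here is a small self-contained random forest for 0/1 feature vectors, written with the standard library only. It is a simplified sketch under stated assumptions (Gini impurity for splits, the square root of the feature count tried per split, 25 trees, a fixed seed), none of which are specified in the text; a production system would more likely use an established machine learning library.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, n_try, rng):
    """Grow one tree on binary features, trying n_try random features per split."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue  # this feature does not split the node
        score = sum(len(p) * gini([labels[i] for i in p]) for p in (left, right))
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:
        return majority(labels)  # no usable split among the sampled features
    _, f, left, right = best
    zero, one = (build_tree([rows[i] for i in p], [labels[i] for i in p], n_try, rng)
                 for p in (left, right))
    return (f, zero, one)


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal node: (feature, 0-branch, 1-branch)
        f, zero, one = tree
        tree = one if row[f] == 1 else zero
    return tree


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the labelled vectors."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, rng))
    return forest


def classify(forest, row):
    """Majority vote of all trees: returns the label 'normal' or 'malware'."""
    return majority([predict_tree(t, row) for t in forest])
```

Training takes the labelled vectors of the known samples; `classify` is then called on the unlabelled vector of the software under test and returns the string normal or malware.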

The invention uses decompilation to decompile the application sample set in batches and to extract the permissions and API functions from the resulting files. To handle the high-dimensional permission features, the Apriori algorithm is used for dimension reduction, giving the frequent 3-item sets of permissions, which are then combined with the sensitive API functions and screened by information gain to obtain the important features. The important features are mapped into a vector space whose values are 0 or 1, the normal and malicious applications are labeled, and labeled vector spaces are finally obtained. A random forest algorithm is used to learn from and classify the sample set.

Claims

1. An Android malicious application detection method fusing a frequent item set algorithm and a random forest algorithm, characterized by comprising the following steps: decompiling Android application software in batches to obtain a sample set, and obtaining the static features of the application software, namely permissions and sensitive application programming interface (API) functions; mining frequent item sets of the permission features and performing dimension reduction on the permission features to obtain frequent 3-item sets of permissions, so as to obtain the association relations between the permissions in the sample set; mining the frequent 3-item sets of the malicious samples and of the normal samples, using the frequent 3-item sets of the malicious samples and of the normal samples together with the sensitive API functions as features to construct a feature set, screening and scoring the feature attributes in the feature set with an information gain algorithm, extracting important features, and constructing the vector spaces corresponding to the important features; learning from and classifying the vector spaces with a random forest classifier, and labeling the vector spaces of the normal samples and the malicious samples with normal or malicious attributes;

mining the frequent 3-item sets of the permission features specifically comprises the following steps: extracting permissions from the malicious samples or the normal samples respectively to construct a permission set; mining the frequent 1-item sets of the permission set: calculating the support S of each permission in the permission set, and pruning the 1-item sets that do not meet the minimum support min_S to obtain a candidate set L₁, then joining the elements of L₁; taking the joined candidate set as a new 2-item set and mining the frequent 2-item sets: pruning the 2-item sets that do not meet the minimum support min_S to form a new candidate set L₂; and repeating these steps until the frequent 3-item sets are obtained;

the information gain (IG) algorithm specifically comprises: calculating the information entropy H(C) of the sample set from the probabilities P(C_i) with which normal software or malware respectively appears in the sample set; calculating the conditional entropy H(Y|X_i) of the i-th feature; and calculating the IG value of the i-th feature according to the formula IG_i = H(C) - H(Y|X_i), wherein the larger the IG value, the greater the correlation of the frequent 3-item sets of the malicious samples and the normal samples; retaining important features according to the correlation, matching the important features against each application in the system, and constructing the vector spaces corresponding to the important features respectively;

constructing the vector space specifically comprises: eliminating the features whose IG value is 0 and retaining the remaining features, whose values are not 0, as important features; building a feature set X containing the different features (x₁, x₂, …, xₙ) of the application software samples; and applying the formula V: s → {0,1}^|X| to construct a vector space V from the features in X, wherein s denotes a particular application and each dimension of V corresponds to one feature in X; if s contains the feature, the corresponding value in the vector space V is 1, otherwise it is 0.

2. The method of claim 1, wherein before feature extraction, a static analysis tool is used to decompile the application software to obtain files comprising the resource files res, the .so files lib of third-party software development kits, the smali files, and the application description file AndroidManifest.xml.

3. The method as claimed in claim 2, characterized in that a Python script is used to extract features: all permissions of an application extracted from the AndroidManifest.xml file are parsed to obtain the permission features, and all smali files are traversed using the os.walk() function in Python.

4. An Android malicious application detection system fusing a frequent item set algorithm and a random forest algorithm, comprising a feature extraction module, a feature processing module and a random forest classification module, characterized in that the feature extraction module extracts features from batch-decompiled Android application software to obtain the static features of the application software, namely permissions and sensitive API functions; the feature processing module mines frequent item sets of the permission features and performs dimension reduction on the permission features to obtain frequent 3-item sets of permissions, so as to obtain the association relations between the permissions in the sample set, mines the frequent 3-item sets of the malicious samples and of the normal samples, uses the frequent 3-item sets together with the sensitive API functions as features to construct a feature set, screens and scores the feature attributes in the feature set with an information gain algorithm, extracts important features, and constructs the vector spaces corresponding to the important features; the random forest classification module learns from and classifies the vector spaces, and a random forest classifier labels the vector spaces of the normal samples and of the malicious samples with normal or malicious attributes;

mining the frequent 3-item sets of the permission features specifically comprises the following steps: extracting permissions from the malicious samples or the normal samples respectively to construct a permission set; mining the frequent 1-item sets of the permission set: calculating the support S of each permission in the permission set, and pruning the 1-item sets that do not meet the minimum support min_S to obtain a candidate set L₁, then joining the elements of L₁; taking the joined candidate set as a new sample set and mining the frequent 2-item sets: pruning the 2-item sets that do not meet the minimum support min_S to form a new candidate set L₂; and repeating these steps until the frequent 3-item sets are obtained;

the information gain (IG) algorithm specifically comprises calculating the difference between the entropy of a feature and its conditional entropy to obtain the IG value of the feature: calculating the information entropy H(C) of the sample set from the probabilities P(C_i) with which normal software or malware respectively appears in the sample set; calculating the conditional entropy H(Y|X_i) of the i-th feature; and calculating the IG value of the i-th feature according to the formula IG_i = H(C) - H(Y|X_i), wherein the larger the IG value, the greater the correlation of the frequent 3-item sets of the malicious samples and the normal samples; retaining important features according to the correlation, matching the important features against each application in the system, and constructing the vector spaces corresponding to the important features respectively;

constructing the vector space specifically comprises: eliminating the features whose IG value is 0 and retaining the remaining features, whose values are not 0, as important features; building a feature set X containing the different features (x₁, x₂, …, xₙ) of the application software samples; and applying the formula V: s → {0,1}^|X| to construct a vector space V from the features in X, wherein s denotes a particular application and each dimension of V corresponds to one feature in X; if s contains the feature, the corresponding value in the vector space V is 1, otherwise it is 0.

5. The detection system according to claim 4, wherein a static analysis tool is used to decompile the application software to obtain files comprising res, lib, smali and AndroidManifest.xml.

6. The detection system according to claim 5, wherein a Python script is used to extract features: all permissions of an application extracted from the AndroidManifest.xml file are parsed to obtain the permission features, the os.walk() function is used to traverse all smali files, and the sensitive API functions of each sample are extracted by regular-expression matching.