CN112100621B

CN112100621B - Android malicious application detection method based on sensitive permission and API

Info

Publication number: CN112100621B
Application number: CN202010951202.XA
Authority: CN
Inventors: 郭方方; 赵天宇; 孙思佳; 王慧强; 吕宏武; 冯光升; 李冰洋; 任威霖
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2022-05-20
Anticipated expiration: 2040-09-11
Also published as: CN112100621A

Abstract

The invention belongs to the technical field of mobile terminal network security, and in particular relates to an Android malicious application detection method based on sensitive permissions and APIs. It addresses the problem that static analysis in conventional Android malicious application detection attends only to high-risk features and ignores low-risk ones. After the permission and API features are obtained, both high-risk and low-risk sensitive features are considered, and low-sensitivity, redundant permission and API features are removed by computing the sensitivity of each permission and API. This reduces the number of permissions and APIs and improves the speed and accuracy of malicious application detection.

Description

Android malicious application detection method based on sensitive permission and API

Technical Field

The invention belongs to the technical field of mobile terminal network security, and particularly relates to an android malicious application detection method based on sensitive permission and an API.

Background

In recent years, among the operating systems for mobile smart devices, Android has rapidly become the mainstream mobile operating system thanks to its open-source nature. As Android download platforms have expanded, many malicious application developers have turned their attention to Android applications. Users often do not notice the operations of a malicious application until a loss has occurred. Examples include stealing private user data such as contacts, mailboxes, location, passwords, and files, and fee-related malicious activities such as fraudulent SMS and call charges or logging in to bank accounts. Across platforms, the Android market ranks first among application markets in the number of high-risk and malicious applications. In 2019, about 950 million malicious-program attacks on mobile phone users in China were blocked, an average of about 2.592 million per day. About 1.809 million new malicious program samples were intercepted on mobile devices, an average of about 5,000 per day. With the arrival of the 5G era, mobile devices are becoming more convenient and faster to use, but security problems are becoming more prominent, so how to effectively address the security of the Android platform is currently an active research field worldwide.

To address these problems, many researchers have focused on methods for detecting Android malicious applications. Current detection methods fall mainly into static analysis and dynamic analysis. Static analysis does not actually run the Android application. Instead, it uses techniques such as reverse engineering, pattern matching, and static system-call analysis to examine the program's source code or bytecode, and performs data-flow and control-flow analysis to find execution paths in the program that may carry out malicious behavior. Hou S et al., starting from static API calls extracted from smali files, group the API calls that belong to the same method in the smali code into a block, and then apply a deep learning framework to the generated code blocks to detect unknown Android malware. (Hou S, Saas A, Ye Y, et al. DroidDelver: An Android Malware Detection System Using Deep Belief Network Based on API Call Blocks [M]// Web-Age Information Management. Springer International Publishing, 2016.)

Dynamic analysis runs the application in a real or virtual device environment, generates as many execution paths as possible to cover the code, monitors run-time behavior, collects run-time data such as permission changes, network I/O, and system calls, and then analyzes the data to determine whether the Android application has a security problem. Dynamic analysis can therefore discover malicious behaviors that appear only when the application runs, such as those involving dynamic loading or code obfuscation. DroidScribe, proposed by Dash et al., collects multi-dimensional, multi-level dynamic features, including system calls, decoded Binder communication, and abstracted behavior patterns, and detects and classifies malware with an SVM classification algorithm. (Dash S K, Suarez-Tangil G, Khan S, et al. DroidScribe: Classifying Android Malware Based on Runtime Behavior [C]// 2016 IEEE Security and Privacy Workshops (SPW). IEEE, 2016: 252-

In summary, dynamic detection incurs a large overhead in time and resources, and the feature information it extracts is unstable; static detection overcomes these difficulties. In practice, many applications are published on Android application markets every day, and dynamic detection is too costly to detect the malicious programs on a platform in a short time. Static detection balances efficiency and overhead well, achieving high detection accuracy at a low cost in time and resources, and therefore suits the needs of Android application markets.

Disclosure of Invention

The invention aims to solve the problem that only the high-risk features are concerned and the low-risk features are ignored in the conventional static analysis of android malicious application detection, and provides an android malicious application detection method based on sensitive permission and an API (application program interface).

The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:

Step 1: Input the Android application samples to be detected; take part of the samples to construct a training set, with the remaining samples forming a test set. Label the Android application samples in the training set, dividing the training set into a malicious application set M and a benign application set B. Set the sensitivity threshold η and the parameters d and k of the random forest classifier;

Step 2: Obtain the permission feature set P = {p₁, p₂, …, p_i, …} and the API feature set A = {a₁, a₂, …, a_i, …} of the Android application samples in the training set;

Step 3: Compute the sensitivity S(p_i) of each permission p_i in the permission feature set P;

where I(p_i, m) denotes the correlation between permission p_i and malicious applications, m ∈ M; I(p_i, b) denotes the correlation between permission p_i and benign applications, b ∈ B; p(p_i) is the probability that permission p_i occurs in an Android application sample; p(m) is the probability that an Android application sample is a malicious application; p(b) is the probability that an Android application sample is a benign application; p(p_i, m) is the probability that permission p_i occurs in an Android application sample and the sample is a malicious application; p(p_i, b) is the probability that permission p_i occurs in an Android application sample and the sample is a benign application;

Step 4: Compute the sensitivity S(a_i) of each API feature a_i in the API feature set A;

where I(a_i, m) denotes the correlation between API feature a_i and malicious applications; I(a_i, b) denotes the correlation between API feature a_i and benign applications; p(a_i) is the probability that API feature a_i occurs in an Android application sample; p(a_i, m) is the probability that API feature a_i occurs in an Android application sample and the sample is a malicious application; p(a_i, b) is the probability that API feature a_i occurs in an Android application sample and the sample is a benign application;

Step 5: Screen the permission feature set P and the API feature set A;

If the sensitivity S(p_i) of permission p_i in the permission feature set P is greater than the sensitivity threshold η, permission p_i is retained in P; otherwise it is deleted from P;

If the sensitivity S(a_i) of API feature a_i in the API feature set A is greater than the sensitivity threshold η, API feature a_i is retained in A; otherwise it is deleted from A;
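The screening rule of Step 5 can be sketched in a few lines of Python. This is a minimal illustration, assuming the sensitivities have already been computed and stored in a dictionary; the function name `screen_features` and the numeric values used when calling it are hypothetical, not taken from the method.

```python
def screen_features(sensitivity: dict[str, float], eta: float) -> set[str]:
    """Keep only features whose sensitivity is strictly greater than eta.

    sensitivity maps a permission or API feature name to its S(.) value;
    eta is the sensitivity threshold set in Step 1.
    """
    return {name for name, s in sensitivity.items() if s > eta}
```

The same function applies to both P and A, since the rule is identical for permissions and API features. A feature whose sensitivity equals η is dropped, because the step retains a feature only when its sensitivity is greater than the threshold.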

Step 6: Construct a random forest classifier using the screened permission feature set P and API feature set A;

Step 6.1: Draw N times from the N Android application samples of the training set to obtain a data set D containing N Android application samples, where N is the number of Android application samples in the training set;

Step 6.2: When splitting each node, randomly select d static features from the permission feature set P and the API feature set A, compute the information gain of each of the d static features, and select the static feature with the largest information gain as the splitting attribute of the current node. Split the node on its splitting attribute: the Android application samples in data set D that have the splitting attribute go to the left child of the node, and the remaining samples go to the right child;

Step 6.3: Split every node of the decision tree according to Step 6.2; a node stops splitting once all of its samples belong to malicious applications or all belong to benign applications;

Step 6.4: Repeat Steps 6.1 to 6.3 to generate k decision trees, and combine the k decision trees into a random forest classifier;
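Steps 6.1 to 6.4 can be sketched in plain Python as follows. This is an illustrative sketch rather than the patented implementation. It assumes each sample is a pair `(feature_set, label)` with label `"m"` or `"b"`, that the N draws of Step 6.1 are made with replacement (the usual bootstrap; the step itself states only that N draws are made), and that a node falls back to the majority label when none of the d candidate features separates its samples. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(data, feature):
    """Information gain of splitting (features, label) pairs on `feature`."""
    left = [lab for feats, lab in data if feature in feats]
    right = [lab for feats, lab in data if feature not in feats]
    if not left or not right:
        return 0.0
    n = len(data)
    parent = entropy([lab for _, lab in data])
    return parent - (len(left) / n * entropy(left) + len(right) / n * entropy(right))


def build_tree(data, features, d, rng):
    """Return a leaf label or a (feature, has_feature_subtree, lacks_feature_subtree) tuple."""
    labels = [lab for _, lab in data]
    if len(set(labels)) == 1:  # pure node: stop splitting (Step 6.3)
        return labels[0]
    candidates = rng.sample(features, min(d, len(features)))  # Step 6.2
    best = max(candidates, key=lambda f: info_gain(data, f))
    left = [s for s in data if best in s[0]]
    right = [s for s in data if best not in s[0]]
    if not left or not right:  # ASSUMPTION: majority-label leaf if no split helps
        return Counter(labels).most_common(1)[0][0]
    return (best, build_tree(left, features, d, rng), build_tree(right, features, d, rng))


def build_forest(train, features, d, k, seed=0):
    """Grow k trees, each on its own data set D of N drawn samples (Steps 6.1, 6.4)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        dataset = [rng.choice(train) for _ in train]  # ASSUMPTION: with replacement
        forest.append(build_tree(dataset, features, d, rng))
    return forest
```

Here `features` is a list holding the screened permissions and API features together, and `d` and `k` are the parameters set in Step 1.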

Step 7: Obtain the permission feature set P_d and the API feature set A_d of the Android application samples in the test set, and input P_d and A_d into the trained random forest classifier to obtain the detection result.

The present invention may further comprise:

In Step 2, the permission feature set P = {p₁, p₂, …, p_i, …} and the API feature set A = {a₁, a₂, …, a_i, …} of the Android application samples in the training set are obtained as follows:

Step 2.1: Decompile the Android application samples in the training set with apktool; the files generated by decompilation include AndroidManifest.xml, the res folder, apktool.yml, and the smali folder;

Step 2.2: Obtain the permission information from AndroidManifest.xml, delete duplicate permissions, and form the permission feature set P = {p₁, p₂, …, p_i, …} from all the deduplicated permission features;

Step 2.3: Traverse each smali file and extract all API data, including API names, parameters, and API return values; deduplicate the API information extracted from each sample, and form the API feature set A = {a₁, a₂, …, a_i, …} from the deduplicated API call information.
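Steps 2.2 and 2.3 can be illustrated with a short Python sketch. It assumes the sample has already been decoded by apktool, so that AndroidManifest.xml is plain XML and the code is available as smali text. The function names and the regular expression are illustrative assumptions; the expression covers ordinary `invoke-*` instructions on class types and is not a complete smali parser.

```python
import re
import xml.etree.ElementTree as ET

# XML namespace behind the "android:" attribute prefix in decoded manifests.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Matches smali invoke instructions such as:
#   invoke-virtual {v0, v1}, Lcom/foo/Bar;->baz(I)V
# Groups: owning class, method name, parameter descriptors, return descriptor.
INVOKE_RE = re.compile(
    r"invoke-[\w/-]+\s+\{[^}]*\},\s*(L[^;]+;)->([^(\s]+)\(([^)]*)\)(\S+)"
)


def extract_permissions(manifest_xml: str) -> set[str]:
    """Return the deduplicated permission names declared in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    return {
        el.attrib[name_attr]
        for el in root.iter("uses-permission")
        if name_attr in el.attrib
    }


def extract_api_calls(smali_text: str) -> set[tuple[str, ...]]:
    """Return deduplicated (class, name, parameters, return type) tuples."""
    return {m.groups() for m in INVOKE_RE.finditer(smali_text)}
```

Returning sets gives the deduplication required by both steps. In a full pipeline, `extract_api_calls` would be applied to every file under the smali folder and the results merged per sample.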

The invention has the beneficial effects that:

The method solves the problem that static analysis in conventional Android malicious application detection attends only to high-risk features and ignores low-risk ones. After the permission and API features are obtained, the invention considers both high-risk and low-risk sensitive features, and removes low-sensitivity, redundant permission and API features by computing the sensitivity of each permission and API. This reduces the number of permissions and APIs and improves the speed and accuracy of malicious application detection.

Drawings

FIG. 1 is a flow chart of a method for android malicious application detection based on sensitive permissions and APIs.

FIG. 2 is the i-th decision tree in the random forest of the embodiment of the present invention, where i ∈ {1, 2, 3, …, k}.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention belongs to the field of mobile terminal network security, relates to an android malicious application detection method, and particularly relates to an android malicious application detection method based on sensitive permission and an API (application programming interface).

Dynamic detection incurs a large overhead in time and resources, and the feature information it extracts is unstable; static detection overcomes these difficulties. In practice, many applications are published on Android application markets every day, and dynamic detection is too costly to detect the malicious programs on a platform in a short time. Static detection balances efficiency and overhead well, achieving high detection accuracy at a low cost in time and resources, and therefore suits the needs of Android application markets. The invention therefore uses static detection. However, conventional methods attend only to high-risk sensitive features when extracting static features, and low-risk sensitive features are often ignored, even though low-risk sensitive features are also highly effective at distinguishing benign applications from malicious ones. The invention provides an Android malicious application detection method based on sensitive permissions and APIs. The extracted sensitive permissions and APIs include both high-risk and low-risk sensitive features, and a random forest classifier is trained on these features for classification, so that a high detection rate can be obtained in a short time.

To solve the problem that static analysis in conventional Android malicious application detection attends only to high-risk features and ignores low-risk ones, the invention provides an Android malicious application detection method based on sensitive API extraction. The method first decompiles each sample with apktool to obtain its permission and API call information. It then deletes low-sensitivity features according to their sensitivity and retains the high-sensitivity features to form a feature set, which contains both high-risk and low-risk features. Finally, it trains a random forest classifier on this feature set to classify unknown applications. The invention specifically comprises the following:

step 1: android application samples are obtained, including malicious applications and benign applications. Malicious applications refer to any application that detracts from the interests of the user, and benign applications refer to applications that do not detract from the interests of the user.

Step 2: Obtain the static features of the application samples; the static features comprise only permission features and API call information. The permission features form the permission feature set P = {p₁, p₂, …, p_i, …}, and the API call information forms the API feature set A = {a₁, a₂, …, a_i, …}.

Step 3: Obtain the sensitivity S(p_i) of each permission p_i in permission set P and the sensitivity S(a_i) of each API call a_i in API call set A.

Step 4: Compare S(p_i) with the sensitivity threshold η. If S(p_i) is greater than η, the permission is retained in set P; otherwise it is deleted from set P. Compare S(a_i) with the sensitivity threshold η. If S(a_i) is greater than η, the API is retained in set A; otherwise it is deleted from set A.

Step 5: Using the permission feature set and API feature set obtained in Step 4, construct k decision trees with the known method of choosing the splitting attribute by information gain; the k decision trees are combined into a random forest classifier.

Step 6: Following Step 2, extract the permission feature set P_d and the API call feature set A_d of the application to be detected, and use the random forest classifier to detect the application based on P_d and A_d.

The specific process of obtaining the static features in the application program sample in the step 2 is as follows:

(2.1) Decompile the sample with apktool; the decompiled files include AndroidManifest.xml, the res folder, apktool.yml, and the smali folder.

(2.2) Obtain the permission information from AndroidManifest.xml, delete duplicate permissions, and form the permission feature set P = {p₁, p₂, …, p_i, …} from all the deduplicated permission features.

(2.3) Traverse each smali file and extract all API data, including API names, parameters, and API return values; deduplicate the API information extracted from each sample, and form the API feature set A = {a₁, a₂, …, a_i, …} from the deduplicated API call information.

In Step 3, the sensitivity S(p_i) of permission p_i in permission set P and the sensitivity S(a_i) of API call a_i in API call set A are obtained as follows:

The correlation I(p_i, m) between permission p_i and malicious applications m, and the correlation I(p_i, b) between permission p_i and benign applications b, are obtained with the mutual information formula.

Here p(p_i) is the probability that permission p_i appears in a sample, p(m) is the probability that an application is malicious, p(b) is the probability that an application is benign, p(p_i, m) is the probability that p_i appears in a sample and the application is malicious, and p(p_i, b) is the probability that p_i appears in a sample and the application is benign.

The sensitivity S(p_i) of p_i is then computed from I(p_i, m) and I(p_i, b).

S(p_i) lies in the range [0, 1]. S(p_i) = 0 indicates that permission p_i is a low-sensitivity permission that is frequently used in both malicious and benign applications. S(p_i) = 1 indicates that permission p_i is a highly sensitive permission: either a low-risk permission invoked only in benign applications or a high-risk permission invoked only in malicious applications.

The sensitivity S(a_i) of API call a_i is computed in the same way as the sensitivity S(p_i) of permission p_i.
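The probabilities and correlation terms defined above can be estimated from a labelled training set by counting. The Python sketch below is illustrative only. It assumes each correlation term has the standard mutual-information form p(x, y) · log₂(p(x, y) / (p(x) · p(y))), which is an assumption based on the text naming "the mutual information formula"; it stops at the two correlation terms and does not attempt the sensitivity S itself. The function name and the sample layout are hypothetical.

```python
import math


def correlation_terms(samples, feature):
    """Estimate I(feature, m) and I(feature, b) from labelled samples.

    samples: list of (feature_set, label) pairs, label is "m" or "b".
    ASSUMPTION: each term is p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    """
    n = len(samples)
    # p(feature): fraction of samples in which the feature appears.
    p_feature = sum(feature in feats for feats, _ in samples) / n
    terms = {}
    for label in ("m", "b"):
        # p(m) or p(b): fraction of samples carrying this label.
        p_label = sum(lab == label for _, lab in samples) / n
        # p(feature, label): feature present AND sample has this label.
        p_joint = sum(
            feature in feats and lab == label for feats, lab in samples
        ) / n
        if p_joint == 0.0:
            terms[label] = 0.0  # convention: 0 * log(0) = 0
        else:
            terms[label] = p_joint * math.log2(p_joint / (p_feature * p_label))
    return terms["m"], terms["b"]
```

Under this assumed form, a feature that appears in every sample yields zero for both terms, which agrees with the description of a feature used equally in malicious and benign applications as insensitive. The same function serves for API features a_i.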

The invention has the beneficial effects that: according to the invention, after the permission and API characteristics are obtained, not only high-risk sensitive characteristics but also low-risk sensitive characteristics are considered, and the low-sensitivity redundant permission and API characteristics are removed by calculating the sensitivity of each permission and API, so that the number of the permissions and APIs is reduced, and the speed and accuracy of malicious application program detection are improved.

Example 1:

an android malicious application detection method based on sensitive permission and API comprises the following steps:

Step 2: Obtain the static features of the application samples; the static features comprise only permission features and API call information. The permission features form the permission feature set P = {p₁, p₂, …, p_i, …}, and the API call information forms the API feature set A = {a₁, a₂, …, a_i, …}.

Step 4: Compare S(p_i) with the sensitivity threshold η. If S(p_i) is greater than η, the permission is retained in set P; otherwise it is deleted from set P. Compare S(a_i) with the sensitivity threshold η. If S(a_i) is greater than η, the API is retained in set A; otherwise it is deleted from set A.

Step 6: Following Step 2, extract the permission feature set P_d and the API call feature set A_d of the application to be detected, and use the random forest classifier to detect the application based on P_d and A_d.

The correlation I(p_i, m) between permission p_i and malicious applications m, and the correlation I(p_i, b) between permission p_i and benign applications b, are obtained with the mutual information formula.

Here p(p_i) is the probability that permission p_i appears in a sample, p(m) is the probability that an application is malicious, p(b) is the probability that an application is benign, p(p_i, m) is the probability that p_i appears in a sample and the application is malicious, and p(p_i, b) is the probability that p_i appears in a sample and the application is benign.

The specific process of constructing a random forest classifier by using the known method in the step 5 is as follows:

(5.1) Let N be the number of training samples. Draw training samples N times without replacement to obtain a data set D containing N training samples.

(5.2) When splitting each node, randomly select m static features from the M static features (M is the total number of permission features and API features, and m is much smaller than M). Compute the information gain g of each of the m static features and select the static feature with the largest information gain as the splitting attribute of the current node. Split the node on this attribute: applications in data set D that have the splitting attribute are assigned to the left child of the node, and applications that do not have it are assigned to the right child.

(5.3) Each node of the current decision tree is split according to step (5.2); a node stops splitting once all of its samples belong to malicious applications or all belong to benign applications.

(5.4) Repeat steps (5.1), (5.2), and (5.3) in order to generate k decision trees, and combine the k decision trees into a random forest classifier.

The specific process of detecting the application program to be detected by using the random forest classifier in the step 6 is as follows:

(6.1) For the i-th decision tree in the random forest, whose shape is shown in FIG. 2, the application to be detected is judged as follows. First, determine whether permission p₁ in node 1 is present in P_d. If it is, move to node 2 on the left of node 1. Node 2 is a leaf node and the applications in node 2 are all benign, so the application to be detected is judged to be benign. If it is not, move to node 3 on the right of node 1 and determine whether API a₁ in node 3 is present in A_d. If it is not, move to node 5 on the right of node 3. Node 5 is a leaf node and all applications in node 5 are malicious, so the application to be detected is judged to be malicious.

(6.2) Each of the k decision trees in the random forest judges the application to be detected according to step (6.1), and the final category is determined by the majority of the k results. For example, with 100 decision trees, if 70 judge the application to be malicious and 30 judge it to be benign, the application is finally determined to be malicious.
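Steps (6.1) and (6.2) can be sketched as a tree walk followed by a majority vote. The Python below is a minimal illustration, assuming each tree is stored as nested tuples `(feature, subtree_if_present, subtree_if_absent)` with string labels at the leaves; that representation and all names are hypothetical. The text does not describe node 4 of FIG. 2 (the case where a₁ is present), so its label in `FIG2_TREE` is a placeholder only.

```python
from collections import Counter


def classify(tree, perms, apis):
    """Walk one decision tree and return the label of the leaf reached."""
    while isinstance(tree, tuple):
        feature, present, absent = tree
        # Left branch if the feature is in P_d or A_d, right branch otherwise.
        tree = present if feature in perms or feature in apis else absent
    return tree


def majority_vote(forest, perms, apis):
    """Return the label chosen by most of the trees in the forest."""
    votes = Counter(classify(tree, perms, apis) for tree in forest)
    return votes.most_common(1)[0][0]


# Tree shaped like the FIG. 2 walk-through: node 1 tests permission p1 and
# node 3 tests API a1. Node 4 (a1 present) is a HYPOTHETICAL placeholder label.
FIG2_TREE = ("p1", "benign", ("a1", "benign", "malicious"))
```

With 70 trees voting malicious and 30 voting benign, `majority_vote` returns the malicious label, as in the example of step (6.2). The text does not say how a tie is resolved; this sketch returns whichever tied label was counted first.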

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An android malicious application detection method based on sensitive permission and API is characterized by comprising the following steps:

where I(p_i, m) denotes the correlation between permission p_i and malicious applications, m ∈ M; I(p_i, b) denotes the correlation between permission p_i and benign applications, b ∈ B; p(p_i) is the probability that permission p_i occurs in an Android application sample; p(m) is the probability that an Android application sample is a malicious application; p(b) is the probability that an Android application sample is a benign application; p(p_i, m) is the probability that permission p_i occurs in an Android application sample and the sample is a malicious application; p(p_i, b) is the probability that permission p_i occurs in an Android application sample and the sample is a benign application;

Step 5: Screen the permission feature set P and the API feature set A;

2. The Android malicious application detection method based on sensitive permission and API as claimed in claim 1, characterized in that in Step 2 the permission feature set P = {p₁, p₂, …, p_i, …} and the API feature set A = {a₁, a₂, …, a_i, …} of the Android application samples in the training set are obtained as follows:

Step 2.2: Obtain the permission information from AndroidManifest.xml, delete duplicate permissions, and form the permission feature set P = {p₁, p₂, …, p_i, …} from all the deduplicated permission features;