CN108280350B

CN108280350B - Android-oriented mobile network terminal malicious software multi-feature detection method

Info

Publication number: CN108280350B
Application number: CN201810109044.6A
Authority: CN
Inventors: 庄毅; 王军; 顾晶晶; 蒋理; 杨帆; 孙炳林
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-02-05
Filing date: 2018-02-05
Publication date: 2021-09-28
Anticipated expiration: 2038-02-05
Also published as: CN108280350A

Abstract

The invention discloses an Android-oriented multi-feature detection method for malicious software on mobile network terminals. The method comprises the following steps: step 1, obtaining an Android software dataset comprising malicious samples and non-malicious samples; step 2, analyzing the installation package of the malicious software, extracting the installation package features of the software, and constructing an installation package feature vector; step 3, obtaining the permissions requested by the software and constructing a permission list; step 4, decompiling the installation package of the malicious software, constructing a sensitive behavior graph of the software, and extracting the sensitive behavior set of the software; step 5, performing statistical analysis on the software features of samples belonging to the same malware family in the malicious samples to construct a malware family feature library; and step 6, extracting the software features and performing maliciousness judgment and malicious family classification. By selecting software package features, permission features, and sensitive behavior call features as the basis for judging malicious software, the invention can improve the accuracy of detecting malicious software behavior and provides the ability to classify malware families.

Description

Android-oriented mobile network terminal malicious software multi-feature detection method

Technical Field

The invention belongs to the field of mobile software analysis and information security, and particularly relates to an Android-oriented mobile network terminal malicious software multi-feature detection method.

Background

Multi-label detection of Android malicious code is a challenging problem in academia and industry: the maliciousness of the software must be judged and, at the same time, the family to which the software belongs must be given. Smartphone applications now touch many aspects of people's lives, and the Android system holds a large share of the smartphone market, so accurately detecting Android malicious code is of great significance and practical value for protecting the privacy and property of Android users.

Existing Android malware detection techniques fall into two main types: those based on static analysis and those based on dynamic analysis. Dynamic analysis simulates the execution of the software and can sidestep problems such as code obfuscation and encryption that static methods encounter; however, the code coverage of dynamic testing is low, and some malicious programs can prevent themselves from running under an emulator. Static analysis mainly applies decompilation, or control-flow and data-flow analysis on smali intermediate code; it can analyze software automatically, offers high detection efficiency and high code coverage, and is suitable for analyzing large numbers of software samples. Its disadvantage is that code obfuscation, encryption, and malicious code that is decoded during dynamic execution are difficult to detect with static methods. To address this problem, researchers have taken techniques such as encryption, dynamic code loading, and dynamic loading of Native code into account in malware detection, for example in RiskRanker and DroidRanger.

At present, many scholars have studied multi-label detection methods for Android malware. For example, Daniel Arp et al. proposed an Android malicious code multi-label detection method based on static analysis, which extracts a large number of static features from the software installation package and classifies them with a support vector machine, achieving efficient detection; Yu Feng et al. provided a feature description language for describing Android malware families and classified the software under test with a feature matching algorithm, achieving semantics-based Android malware detection; Chao Yang et al. described the logical behavior of software with a two-stage behavior graph representation, judged the maliciousness of software by analyzing malicious behavior patterns in combination with static taint analysis and inter-component behavior graphs, and achieved malicious family classification.

However, conventional research on Android malware multi-label detection selects all malware samples for analysis and extracts their features as the basis for judging the maliciousness of the software under test. Malware belonging to different families exhibits different malicious behaviors, and the features expressed by those behaviors also differ greatly, whereas malware of the same family exhibits similar malicious behavior. Existing malware detection tools are weak at multi-label detection: for example, when McAfee is used to detect the malicious samples in the Genome dataset, more than 90% of the samples are detected as Trojan or Downloader, although they actually belong to a number of different malware families (e.g., DroidDream). Both speed and accuracy therefore need further improvement, and an efficient malware multi-label detection method needs to be studied.

Disclosure of Invention

The invention aims to provide an Android-oriented mobile network terminal malware multi-feature detection method that effectively extracts the features of Android malware, improves Android malware detection accuracy, and is capable of classifying Android malware families.

The technical solution for realizing the invention is as follows: a mobile network terminal malicious software multi-feature detection method for Android specifically comprises the following steps:

step 1, obtaining Android malware samples, labeling the Android malware family to which each sample belongs, and then obtaining non-malware samples, thereby constructing the malicious and non-malicious sample datasets;

step 2, extracting the installation package features of the software, including: whether a .so file exists, whether a file for rooting the system exists, whether an abnormal file exists, and whether a subprogram exists, thereby constructing the installation package feature vector F;

step 3, processing the Android software sample with a decompiling tool, parsing the AndroidManifest.xml file, and extracting the permission list P requested by the software according to the tag fields in the XML;

step 4, decompiling the installation package, constructing the software function call graph, locating the security-sensitive methods in it, constructing the sensitive behavior graph SBG of the software, then obtaining the context information of the security-sensitive methods by data flow analysis, the directly or indirectly called security-sensitive methods forming the sensitive behavior set SBS of the software;

step 5, performing statistical analysis on the software features of samples belonging to the same malware family in the malicious samples to obtain the occurrence probability of each feature component, and constructing the Android malware family multi-feature model M, thereby constructing the malware family feature library;

step 6, extracting the features of the software to be tested using the methods of steps 2-4, matching these features against the malware family feature library to obtain the name of the malware family with the highest similarity, and, if the similarity exceeds a threshold value, outputting the software as malicious software together with the malware family to which it belongs, otherwise outputting the software as benign software.
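The decision rule of step 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify`, the dictionary of per-family similarity scores, and the family names are hypothetical, and the default threshold of 0.7 is the value used in the embodiment rather than a value fixed by the method itself.

```python
from typing import Dict, Optional, Tuple


def classify(similarities: Dict[str, float],
             threshold: float = 0.7) -> Tuple[str, Optional[str]]:
    """Pick the most similar family, then apply the threshold test.

    `similarities` maps a malware family name to the similarity between
    the software under test and that family's multi-feature model.
    """
    if not similarities:
        # Assumption: with an empty feature library nothing can match.
        return "benign", None
    best_family = max(similarities, key=similarities.get)
    if similarities[best_family] > threshold:
        return "malicious", best_family
    return "benign", None


# Hypothetical scores for two hypothetical families.
print(classify({"FamilyA": 0.82, "FamilyB": 0.41}))
print(classify({"FamilyA": 0.55, "FamilyB": 0.41}))
```

Because only the single best-matching family is compared with the threshold, the maliciousness judgment and the family label come out of the same lookup.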

Compared with the prior art, the invention has the following remarkable advantages: 1) the invention provides an Android-oriented mobile network terminal malicious software multi-feature detection method, aiming at different malicious software families, analyzing software from three aspects of software package features, application authority features and software behavior calling features on the basis of a static analysis method; 2) according to the invention, a statistical analysis method is adopted to extract features of a malicious software family, a malicious software family feature library is constructed, and a malicious software multi-label detection method is provided based on the feature library, so that better malicious judgment precision and malicious family classification precision can be achieved.

The invention is explained in further detail below with reference to the drawings.

Drawings

Fig. 1 is a flowchart of a malicious software multi-feature detection method for an Android-oriented mobile network terminal according to the present invention.

FIG. 2 compares the malware detection accuracy and malicious family classification accuracy of the present invention with those of some of the engines in VirusTotal.

Detailed Description

With reference to the attached drawings, the Android-oriented mobile network terminal malicious software multi-feature detection method comprises the following steps:

The abnormal file refers to a file whose suffix does not match the type indicated by the file content itself; whether a file for rooting the system exists is judged by checking, via its MD5 value, whether a library file is a root extension library file; whether a subprogram exists is judged by checking whether the installation package contains jar, dex, or apk files.
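The three checks above can be sketched with the Python standard library. Everything named here is an assumption made for illustration: the function `package_features`, the feature names, the small magic-byte table `MAGIC` (the embodiment uses Apache Tika for content-type detection instead), the rule that only `.so` files are hashed against the known root library list, and the choice to ignore the primary `classes.dex` when looking for subprograms.

```python
import hashlib
import io
import posixpath
import zipfile
from typing import Dict, Set

# Hypothetical magic-byte table; a real system would use a content
# detector such as Apache Tika, as the embodiment does.
MAGIC = {
    ".png": b"\x89PNG",
    ".jpg": b"\xff\xd8\xff",
    ".so": b"\x7fELF",
    ".apk": b"PK\x03\x04",
    ".jar": b"PK\x03\x04",
    ".dex": b"dex\n",
}
SUBPROGRAM_SUFFIXES = (".jar", ".dex", ".apk")


def package_features(apk_bytes: bytes, known_root_md5: Set[str]) -> Dict[str, int]:
    """Build a binary installation package feature vector F."""
    features = {"has_so": 0, "has_root_file": 0,
                "has_abnormal_file": 0, "has_subprogram": 0}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = info.filename.lower()
            suffix = posixpath.splitext(name)[1]
            data = zf.read(info)
            if suffix == ".so":
                features["has_so"] = 1
                # Root check: MD5 of the library against a known list.
                if hashlib.md5(data).hexdigest() in known_root_md5:
                    features["has_root_file"] = 1
            # Abnormal file: suffix disagrees with the leading bytes.
            expected = MAGIC.get(suffix)
            if expected is not None and not data.startswith(expected):
                features["has_abnormal_file"] = 1
            # Subprogram: an embedded jar/dex/apk other than classes.dex
            # (excluding classes.dex is an assumption of this sketch).
            if suffix in SUBPROGRAM_SUFFIXES and name != "classes.dex":
                features["has_subprogram"] = 1
    return features
```

The function takes the raw bytes of the installation package, so it can be exercised on an in-memory archive without touching the file system.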

The security-sensitive methods comprise: permission-protected methods, information-flow Source/Sink methods, and other suspicious methods; a permission-protected method is an API that can be used in the Android system only after the corresponding permission has been requested; an information-flow Source/Sink method is a method that may produce or send sensitive information; and the other suspicious methods include dynamic loading functions, reflection functions, encryption and decryption functions, and Native code execution and calling functions.

The constructed sensitive behavior graph is the following four-tuple:

$SBG = (V^D, V^N, E, \mu)$

where $V^D$ is a subset of the vertex set of the software sensitive behavior call graph, any node $v_d \in V^D$ of which is a security-sensitive method; $V^N$ is a subset of the vertex set of the software sensitive behavior call graph, any node $v_n \in V^N$ of which is a non-security-sensitive method that directly or indirectly calls a security-sensitive method; $E \subseteq V^N \times V^D$ is the edge set of the software sensitive behavior call graph and indicates that a calling relationship exists between methods, where any edge $e = (v_n, v_d) \in E$ indicates that a non-security-sensitive method $v_n \in V^N$ in the software directly or indirectly calls a security-sensitive method $v_d \in V^D$, or that a method $v_n$ of component $C_s$ directly or indirectly triggers a method $v_d$ of component $C_t$ through ICC; the labeling function $\mu: V_d \to \langle ID, EntryType, Para \rangle$ labels the content contained in a node of the graph, namely the context information of the method, including the method ID, the entry point type EntryType, and the parameter Para;

the sensitive behavior set is the set shown below:

$SBS = \{S_1, \dots, S_i, \dots, S_m\}$

where $S_i = \{v \mid (v_i, v) \in E \wedge v_i \in V^N \wedge v \in V^D\}$ is a set of security-sensitive methods, namely the set of all security-sensitive methods that are directly or indirectly called by the $i$-th non-security-sensitive method of the set $V^N$ in the sensitive behavior call graph $SBG = (V^D, V^N, E, \mu)$; $m = |V^N|$ is the length of the set SBS.
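The construction of $V^N$, $E$, and SBS from a call graph can be sketched as a reachability computation. This is only an illustration under assumptions: the call graph is given as a plain dictionary from caller to callees, inter-component (ICC) edges and the labeling function $\mu$ are ignored, and the names `reachable`, `build_sbs`, and the method names in the example are hypothetical.

```python
from typing import Dict, Set


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """All methods reachable from `start` through one or more calls."""
    seen: Set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def build_sbs(graph: Dict[str, Set[str]],
              sensitive: Set[str]) -> Dict[str, Set[str]]:
    """Map each v_n in V^N to its set S_i of sensitive methods.

    A non-sensitive method belongs to V^N only if it directly or
    indirectly calls at least one security-sensitive method.
    """
    methods = set(graph) | {c for callees in graph.values() for c in callees}
    sbs: Dict[str, Set[str]] = {}
    for method in sorted(methods - sensitive):
        hits = reachable(graph, method) & sensitive
        if hits:
            sbs[method] = hits
    return sbs


# Hypothetical call graph: onCreate -> helper -> sendTextMessage.
call_graph = {
    "onCreate": {"helper"},
    "helper": {"sendTextMessage"},
    "onClick": {"log"},
    "log": set(),
}
sensitive_methods = {"sendTextMessage"}

sbs_by_caller = build_sbs(call_graph, sensitive_methods)
v_n = sorted(sbs_by_caller)                      # the set V^N
edges = sorted((vn, vd) for vn, hits in sbs_by_caller.items() for vd in hits)
sbs = [sbs_by_caller[vn] for vn in v_n]          # SBS, with m = |V^N|
print(v_n, edges, len(sbs))
```

In this example `onClick` and `log` never reach a sensitive method, so they are left out of $V^N$ and contribute no set to SBS.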

The constructed Android malware family multi-feature model is the following six-tuple:

$M = (SBS^c, \alpha, F^c, \beta, P^c, \gamma)$

where $SBS^c$ is the sensitive behavior set common to a malware family, obtained by statistically analyzing the sensitive behavior sets SBS of samples of the same malware family; the labeling function $\alpha: S_i^c \in SBS^c \to [0,1]$ labels the probability with which each sensitive method set in $SBS^c$ occurs in the samples of the malware family; $F^c$ is the set of software installation package features common to the samples of the malware family, obtained by statistically analyzing the installation package feature vectors F of samples of the same malware family; the labeling function $\beta: f \in F^c \to [0,1]$ labels the probability with which each feature in $F^c$ occurs in the samples of the malware family; $P^c$ is the list of permissions frequently requested by the samples of the malware family, obtained by statistically analyzing the permission lists P of samples of the same malware family; the labeling function $\gamma: p \in P^c \to [0,1]$ labels the probability with which each permission in $P^c$ occurs in the samples of the malware family.
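One simple way to obtain such occurrence probabilities is the relative frequency of each item over the samples of a family. The sketch below assumes exactly that; the text only says the probabilities come from statistical analysis, so the estimator, the function name `occurrence_probabilities`, and the sample data are illustrative assumptions. The same function can serve for $\gamma$ (permissions) and for $\beta$ (if each sample is represented by the set of feature names whose value is 1).

```python
from collections import Counter
from typing import Dict, List, Set


def occurrence_probabilities(samples: List[Set[str]]) -> Dict[str, float]:
    """Fraction of family samples in which each item occurs."""
    if not samples:
        return {}
    counts = Counter(item for sample in samples for item in sample)
    total = len(samples)
    return {item: count / total for item, count in counts.items()}


# Hypothetical permission lists of four samples from one family.
family_permissions = [
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS", "INTERNET", "READ_CONTACTS"},
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS"},
]
gamma = occurrence_probabilities(family_permissions)
print(gamma["SEND_SMS"], gamma["INTERNET"], gamma["READ_CONTACTS"])
```

How the common sets $F^c$, $P^c$, and $SBS^c$ are selected from these frequencies (for example by a minimum frequency) is not stated in the text and is left open here.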

The similarity between the software to be tested and the malware family is represented as:

where $S_f$ is the similarity of the software feature vectors, $S_p$ is the similarity of the permission lists, $S_{sbs}$ is the similarity of the sensitive behavior sets, and $\mu_i$ is the weight of each similarity in the calculation;

The software feature vector similarity $S_f$ is calculated as follows: given the feature vector $F = \{f_1, f_2, f_3, \dots, f_m\}$ of the software to be tested, the feature vector $F^c = \{f_1^c, f_2^c, f_3^c, \dots, f_m^c\}$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\beta$, then:

the similarity is calculated from the occurrence probability of each feature, and if all values in the feature vector of the malicious family's multi-feature model are 0, the similarity is 0; the correction factor $\omega_f$ is calculated as the number of features $f_i$ in the vector $F$ that match $f_i^c$, divided by the number of features in the vector $F^c$ whose value is 1.

The permission list similarity $S_p$ of the software is calculated as follows: given the permission list $P$ of the software to be tested, the permission list $P^c = \{p_1^c, p_2^c, \dots, p_n^c\}$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\gamma$, then:

where the correction factor $\omega_p$ is calculated as the number of permissions in the set $P$ that belong to $P^c$, divided by the length of the set $P^c$; when an element $p_i^c$ of the permission list $P^c$ is contained in the permission list $P$ of the software to be tested, the corresponding term takes the value 1, and otherwise 0.
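The correction factor $\omega_p$ and the 0/1 membership term follow directly from the description and can be written out as below. The function names `omega_p` and `membership_terms`, the handling of an empty $P^c$, and the example permissions are assumptions of this sketch; the full expression for $S_p$ is not reproduced because the formula is not available in the text.

```python
from typing import List, Set


def omega_p(p: Set[str], p_c: List[str]) -> float:
    """Permissions of P that belong to P^c, divided by the length of P^c."""
    if not p_c:
        return 0.0  # assumption: an empty family list yields no match
    return sum(1 for perm in p_c if perm in p) / len(p_c)


def membership_terms(p: Set[str], p_c: List[str]) -> List[int]:
    """1 where p_i^c is requested by the software under test, else 0."""
    return [1 if perm in p else 0 for perm in p_c]


# Hypothetical permission data.
app_permissions = {"SEND_SMS", "INTERNET", "CAMERA"}
family_permissions = ["SEND_SMS", "INTERNET", "READ_SMS", "READ_CONTACTS"]

print(omega_p(app_permissions, family_permissions))
print(membership_terms(app_permissions, family_permissions))
```

Permissions requested by the application but absent from $P^c$, such as `CAMERA` here, affect neither value.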

The sensitive behavior set similarity $S_{sbs}$ is calculated as follows: given the sensitive behavior set $SBS$ of the software, the sensitive behavior set $SBS^c$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\alpha$, then:

where the correction factor $\omega_{sbs}$ is calculated as the number of sets $S_i^c$ that are matched in $SBS$, divided by the length of the set $SBS^c$; the matching function indicates that there exists a set $S$ in $SBS$ such that the proportion of elements common to $S$ and $S_i^c$ among all elements of the two sets is greater than $\theta$ ($0 < \theta \le 1$).
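The set-matching test and the correction factor $\omega_{sbs}$ can be sketched as follows. This sketch reads "the proportion of similar elements to all elements in the two sets" as intersection size over union size (the Jaccard index), which is an interpretation rather than something the text states; the names `matches` and `omega_sbs` and the example sets are hypothetical, and the default $\theta = 0.8$ is the value used in the embodiment.

```python
from typing import List, Set


def matches(sbs: List[Set[str]], s_c: Set[str], theta: float = 0.8) -> bool:
    """True if some S in SBS overlaps S_i^c by more than theta."""
    for s in sbs:
        union = s | s_c
        if union and len(s & s_c) / len(union) > theta:
            return True
    return False


def omega_sbs(sbs: List[Set[str]], sbs_c: List[Set[str]],
              theta: float = 0.8) -> float:
    """Matched sets of SBS^c divided by the length of SBS^c."""
    if not sbs_c:
        return 0.0  # assumption: an empty family set yields no match
    matched = sum(1 for s_c in sbs_c if matches(sbs, s_c, theta))
    return matched / len(sbs_c)


# Hypothetical sensitive behavior sets.
app_sbs = [{"a", "b", "c", "d", "e"}, {"x"}]
family_sbs = [{"a", "b", "c", "d", "e"}, {"x", "y"}]

print(omega_sbs(app_sbs, family_sbs))        # only the first set matches
print(omega_sbs(app_sbs, family_sbs, 0.4))   # a looser theta matches both
```

Raising $\theta$ demands closer agreement between a family's sensitive method set and some set of the software under test before the pair counts as matched.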

Therefore, the characteristics of the malicious software family are extracted by adopting a statistical analysis method, the malicious software family characteristic library is constructed, the malicious software multi-label detection method is provided based on the characteristic library, and the high malicious judgment precision and the malicious family classification precision can be achieved.

In order to make those skilled in the art better understand the technical problems, technical solutions and technical effects of the present invention, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.

Examples

A multi-feature detection method for Android-oriented mobile network terminal malicious software uses a Drebin data set and non-malicious software samples obtained from Google Play to form a data set, and the malicious code detection and family classification specifically comprise the following steps:

step 1: dividing samples in Drebin according to a malicious family to which the samples belong, acquiring non-malicious software on Google Play by using a web crawler method, and verifying by using VirusTotal on-line detection service, thereby constructing a sample data set comprising 4486 malicious software samples of 24 malicious software families and 2140 benign software samples;

step 2: decompressing the software installation package to be analyzed with a Zip decompression tool and extracting the installation package features of the software, including: whether a .so file exists, whether a file for rooting the system exists, whether an abnormal file exists, and whether a subprogram exists, thereby constructing the installation package feature vector F; to judge whether a file for rooting the system exists, the MD5 values of known root extension library files are compared with those of the files in the software installation package; to judge whether an abnormal file exists, the file content is analyzed with the Apache Tika tool to obtain the file type, which is then compared with the file suffix; to judge whether a subprogram exists, the package is checked for jar, dex, and apk files;

step 3: processing the Android software sample with APKParser, parsing the AndroidManifest.xml file, and extracting the permission list P requested by the software according to the tag fields in the XML;

step 4: decompiling the installation package with the Soot tool, constructing the software function call graph, locating the security-sensitive methods in it, constructing the sensitive behavior graph SBG of the software, then obtaining the context information of the security-sensitive methods by data flow analysis, the directly or indirectly called security-sensitive methods forming the sensitive behavior set SBS of the software;

The security-sensitive methods of interest include: permission-protected methods, information-flow Source/Sink methods, and other suspicious methods; a permission-protected method is an API that can be used in the Android system only after the corresponding permission has been requested; an information-flow Source/Sink method is a method that may produce or send sensitive information; and the other suspicious methods include dynamic loading functions, reflection functions, encryption and decryption functions, and Native code execution and calling functions.

The constructed sensitive behavior call graph is the following four-tuple:

$SBG = (V^D, V^N, E, \mu)$

where $V^D$ is a subset of the vertex set of the software sensitive behavior call graph, any node $v_d \in V^D$ of which is a security-sensitive method; $V^N$ is a subset of the vertex set of the software function call graph, any node $v_n \in V^N$ of which is a non-security-sensitive method that directly or indirectly calls a security-sensitive method; $E \subseteq V^N \times V^D$ is the edge set of the sensitive behavior call graph and indicates that a calling relationship exists between methods, where any edge $e = (v_n, v_d) \in E$ indicates that a non-security-sensitive method $v_n \in V^N$ in the software directly or indirectly calls a security-sensitive method $v_d \in V^D$, or that a method $v_n$ of component $C_s$ directly or indirectly triggers a method $v_d$ of component $C_t$ through ICC; the labeling function $\mu: V_d \to \langle ID, EntryType, Para \rangle$ labels the content contained in a vertex of the graph, including the method ID, the entry point type EntryType, and the parameter Para.

The sensitive behavior set is the set shown below:

$SBS = \{S_1, S_2, \dots, S_m\}$

where $S_i = \{v \mid (v_i, v) \in E \wedge v_i \in V^N \wedge v \in V^D\}$ is a set of security-sensitive methods, namely the set of all security-sensitive methods that are directly or indirectly called by the $i$-th non-security-sensitive method of the set $V^N$ in the sensitive behavior call graph $SBG = (V^D, V^N, E, \mu)$; $m = |V^N|$ is the length of the set SBS;

step 5: selecting 75 percent (3341 samples) of the samples of the 24 malware families as the samples for feature extraction and constructing the malware family feature library; statistical analysis is performed on the software features of samples belonging to the same malware family in the malicious samples to obtain the occurrence probability of each feature component, and the Android malware family multi-feature model M is constructed, thereby constructing the malware family feature library;

the constructed Android malware family multi-feature model is the following six-tuple:

$M = (SBS^c, \alpha, F^c, \beta, P^c, \gamma)$

where $SBS^c$ is the sensitive behavior set common to a malware family, obtained by statistically analyzing the sensitive behavior sets SBS of samples of the same malware family; the labeling function $\alpha: S_i^c \in SBS^c \to [0,1]$ labels the probability with which each sensitive method set in $SBS^c$ occurs in the samples of the malware family; $F^c$ is the set of software installation package features common to the samples of the malware family, obtained by statistically analyzing the installation package features F of samples of the same malware family; the labeling function $\beta: f \in F^c \to [0,1]$ labels the probability with which each feature in $F^c$ occurs in the samples of the malware family; $P^c$ is the list of permissions frequently requested by the samples of the malware family, obtained by statistically analyzing the permission lists P of samples of the same malware family; the labeling function $\gamma: p \in P^c \to [0,1]$ labels the probability with which each permission in $P^c$ occurs in the samples of the malware family;

step 6, extracting the features of the software to be tested by using the methods in the steps 2-4, performing feature matching on the features of the software to be tested and a malicious software family feature library to obtain a malicious software family name with the highest similarity, outputting the software as malicious software if the similarity exceeds 0.7, and outputting the malicious software family to which the software belongs, otherwise, outputting the software as benign software;

the similarity between the software to be tested and the malware family is expressed as:

where $S_f$ is the similarity of the feature vectors, $S_p$ is the similarity of the permission lists, $S_{sbs}$ is the similarity of the sensitive behavior sets, and $\mu_i$ is the weight of each similarity in the calculation; in the experiment, the three weight values are taken as

The similarity of the software feature vectors is calculated as follows: given the feature vector $F = \{f_1, f_2, f_3, \dots, f_m\}$ of the software to be tested, the feature vector $F^c = \{f_1^c, f_2^c, f_3^c, \dots, f_m^c\}$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\beta$, the similarity is calculated as follows:

The similarity is calculated from the occurrence probability of each feature, and if all values of the feature vector in the multi-feature model of the malicious family are 0, the similarity is 0. The correction factor $\omega_f$ is calculated as the number of features $f_i$ in the vector $F$ that match $f_i^c$, divided by the number of features in the vector $F^c$ whose value is 1.

The similarity of the software permission lists is calculated as follows: given the permission list $P$ of the software to be tested, the permission list $P^c = \{p_1^c, p_2^c, \dots, p_n^c\}$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\gamma$, the similarity is calculated as follows:

where the correction factor $\omega_p$ is calculated as the number of permissions in the set $P$ that belong to $P^c$, divided by the length of the set $P^c$.

The similarity of the sensitive behavior sets is calculated as follows: given the sensitive behavior set $SBS$ of the software, the sensitive behavior set $SBS^c$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\alpha$, the similarity is calculated as shown in the following formula:

To prevent malware families with more features from overshadowing families with fewer features, a correction factor $\omega_{sbs}$ is introduced; it is calculated as the number of sets $S_i^c$ that are matched in $SBS$, divided by the length of the set $SBS^c$. The matching function indicates that there exists a set $S$ in $SBS$ such that the proportion of elements common to $S$ and $S_i^c$ among all elements of the two sets is greater than 80%.

The remaining 25% (1145) of the malware samples and the 2140 benign software samples were tested using the above method; FIG. 2 compares the resulting maliciousness judgments and malicious family classifications with the results of 8 antivirus engines commonly found in VirusTotal.

Therefore, the method selects the software package characteristics, the authority characteristics and the software sensitive behavior calling characteristics as the basis for judging the malicious software, can improve the accuracy of detecting the malicious behaviors of the software, and has the capability of classifying malicious software families.

Claims

1. A mobile network terminal malicious software multi-feature detection method for Android is characterized by comprising the following steps:

step 4, decompiling the installation package, constructing a software function call graph, positioning a security sensitive method therein, constructing a sensitive behavior graph SBG of the software, then obtaining context information of the security sensitive method by adopting a data flow analysis method, and forming a sensitive behavior set SBS of the software by the directly or indirectly called security sensitive method; the constructed software function call graph is the following four-tuple:

SBG＝(V^D,V^N,E,μ)

where $V^D$ is a subset of the vertex set of the software sensitive behavior call graph, any node $v_d \in V^D$ of which is a security-sensitive method; $V^N$ is a subset of the vertex set of the software sensitive behavior call graph, any node $v_n \in V^N$ of which is a non-security-sensitive method that directly or indirectly calls a security-sensitive method; $E \subseteq V^N \times V^D$ is the edge set of the software sensitive behavior call graph and indicates that a calling relationship exists between methods, where any edge $e = (v_n, v_d) \in E$ indicates that a non-security-sensitive method $v_n \in V^N$ in the software directly or indirectly calls a security-sensitive method $v_d \in V^D$, or that a method $v_n$ of component $C_s$ directly or indirectly triggers a method $v_d$ of component $C_t$ through ICC; the labeling function $\mu: V_d \to \langle ID, EntryType, Para \rangle$ labels the content contained in a node of the graph, namely the context information of the methods in $V^D$ and $V^N$, including the method ID, the entry point type EntryType, and the parameter Para;

the sensitive behavior set is the set shown below:

$SBS = \{S_1, \dots, S_i, \dots, S_m\}$

where $S_i = \{v \mid (v_i, v) \in E \wedge v_i \in V^N \wedge v \in V^D\}$ is a set of security-sensitive methods, namely the set of all security-sensitive methods that are directly or indirectly called by the $i$-th non-security-sensitive method of the set $V^N$ in the sensitive behavior call graph $SBG = (V^D, V^N, E, \mu)$; $m = |V^N|$ is the length of the set SBS;

step 6, extracting the features of the software to be tested using the methods of steps 2-4, matching these features against the malware family feature library to obtain the name of the malware family with the highest similarity, and, if the similarity exceeds a threshold value, outputting the software to be tested as malware together with the malware family to which it belongs, otherwise outputting the software to be tested as benign software.

2. The Android-oriented mobile network terminal malware multi-feature detection method as claimed in claim 1, wherein the abnormal file in step 2 refers to a file whose suffix does not match the type indicated by the file content itself; whether a file for rooting the system exists is judged by checking, via its MD5 value, whether a library file is a root extension library file; and whether a subprogram exists is judged by checking whether the installation package contains jar, dex, or apk files.

3. The Android-oriented mobile network terminal malware multi-feature detection method of claim 1, wherein the security-sensitive methods in step 4 comprise: permission-protected methods, information-flow Source/Sink methods, and other suspicious methods; a permission-protected method is an API that can be used in the Android system only after the corresponding permission has been requested; an information-flow Source/Sink method is a method that may produce or send sensitive information; and the other suspicious methods include dynamic loading functions, reflection functions, encryption and decryption functions, and Native code execution and calling functions.

4. The Android-oriented mobile network terminal malware multi-feature detection method of claim 1, wherein the Android malware family multi-feature model constructed in step 5 is the following six-tuple:

M＝(SBS^c,α,F^c,β,P^c,γ)

where $SBS^c$ is the sensitive behavior set common to a malware family, obtained by statistically analyzing the sensitive behavior sets SBS of samples of the same malware family; the labeling function

5. The Android-oriented mobile network terminal malware multi-feature detection method as claimed in claim 1, wherein the similarity between the software to be detected and a malware family in step 6 is represented as:

where $S_f$ is the similarity of the software feature vectors, $S_p$ is the similarity of the permission lists, $S_{sbs}$ is the similarity of the sensitive behavior sets, and $\mu_i$ is the weight of each similarity in the calculation;

the software feature vector similarity $S_f$ is calculated as follows: given the feature vector $F = \{f_1, f_2, f_3, \dots, f_m\}$ of the software to be tested, the feature vector $F^c = \{f_1^c, f_2^c, f_3^c, \dots, f_m^c\}$ in the multi-feature model of the malware family to be matched, and the corresponding labeling function $\beta$, then: