CN112149124B - Android malicious program detection method and system based on heterogeneous information network

Info

Publication number: CN112149124B
Application number: CN202011206884.8A
Authority: CN
Inventors: 牛伟纳; 张洪彬; 张小松; 鲁启扬; 朱航
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-11-02
Filing date: 2020-11-02
Publication date: 2022-04-29
Anticipated expiration: 2040-11-02
Also published as: CN112149124A

Abstract

The invention discloses a method and a system for detecting Android malicious programs based on a heterogeneous information network, belonging to the field of software security detection. In the main scheme, a preprocessing module decompiles all APK files to obtain smali files and selects the APIs to be used; a relation matrix construction module constructs four relation matrices from the smali files; the entities and the different relations in the relation matrices form a heterogeneous information network, in which a meta-path construction module generates different feature matrices according to different meta-paths; different combinations of the feature matrices are used to train multi-kernel learning models; and the trained models are tested, with the best-performing model selected as the final Android malicious program detection classifier. The invention analyzes the different relationships among APIs, incorporates more detailed API information into the meta-paths, and uses multi-kernel learning to aggregate the high-level semantics of the different meta-paths, so that Android malicious programs are detected and identified effectively.

Description

Android malicious program detection method and system based on heterogeneous information network

Technical Field

The invention belongs to the field of software security detection, and discloses a method and a system for detecting android malicious programs based on a heterogeneous information network.

Background

Today, mobile devices such as smartphones are widely used in daily life. Owing to the popularity of Android devices and the openness of the Android OS, the amount of Android malware is increasing rapidly. A mobile device infected with malware may leak important private information, such as a user's account and password. Malware that wastes users' time or steals money by fraud also causes economic loss for users. There is therefore an urgent need to detect and defend against Android malware effectively.

The mainstream approach to Android malware detection today identifies and classifies malicious applications with machine learning, taking different kinds of features (mainly API call features, assembly code features and binary code features) as input. Current detection methods fall into two broad categories: signature-based methods and behavior-based methods.

Signature-based methods classify malicious Android applications according to a unique digital signature of a known malware type. Such signatures are generated by statically analyzing known malware samples to examine their code structure, including assembly code and binary-level features. However, signature-based methods are generally unable to detect unknown Android malware or more complex malware, such as polymorphic malware.

Behavior-based methods execute a given Android sample in a sandbox, record its runtime behavior, and extract the relevant API call features and assembly code features from that record. This approach is called dynamic analysis. However, dynamic analysis inherently incurs high runtime overhead and may miss hidden malicious behavior that is not triggered during monitoring.

The performance of machine-learning-based Android malware detection depends on how well the extracted features capture the differences between the various types of malicious Android applications and benign ones. The features used by existing detection methods are usually too simple and ignore the rich relationships among functions, which limits detection accuracy.

Disclosure of Invention

In view of the above problems in the prior art, the object of the present invention is to provide a method and a system for detecting Android malicious programs based on a heterogeneous information network. The method analyzes the different relationships between APIs, incorporates more detailed API information into meta-paths, and uses multi-kernel learning to aggregate the high-level semantics of different meta-paths, so that Android malicious programs can be detected and identified effectively.

To achieve this object, the invention adopts the following technical scheme:

A method for detecting Android malicious programs based on a heterogeneous information network comprises the following steps:

S1: a preprocessing module decompiles all APK files to obtain smali files, counts the number of times each API is called across all the smali files, and selects the N most frequently called APIs;

S2: a relation matrix construction module extracts the API calls from all the smali files and constructs four relation matrices according to the different API relations;

S3: the entities and the different relations in the relation matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths;

S4: different combinations of the feature matrices are used to train multi-kernel learning models;

S5: the multi-kernel learning models trained on the different feature matrix combinations are tested with a test data set, and a model whose accuracy reaches a threshold is selected as the final Android malicious program detection classifier.

In the above technical solution, step S1 is implemented by the following specific steps:

S1-1: decompress the APK files of all the collected Android samples to obtain dex files;

S1-2: decompile the dex files to obtain smali code files;

S1-3: count the number of times each API is called across all the smali files, and select the N most frequently called APIs.
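Step S1-3 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the smali files have already been read into strings, and the names `INVOKE_RE`, `count_api_calls` and `select_top_apis` are hypothetical. The patent later restricts the selection to sensitive APIs but does not give the list, so no such filter is applied here.

```python
import re
from collections import Counter

# Matches smali call instructions such as
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
# and captures "Lclass;->method" as the API identifier (an assumed convention).
INVOKE_RE = re.compile(r"invoke-[a-z/\-]+\s+\{[^}]*\},\s+(L[^;]+;->[\w$<>]+)")

def count_api_calls(smali_texts):
    """Count how often each API is invoked across all smali sources."""
    counts = Counter()
    for text in smali_texts:
        counts.update(INVOKE_RE.findall(text))
    return counts

def select_top_apis(smali_texts, n):
    """Return the n most frequently called APIs."""
    return [api for api, _ in count_api_calls(smali_texts).most_common(n)]

# Toy input: two smali fragments (illustrative only).
sample = [
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;\n"
    "invoke-static {v1, v2}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I\n",
    "invoke-virtual {v2}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;\n",
]
top_apis = select_top_apis(sample, 1)
```

In practice a filter for sensitive APIs would be applied to the counter before taking the top N.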

In the above technical solution, the four relationship matrices in step S2 include:

A: the relationship matrix A represents the relationship between APPs and APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains $\mathrm{API}_j$: if $a_{ij} = 1$, $\mathrm{APP}_i$ contains $\mathrm{API}_j$; otherwise it does not. This relationship is denoted R0;

B: the relationship matrix B represents a relationship between APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ are in the same Block: if $b_{ij} = 1$, they are in the same Block; otherwise they are not. This relationship is denoted R1;

P: the relationship matrix P represents a relationship between APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ belong to the same Package: if $p_{ij} = 1$, they belong to the same Package; otherwise they do not. This relationship is denoted R2;

I: the relationship matrix I represents a relationship between APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ use the same Invoke-method: if $i_{ij} = 1$, they use the same Invoke-method; otherwise they do not. This relationship is denoted R3.

Here $\mathrm{APP}_i$ denotes the i-th Android sample application and $\mathrm{API}_i$ denotes the i-th selected sensitive API.

In the above technical solution, a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.
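The API-to-API matrices B, P and I can be sketched from raw smali text as below. Several points are assumptions made for illustration: an API is identified as "Lclass;->method", its Package is taken to be the class path without the class name, and its Invoke-method is taken to be the invoke opcode (such as invoke-virtual or invoke-static). The function and variable names are hypothetical.

```python
import re

# Captures the invoke opcode and the called API ("Lclass;->method").
INVOKE_RE = re.compile(r"(invoke-[a-z]+)(?:/range)?\s+\{[^}]*\},\s+(L[^;]+;->[\w$<>]+)")
# A Block: everything from a ".method" line to the next ".end method" line.
METHOD_RE = re.compile(r"^\.method\b.*?^\.end method", re.S | re.M)

def package_of(api):
    """'Landroid/telephony/SmsManager;->send' -> 'android/telephony' (assumed notion of Package)."""
    return api.split(";->")[0][1:].rsplit("/", 1)[0]

def build_api_relations(smali_texts, apis):
    index = {api: k for k, api in enumerate(apis)}
    size = len(apis)
    B = [[0] * size for _ in range(size)]
    P = [[0] * size for _ in range(size)]
    I = [[0] * size for _ in range(size)]
    opcodes = {api: set() for api in apis}
    for text in smali_texts:
        for block in METHOD_RE.findall(text):
            found = set()
            for opcode, api in INVOKE_RE.findall(block):
                if api in index:
                    found.add(api)
                    opcodes[api].add(opcode)
            for a in found:          # R1: co-occurrence in the same Block
                for b in found:
                    B[index[a]][index[b]] = 1
    for a in apis:
        for b in apis:
            if package_of(a) == package_of(b):   # R2: same Package
                P[index[a]][index[b]] = 1
            if opcodes[a] & opcodes[b]:          # R3: a shared invoke opcode
                I[index[a]][index[b]] = 1
    return B, P, I

smali = """.method public a()V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    invoke-virtual {v1, v2}, Landroid/telephony/SmsManager;->sendTextMessage(Ljava/lang/String;)V
.end method
.method public b()V
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
.end method
"""
apis = [
    "Landroid/telephony/TelephonyManager;->getDeviceId",
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Ljava/lang/Runtime;->getRuntime",
]
B, P, I = build_api_relations([smali], apis)
```

In this toy input the first two APIs share a Block, a Package and an opcode, while the third shares none of these with them.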

In the above technical solution, step S2 is implemented by the following specific steps:

S2-1: select a part of the Android software samples as control samples, and construct the relation matrix A0 between the APPs and the APIs of the control samples;

S2-2: select a part of the Android software samples as training samples, and construct the relation matrix A1 between the APPs and the APIs of the training samples;

S2-3: select a part of the Android software samples as test samples, and construct the relation matrix A2 between the APPs and the APIs of the test samples;

S2-4: construct the relation matrices B, P and I using all the Android software samples.
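Steps S2-1 to S2-3 can be illustrated with a small sketch. It assumes each APP has already been reduced to the set of selected APIs it calls; the helper name `build_app_api_matrix`, the shortened API names and the toy sample sets are all hypothetical.

```python
def build_app_api_matrix(app_api_sets, apis):
    """Row i, column j is 1 when APP_i contains API_j (relation R0), else 0."""
    return [[1 if api in used else 0 for api in apis] for used in app_api_sets]

# Selected APIs (shortened names for readability).
apis = ["getDeviceId", "sendTextMessage", "getRuntime"]

# Toy split of the samples into control, training and test groups.
control = [{"getDeviceId"}, {"sendTextMessage", "getRuntime"}]
train = [{"getDeviceId", "sendTextMessage"}]
test = [{"getRuntime"}]

A0 = build_app_api_matrix(control, apis)  # control samples
A1 = build_app_api_matrix(train, apis)    # training samples
A2 = build_app_api_matrix(test, apis)     # test samples
```

All three matrices share the same column order, which is what lets them be combined with B, P and I later.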

In the above technical solution, the heterogeneous information network composed of different entities and relationships in step S3 means:

There are two types of entities, namely Android application programs (APPs), represented by APK files, and API calls; there are four different relations, namely R0, R1, R2 and R3.

The two types of entities and the four relations between them jointly form a heterogeneous information network G = <V, E>, where V is the set of nodes, whose types are limited to the two entity types APP and API, and E is the set of connections among the nodes, whose types are limited to the four relations R0, R1, R2 and R3.
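One way to hold such a network in memory is a typed node table plus a set of typed edges. The sketch below is only an illustration of G = <V, E> with the type constraints stated above; the class name `HeterogeneousNetwork` and its methods are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field

NODE_TYPES = ("APP", "API")
# Each relation constrains the types of its two endpoints.
RELATIONS = {"R0": ("APP", "API"), "R1": ("API", "API"),
             "R2": ("API", "API"), "R3": ("API", "API")}

@dataclass
class HeterogeneousNetwork:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: set = field(default_factory=set)    # (source, relation, target)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        endpoint_types = (self.nodes[source], self.nodes[target])
        if RELATIONS.get(relation) != endpoint_types:
            raise ValueError(f"{relation} cannot link {endpoint_types}")
        self.edges.add((source, relation, target))

    def neighbors(self, node_id, relation):
        """Targets reachable from node_id over one edge of the given relation."""
        return sorted(t for s, r, t in self.edges if s == node_id and r == relation)

g = HeterogeneousNetwork()
g.add_node("app_0", "APP")
g.add_node("getDeviceId", "API")
g.add_node("sendTextMessage", "API")
g.add_edge("app_0", "R0", "getDeviceId")
g.add_edge("getDeviceId", "R1", "sendTextMessage")
```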

In the above technical solution, the method for generating different feature matrices according to different meta-paths in the heterogeneous information network in step S3 is as follows:

The Block meta-path (APP to API via R0, API to API via R1, API back to APP via R0) means that an API in one APP and an API in another APP appear in the same Block. $M_B$ denotes the feature matrix generated from this meta-path;

for the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$, and for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples appear in the same Block: 1 means that they do, and 0 means that they do not.

The Package meta-path (APP to API via R0, API to API via R2, API back to APP via R0) means that an API in one APP and an API in another APP belong to the same Package. $M_P$ denotes the feature matrix generated from this meta-path;

for the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$, and for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples belong to the same Package: 1 means that they do, and 0 means that they do not.

The Invoke-method meta-path (APP to API via R0, API to API via R3, API back to APP via R0) means that an API in one APP and an API in another APP use the same Invoke-method. $M_I$ denotes the feature matrix generated from this meta-path;

for the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$, and for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples use the same Invoke-method: 1 means that they do, and 0 means that they do not.

In the expressions for $M_B$, $M_P$ and $M_I$ above, N denotes the number of sensitive APIs, and the two subscripts of the relationship matrices A0, A1, A2, B, P and I denote the row and the column of the matrix, respectively.
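The index formula for the feature matrices can be checked with a direct, unoptimized sketch. It assumes 0-based indices and list-of-lists matrices; the function name `meta_path_features` and the toy inputs are hypothetical. Passing B, P or I as the relation matrix yields $M_B$, $M_P$ or $M_I$ respectively.

```python
def meta_path_features(A_ctrl, A_samples, R):
    """Build M with M[N*n + j][N*m + i] = 1 exactly when
    A_ctrl[m][i] == 1, A_samples[n][j] == 1 and R[i][j] == 1."""
    N = len(R)                                   # number of selected APIs
    rows, cols = len(A_samples) * N, len(A_ctrl) * N
    M = [[0] * cols for _ in range(rows)]
    for n, sample_row in enumerate(A_samples):   # APP n of the training or test set
        for j in range(N):
            if not sample_row[j]:
                continue
            for m, ctrl_row in enumerate(A_ctrl):  # APP m of the control set
                for i in range(N):
                    if ctrl_row[i] and R[i][j]:
                        M[N * n + j][N * m + i] = 1
    return M

# Toy data: N = 2 APIs, two control APPs, one training APP.
A0 = [[1, 0], [1, 1]]      # control samples
A1 = [[0, 1]]              # training samples
B = [[1, 1], [1, 0]]       # API_0 and API_1 share a Block
M_B = meta_path_features(A0, A1, B)
```

Here the only non-zero row is the one for API 1 of the training APP, and it is linked to API 0 of each control APP because B[0][1] = 1.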

In the above technical solution, the different feature matrix combinations in step S4 are as follows:

There are seven feature matrix combinations in total: $M_B$ alone; $M_P$ alone; $M_I$ alone; $M_B$ with $M_P$; $M_B$ with $M_I$; $M_P$ with $M_I$; and all three of $M_B$, $M_P$ and $M_I$.
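The seven combinations are simply the non-empty subsets of the three feature matrices, which can be enumerated as follows (the variable names are illustrative):

```python
from itertools import combinations

names = ["M_B", "M_P", "M_I"]

# All non-empty subsets of the three feature matrices, smallest first.
combos = [c for size in range(1, len(names) + 1) for c in combinations(names, size)]
```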

In the above technical solution, step S5 is implemented by the following specific steps:

S5-1: the three feature matrices $M_B$, $M_P$ and $M_I$ constructed in step S3 from the test samples and the control samples are combined in the different ways described above, each combination is fed to the corresponding multi-kernel learning model built in step S4, and the accuracy of that model is measured;

S5-2: the accuracies of the multi-kernel learning models obtained in step S5-1 are compared, and a model whose accuracy meets the threshold is selected as the final Android malicious program detection classifier.
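The selection in step S5-2 can be sketched as below. The sketch assumes each trained model has already produced predictions on the test set and that, among the models meeting the threshold, the most accurate one is kept; the function names, labels, predictions and threshold are toy values chosen for illustration, not results from the patent.

```python
def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def select_classifier(predictions_by_combo, actual, threshold):
    """Return the best combination whose accuracy meets the threshold, plus all scores."""
    scores = {combo: accuracy(pred, actual) for combo, pred in predictions_by_combo.items()}
    qualifying = {combo: score for combo, score in scores.items() if score >= threshold}
    best = max(qualifying, key=qualifying.get) if qualifying else None
    return best, scores

# Toy labels (1 = malicious, 0 = benign) and toy predictions per combination.
actual = [1, 1, 0, 0]
predictions = {
    ("M_B",): [1, 0, 0, 0],
    ("M_B", "M_P"): [1, 1, 0, 1],
    ("M_B", "M_P", "M_I"): [1, 1, 0, 0],
}
best, scores = select_classifier(predictions, actual, threshold=0.9)
```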

A system for detecting Android malicious programs based on a heterogeneous information network comprises:

a preprocessing module: mainly responsible for decompiling the APK files to obtain smali files and for collecting, from all the smali files, the API calls to be used;

a relationship matrix construction module: extracts API calls from the smali files of all Android samples and constructs four relationship matrices according to the different API relationships;

a meta-path construction module: the four relationship matrices contain two types of entities and four relations between them, which together form a heterogeneous information network; feature matrices are constructed for each of the three different meta-paths in that network;

a multi-kernel learning modeling module: uses the different feature matrix combinations as inputs and trains one multi-kernel learning model per combination;

a classifier: tests, with a test data set, the multi-kernel learning models trained on the different feature matrix combinations, and selects a model whose accuracy reaches a threshold as the final Android malicious program detection classifier.
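The text does not spell out the form of the multi-kernel learning model. A common formulation, given here as an assumption rather than as the inventors' method, builds one kernel $K_k$ per feature matrix in the chosen combination $S$ and learns a convex combination of them together with the classifier; the weights $\beta_k$ and the set $S$ are notation introduced for this sketch.

```latex
K(x, x') = \sum_{k \in S} \beta_k \, K_k(x, x'),
\qquad \beta_k \ge 0,
\qquad \sum_{k \in S} \beta_k = 1,
\qquad \emptyset \neq S \subseteq \{B, P, I\}
```

Under this reading, each of the seven feature matrix combinations corresponds to one choice of $S$, and a combination with a single matrix reduces to an ordinary single-kernel classifier.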

Compared with the prior art, the invention has the beneficial effects that:

First, the APIs used in the present invention are sensitive APIs. Compared with ordinary APIs, sensitive APIs are used more often by malware to perform sensitive or malicious operations. Sensitive APIs are therefore more representative of the malicious behavior of software, and they tend to appear less often in benign Android applications. Because of this property, the present invention can characterize the behavior of malware with a smaller number of APIs;

Second, the meta-path information used by the invention describes the API relationships among different APPs at a finer granularity. A traditional meta-path can only show how many APIs of different APPs appear in the same Block, whereas the feature matrix constructed from the meta-path in the invention records exactly which APIs of different APPs appear in the same Block. Likewise, it records which APIs of different APPs belong to the same Package and which use the same Invoke-method. This fine-grained feature information describes the API relationships among different APPs in detail, so that the learning model can capture more of the differences between Android malware and benign software and among different Android malware families, which improves the accuracy of the Android malicious program detection classifier;

Third, the method is more resistant to code obfuscation.

Drawings

FIG. 1 is a block diagram of the system architecture of the present invention;

FIG. 2 is a schematic flow chart of the method for obtaining the Android malicious program detection classification model.

Detailed Description

The invention is further illustrated by the following specific examples.

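The invention provides a method for detecting Android malicious programs based on a heterogeneous information network, comprising the following steps:

S1: a preprocessing module decompiles all APK files to obtain smali files, counts the number of times each API is called across all the smali files, and selects the N most frequently called sensitive APIs;

S2: a relation matrix construction module extracts the sensitive API calls from all the smali files and constructs four relation matrices according to the different relations between sensitive APIs;

S3: the entities and the different relations in the relation matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths;

S4: different combinations of the feature matrices are used to train multi-kernel learning models;

S5: the multi-kernel learning models trained on the different feature matrix combinations are tested with a test data set, and a model whose accuracy reaches a threshold is selected as the final Android malicious program detection classifier.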

In the present invention, step S1 is implemented by the following specific steps:

S1-1: decompress the APK files of all the collected Android samples to obtain dex files;

S1-2: decompile the dex files to obtain smali code files;

S1-3: count the number of times each API is called across all the smali files, and select the N most frequently called sensitive APIs. A relatively small number of sensitive APIs can be used here, because the length of the feature vector grows with the square of the number of sensitive APIs, and because sensitive APIs represent the behavior of Android malware better than ordinary APIs.

In the present invention, the four relationship matrices in step S2 include:

A: the relationship matrix A represents the relationship between APPs and sensitive APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains sensitive $\mathrm{API}_j$: if $a_{ij} = 1$, $\mathrm{APP}_i$ contains sensitive $\mathrm{API}_j$; otherwise it does not. This relationship is denoted R0;

B: the relationship matrix B represents a relationship between sensitive APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ are in the same Block: if $b_{ij} = 1$, they are in the same Block; otherwise they are not. This relationship is denoted R1;

P: the relationship matrix P represents a relationship between sensitive APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ belong to the same Package: if $p_{ij} = 1$, they belong to the same Package; otherwise they do not. This relationship is denoted R2;

I: the relationship matrix I represents a relationship between sensitive APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ use the same Invoke-method: if $i_{ij} = 1$, they use the same Invoke-method; otherwise they do not. This relationship is denoted R3.

In the invention, a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.

In the present invention, so that the feature matrices constructed from the training set and from the test set have the same dimensions, and to ensure consistency of the data, a fixed set of Android software samples is additionally selected as control samples. Step S2 is therefore implemented by the following specific steps:

S2-1: select a part of the Android software samples as control samples, and construct the relation matrix A0 between the APPs and the sensitive APIs of the control samples;

S2-2: select a part of the Android software samples as training samples, and construct the relation matrix A1 between the APPs and the sensitive APIs of the training samples;

S2-3: select a part of the Android software samples as test samples, and construct the relation matrix A2 between the APPs and the sensitive APIs of the test samples;

S2-4: construct the relation matrices B, P and I using all the Android software samples.

In the present invention, the heterogeneous information network composed of different entities and relationships in step S3 refers to:

There are two types of entities, namely Android application programs (APPs), represented by APK files, and sensitive API calls; there are four different relations, namely R0, R1, R2 and R3.

The two types of entities and the four relations between them jointly form a heterogeneous information network G = <V, E>, where V is the set of nodes, whose types are limited to the two entity types APP and sensitive API, and E is the set of connections among the nodes, whose types are limited to the four relations R0, R1, R2 and R3.

In the present invention, the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:

The Block meta-path (APP to API via R0, API to API via R1, API back to APP via R0) means that a sensitive API in one APP and a sensitive API in another APP appear in the same Block. $M_B$ denotes the feature matrix generated from this meta-path;

for the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$, and for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training (or test) samples and the i-th sensitive API in the m-th APP of the control samples appear in the same Block: 1 means that they do, and 0 means that they do not.

The Package meta-path (APP to API via R0, API to API via R2, API back to APP via R0) means that a sensitive API in one APP and a sensitive API in another APP belong to the same Package. $M_P$ denotes the feature matrix generated from this meta-path;

for the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$, and for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training (or test) samples and the i-th sensitive API in the m-th APP of the control samples belong to the same Package: 1 means that they do, and 0 means that they do not.

The Invoke-method meta-path (APP to API via R0, API to API via R3, API back to APP via R0) means that a sensitive API in one APP and a sensitive API in another APP use the same Invoke-method. $M_I$ denotes the feature matrix generated from this meta-path;

for the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$, and for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training (or test) samples and the i-th sensitive API in the m-th APP of the control samples use the same Invoke-method: 1 means that they do, and 0 means that they do not.

In the expressions for $M_B$, $M_P$ and $M_I$ above, N denotes the number of sensitive APIs, and the two subscripts of the matrices A0, A1, A2, B, P and I denote the row and the column of the matrix, respectively.

The feature matrices $M_B$, $M_P$ and $M_I$ of both the training and the test data sets are constructed relative to the control samples, so the range of m is the same for the training and test feature matrices, which keeps the dimension of the feature matrices unchanged.
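The dimension argument can be made explicit. Writing $n_{\text{ctrl}}$, $n_{\text{train}}$ and $n_{\text{test}}$ for the numbers of control, training and test APPs (symbols introduced here for illustration, not used in the original), the index formula gives:

```latex
M_B^{\text{train}} \in \{0,1\}^{(N \cdot n_{\text{train}}) \times (N \cdot n_{\text{ctrl}})},
\qquad
M_B^{\text{test}} \in \{0,1\}^{(N \cdot n_{\text{test}}) \times (N \cdot n_{\text{ctrl}})}
```

Both matrices have $N \cdot n_{\text{ctrl}}$ columns because the column index $N \cdot m + i$ depends only on the control samples. For example, with $N = 3$ sensitive APIs and $n_{\text{ctrl}} = 2$ control APPs, every row has 6 entries whether it comes from the training set or the test set. The same holds for $M_P$ and $M_I$.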

In the present invention, the different feature matrix combinations in step S4 are the seven combinations listed above.

In the present invention, step S5 is implemented by steps S5-1 and S5-2 described above.

The present embodiments are to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description. All the technical solutions formed by the transformation or the equivalent substitution fall within the protection scope of the present invention.

Claims

1. A method for detecting android malicious programs based on a heterogeneous information network is characterized by comprising the following steps:

S5: testing, with a test data set, the multi-kernel learning models trained on the different feature matrix combinations, and selecting a model whose accuracy reaches a threshold as the final Android malicious program detection classifier;

the four relationship matrices in step S2 include:

A: the relationship matrix A represents the relationship between APPs and APIs; $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains $\mathrm{API}_j$; if $a_{ij} = 1$, $\mathrm{APP}_i$ contains $\mathrm{API}_j$, and otherwise it does not; this relationship is denoted R0;

B: the relationship matrix B represents a relationship between APIs; $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ are in the same Block; if $b_{ij} = 1$, they are in the same Block, and otherwise they are not; this relationship is denoted R1;

P: the relationship matrix P represents a relationship between APIs; $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ belong to the same Package; if $p_{ij} = 1$, they belong to the same Package, and otherwise they do not; this relationship is denoted R2;

I: the relationship matrix I represents a relationship between APIs; $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ use the same Invoke-method; if $i_{ij} = 1$, they use the same Invoke-method, and otherwise they do not; this relationship is denoted R3;

wherein $\mathrm{APP}_i$ denotes the i-th Android sample application and $\mathrm{API}_i$ denotes the i-th selected sensitive API;

the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:

the Block meta-path, for which:

for the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$, and for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$, each entry indicating whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples are in the same Block, 1 meaning that they are and 0 meaning that they are not;

the Package meta-path, which indicates that an API in one APP and an API in another APP belong to the same Package, with $M_P$ denoting the feature matrix generated from this meta-path;

for the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$, and for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$, each entry indicating whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples belong to the same Package, 1 meaning that they do and 0 meaning that they do not;

the Invoke-method meta-path, for which:

for the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$, and for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$, each entry indicating whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples use the same Invoke-method, 1 meaning that they do and 0 meaning that they do not;

2. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S1 is implemented by the following specific steps:

S1-2: decompiling the dex file to obtain a smali code file;

3. The method for detecting the android malicious program based on the heterogeneous information network of claim 2, wherein a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.

4. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S2 is implemented by the following specific steps:

5. The method for android malware detection based on heterogeneous information network of claim 3, wherein, in step S3, the heterogeneous information network formed by the entities and the different relations in the relation matrices is defined as follows:

there are two types of entities, namely Android application programs (APPs), represented by APK files, and API calls; there are four different relations, namely R0, R1, R2 and R3;

6. The method for android malware detection based on heterogeneous information network of claim 1, wherein the different feature matrix combinations of step S4 include:

7. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S5 is implemented by the following specific steps:

8. A system for android malicious program detection based on a heterogeneous information network is characterized by comprising,

a preprocessing module: mainly responsible for decompiling the APK files to obtain smali files and for collecting, from all the smali files, the API calls to be used;

The four relationship matrices include:

A: the relationship matrix A represents the relationship between APPs and APIs; $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains $\mathrm{API}_j$; if $a_{ij} = 1$, $\mathrm{APP}_i$ contains $\mathrm{API}_j$, and otherwise it does not; this relationship is denoted R0;

the method for generating different feature matrices in the heterogeneous information network according to different meta-paths comprises the following steps:

meta path

the Package meta-path, which indicates that an API in one APP and an API in another APP belong to the same Package, with $M_P$ denoting the feature matrix generated from this meta-path;

the Invoke-method meta-path, for which:

for the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$, and for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$, each entry indicating whether the j-th API in the n-th APP of the training (or test) samples and the i-th API in the m-th APP of the control samples use the same Invoke-method, 1 meaning that they do and 0 meaning that they do not;