CN112149124B - Android malicious program detection method and system based on heterogeneous information network - Google Patents

Info

Publication number
CN112149124B
Authority
CN
China
Prior art keywords
api
app
matrix
relationship
android
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011206884.8A
Other languages
Chinese (zh)
Other versions
CN112149124A (en)
Inventor
牛伟纳
张洪彬
张小松
鲁启扬
朱航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202011206884.8A
Publication of CN112149124A
Application granted
Publication of CN112149124B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a method and a system for detecting Android malicious programs based on a heterogeneous information network, belonging to the field of software security detection. In the main scheme, a preprocessing module decompiles all APK files to obtain smali files and selects the APIs to be used; a relationship matrix construction module constructs four relationship matrices from the smali files; the entities and the different relationships in the relationship matrices form a heterogeneous information network, in which a meta-path construction module generates different feature matrices according to different meta-paths; different combinations of the feature matrices are used to train multiple kernel learning models; and the trained multiple kernel learning models are tested, with the best-performing model selected as the final Android malicious program detection classifier. The invention analyzes different relationships among the APIs, encodes more detailed API information in the meta-paths, aggregates the high-level semantics of the different meta-paths using multiple kernel learning, and thereby effectively detects and identifies Android malicious programs.

Description

Android malicious program detection method and system based on heterogeneous information network
Technical Field
The invention belongs to the field of software security detection and discloses a method and a system for detecting Android malicious programs based on a heterogeneous information network.
Background
Today, mobile devices such as smartphones are widely used in daily life. Owing to the popularity of Android devices and the openness of the Android OS, the amount of Android malware is increasing rapidly. Infection of a mobile device with malware may leak important private information, such as a user's account and password. In addition, the rise of malware that wastes users' time or steals money through fraud causes economic loss to users. There is therefore an urgent need to detect and defend against Android malware effectively.
The currently mainstream Android malware detection methods identify and classify malicious applications by feeding different features (mainly API call features, assembly code features and binary code features) into machine learning techniques. Current Android malware detection methods fall into two broad categories, namely signature-based methods and behavior-based methods. The first category classifies malicious Android applications according to a unique digital signature of a known malware type. Such signatures are generated by statically analyzing known malware samples to examine their code structure, including assembly code and binary-level features. However, signature-based methods generally cannot detect unknown Android malware or more complex malware, such as polymorphic malware. The second category executes a given Android malware sample in a sandbox, obtains its runtime behavior information, and extracts the relevant API call features and assembly code features from that information. This behavior-based approach is called dynamic analysis. However, methods based on dynamic analysis inherently incur high runtime overhead and may miss hidden malicious behavior that is not triggered during monitoring. The performance of machine learning based Android malware detection methods depends on how well the extracted features capture the differences among the various types of malicious Android applications and between malicious and benign Android applications. The features used by existing Android malware detection methods are usually too simple and do not consider the rich relationships among the functions, so their detection accuracy is limited.
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide a method and a system for detecting Android malicious programs based on a heterogeneous information network. The method analyzes different relationships among APIs, encodes more detailed API information in the meta-paths, and aggregates the high-level semantics of the different meta-paths using multiple kernel learning, so that Android malicious programs can be detected and identified effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for detecting Android malicious programs based on a heterogeneous information network comprises the following steps:
S1: a preprocessing module decompiles all APK files to obtain smali files, counts the number of times each API is called across all the smali files, and selects the N most frequently called APIs;
S2: a relationship matrix construction module extracts API calls from all the smali files and constructs four relationship matrices according to the different API relationships;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, in which a meta-path construction module generates different feature matrices according to different meta-paths;
S4: different combinations of the feature matrices are used to train multiple kernel learning models;
S5: the multiple kernel learning models trained on the different feature matrix combinations are tested on a test data set, and a model whose accuracy reaches a threshold is selected as the final Android malicious program detection classifier.
In the above technical solution, the step S1 is implemented by the following specific steps:
S1-1: the APK files of all collected Android samples are decompressed to obtain dex files;
S1-2: the dex files are decompiled to obtain smali code files;
S1-3: the number of times each API is called is counted across all the smali files, and the N most frequently called APIs are selected.
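The counting in step S1-3 can be sketched as follows. This is a minimal illustration, assuming the smali files have already been produced by a decompiler and loaded as strings; the regular expression, the function names and the sample text are illustrative assumptions and are not taken from the patent.

```python
import re
from collections import Counter

# Matches a smali invoke instruction and captures the callee as "Lclass;->method".
# The pattern is an illustrative assumption, not part of the patent text.
INVOKE_RE = re.compile(
    r"^\s*invoke-[\w/]+\s+\{[^}]*\},\s*(L[^;]+;->[^(\s]+)\(", re.MULTILINE
)

def count_api_calls(smali_texts):
    """Count how often each API is invoked across all smali sources."""
    counts = Counter()
    for text in smali_texts:
        counts.update(INVOKE_RE.findall(text))
    return counts

def top_n_apis(smali_texts, n):
    """Return the n most frequently invoked APIs, most frequent first."""
    return [api for api, _ in count_api_calls(smali_texts).most_common(n)]

# Hypothetical smali fragment used only to exercise the functions.
SAMPLE = """\
.method public run()V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    invoke-static {v1, v2}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
.end method
"""

print(top_n_apis([SAMPLE], 1))
```

Restricting the ranking to a list of sensitive APIs, as the detailed description does, would only require filtering the counter by that list before calling `most_common`.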
In the above technical solution, the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $APP_i$ contains $API_j$: $a_{ij} = 1$ means that $APP_i$ contains $API_j$, and $a_{ij} = 0$ means that it does not. This relationship is denoted R0;
B: the relationship matrix B represents a relationship between APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not. This relationship is denoted R1;
P: the relationship matrix P represents a relationship between APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not. This relationship is denoted R2;
I: the relationship matrix I represents a relationship between APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not. This relationship is denoted R3.
Here, $APP_i$ denotes the i-th Android sample application and $API_i$ denotes the i-th selected sensitive API.
In the above technical solution, the Block refers to the smali code between a pair of ".method" and ".end method" directives in a smali file.
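As a sketch of how the Block relationship R1 could be extracted, the code below treats every `.method` to `.end method` span as one Block and sets $b_{ij} = 1$ whenever two selected APIs are invoked inside the same span. The regular expressions, the function name and the sample smali text are assumptions made for illustration.

```python
import re

# A Block is taken to be the text from a ".method" line to the next ".end method" line.
METHOD_RE = re.compile(r"^\.method\b.*?^\.end method", re.MULTILINE | re.DOTALL)
# Captures the callee of an invoke instruction as "Lclass;->method" (illustrative pattern).
CALLEE_RE = re.compile(r"invoke-[\w/]+\s+\{[^}]*\},\s*(L[^;]+;->[^(\s]+)\(")

def build_block_matrix(smali_texts, apis):
    """Return B where B[i][j] == 1 if apis[i] and apis[j] are called in the same Block."""
    index = {api: k for k, api in enumerate(apis)}
    B = [[0] * len(apis) for _ in apis]
    for text in smali_texts:
        for block in METHOD_RE.findall(text):
            present = {index[c] for c in CALLEE_RE.findall(block) if c in index}
            for i in present:
                for j in present:
                    B[i][j] = 1
    return B

# Hypothetical smali text with two Blocks.
SMALI = """\
.method public a()V
    invoke-static {v0}, Lcom/x/Net;->open(I)V
    invoke-static {v0}, Lcom/x/Net;->send(I)V
.end method
.method public b()V
    invoke-static {v0}, Lcom/x/Sms;->send(I)V
.end method
"""
APIS = ["Lcom/x/Net;->open", "Lcom/x/Net;->send", "Lcom/x/Sms;->send"]
B = build_block_matrix([SMALI], APIS)
```

In this sketch the diagonal entry for an API is set as soon as the API occurs in any Block; whether the patent intends the diagonal to be populated is not stated.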
In the above technical solution, the step S2 is implemented by the following specific steps:
S2-1: a part of the Android software samples is selected as control samples, and the relationship matrix A0 between the APPs and the APIs of the control samples is constructed;
S2-2: a part of the Android software samples is selected as training samples, and the relationship matrix A1 between the APPs and the APIs of the training samples is constructed;
S2-3: a part of the Android software samples is selected as test samples, and the relationship matrix A2 between the APPs and the APIs of the test samples is constructed;
S2-4: the relationship matrices B, P and I are constructed using all Android software samples.
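Steps S2-1 to S2-3 build the same kind of APP-to-API matrix for three sample groups. A minimal sketch, assuming each APP has already been reduced to the set of selected APIs it calls (the function name and the sample sets are hypothetical):

```python
def build_app_api_matrix(app_api_sets, apis):
    """Row k is the 0/1 indicator vector of the selected APIs used by APP k."""
    return [[1 if api in used else 0 for api in apis] for used in app_api_sets]

# Hypothetical selected APIs and sample groups.
apis = ["api_a", "api_b", "api_c"]
control_apps = [{"api_a"}, {"api_b", "api_c"}]
training_apps = [{"api_a", "api_c"}]
test_apps = [set()]

A0 = build_app_api_matrix(control_apps, apis)   # control samples
A1 = build_app_api_matrix(training_apps, apis)  # training samples
A2 = build_app_api_matrix(test_apps, apis)      # test samples
```

All three matrices share the same column order because they are built from the same API list, which is what allows them to be combined with B, P and I later.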
In the above technical solution, the heterogeneous information network composed of different entities and relationships in step S3 means:
there are two types of entities, namely Android applications (APP), each represented by an APK file, and API calls (API); and there are four types of relationships, namely R0, R1, R2 and R3.
The two types of entities and the four relationships among them jointly form a heterogeneous information network G = <V, E>, where V is the node set, whose node types are limited to the two entity types APP and API, and E is the set of connections among the nodes, whose types are limited to the four relationships R0, R1, R2 and R3.
In the above technical solution, the method for generating different feature matrices according to different meta-paths in the heterogeneous information network in step S3 includes:
Meta-path
[Figure GDA0003541995280000041: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path.
For the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples appear in the same Block: 1 if they do, 0 otherwise.
Meta-path
[Figure GDA0003541995280000042: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path.
For the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples belong to the same Package: 1 if they do, 0 otherwise.
Meta-path
[Figure GDA0003541995280000043: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path.
For the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$. Each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples use the same Invoke-method: 1 if they do, 0 otherwise.
In the above expressions for $M_B$, $M_P$ and $M_I$, N denotes the number of sensitive APIs, and the two subscripts of the relationship matrices A0, A1, A2, B, P and I denote the row and the column of the corresponding matrix, respectively.
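The three feature matrices share one construction and differ only in the API-to-API relationship matrix used. The sketch below implements the stated rule for an arbitrary relationship matrix R (B, P or I), assuming zero-based indices so that N*n+j and N*m+i address distinct rows and columns; the function name and the toy matrices are illustrative.

```python
def build_feature_matrix(A_sample, A_control, R):
    """M[N*n + j][N*m + i] = 1 iff A_control[m][i], A_sample[n][j] and R[i][j] are all 1.

    A_sample is A1 (training) or A2 (test), A_control is A0,
    and R is one of the API-to-API matrices B, P or I.
    """
    N = len(R)
    M = [[0] * (N * len(A_control)) for _ in range(N * len(A_sample))]
    for n, sample_row in enumerate(A_sample):
        for j in range(N):
            if sample_row[j] != 1:
                continue
            for m, control_row in enumerate(A_control):
                for i in range(N):
                    if control_row[i] == 1 and R[i][j] == 1:
                        M[N * n + j][N * m + i] = 1
    return M

# Toy data: one control APP, two training APPs, N = 2 APIs.
A0 = [[1, 0]]
A1 = [[1, 1], [0, 1]]
B = [[1, 1], [1, 1]]
M_B = build_feature_matrix(A1, A0, B)
```

Passing A2 instead of A1 gives the test-set matrix with the same number of columns, since the column index depends only on the control samples.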
In the above technical solution, the combination of different feature matrices in step S4 includes:
a total of seven feature matrix combinations: $M_B$ alone, $M_P$ alone, $M_I$ alone, the combination of $M_B$ and $M_P$, the combination of $M_B$ and $M_I$, the combination of $M_P$ and $M_I$, and the combination of all three feature matrices $M_B$, $M_P$ and $M_I$.
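The seven combinations are exactly the non-empty subsets of the three feature matrices, which can be enumerated as in this short sketch (the string labels stand in for the actual matrices):

```python
from itertools import combinations

names = ["M_B", "M_P", "M_I"]
# All non-empty subsets, ordered by size: singles, pairs, then the full set.
feature_combinations = [
    combo for size in range(1, len(names) + 1) for combo in combinations(names, size)
]

for combo in feature_combinations:
    print(" + ".join(combo))
```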
In the above technical solution, the step S5 is implemented by the following specific steps:
S5-1: the three feature matrices $M_B$, $M_P$ and $M_I$ constructed in step S3 from the test samples and the control samples are combined in the different ways described above, each combination is used as the input of the corresponding multiple kernel learning model trained in step S4, and the accuracy of that model is measured;
S5-2: the accuracies of the multiple kernel learning models obtained in step S5-1 are compared, and a model whose accuracy meets the threshold is selected as the final Android malicious program detection classifier.
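Step S5-2 can be sketched as a simple filter over measured accuracies. The text does not say how to choose when several models meet the threshold; the sketch below assumes the most accurate eligible model is taken. The function name and the accuracy values are placeholders for illustration, not results from the patent.

```python
def select_classifier(accuracies, threshold):
    """Return the name of the most accurate model meeting the threshold, or None."""
    eligible = {name: acc for name, acc in accuracies.items() if acc >= threshold}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

# Placeholder accuracies, made up purely to exercise the function.
measured = {"M_B": 0.5, "M_B+M_P": 0.75, "M_B+M_P+M_I": 0.625}
chosen = select_classifier(measured, threshold=0.7)
```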
A system for detecting Android malicious programs based on a heterogeneous information network comprises:
a preprocessing module, which decompiles the APK files to obtain smali files and counts, across all the smali files, the API calls to be used;
a relationship matrix construction module, which extracts API calls from the smali files of all Android samples and constructs four relationship matrices according to the different API relationships;
a meta-path construction module, in which the four relationship matrices contain two types of entities and four relationships among them, thereby forming a heterogeneous information network, and which constructs a feature matrix for each of three different meta-paths in the heterogeneous information network;
a multiple kernel learning modeling module, which uses different feature matrix combinations as the input of multiple kernel learning models and trains a model for each combination;
a classifier, which tests the multiple kernel learning models trained on the different feature matrix combinations on a test data set and selects a model whose accuracy reaches a threshold as the final Android malicious program detection classifier.
Compared with the prior art, the invention has the beneficial effects that:
First, the APIs used in the invention are sensitive APIs. Compared with ordinary APIs, sensitive APIs are more often used by malware to perform sensitive or malicious operations. Sensitive APIs are therefore more representative of the malicious behavior of software, and they tend to appear less often in benign Android applications. Owing to this property, the invention can characterize the behavior of malware with a smaller number of APIs;
Second, the meta-path information used by the invention describes the API relationships among different APPs at a finer granularity. Unlike a traditional meta-path, which can only show how many APIs of different APPs appear in the same Block, the feature matrices constructed from the meta-paths in the invention describe exactly which APIs of different APPs appear in the same Block. Similarly, they describe which APIs of different APPs belong to the same Package and which use the same Invoke-method. This fine-grained feature information describes the API relationships among different APPs in detail, so that the learning model can capture more of the differences between Android malware and benign software and among different Android malware families, thereby improving the accuracy of the Android malicious program detection classifier;
Third, the resistance to code obfuscation is stronger.
Drawings
FIG. 1 is a block diagram of the system architecture of the present invention;
FIG. 2 is a schematic flow chart of the method for extracting the Android malicious program detection classification model.
Detailed Description
The invention is further illustrated by the following specific examples.
The invention provides a method for detecting Android malicious programs based on a heterogeneous information network, comprising the following steps:
S1: a preprocessing module decompiles all APK files to obtain smali files, counts the number of times each API is called across all the smali files, and selects the N most frequently called sensitive APIs;
S2: a relationship matrix construction module extracts sensitive API calls from all the smali files and constructs four relationship matrices according to the different sensitive API relationships;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, in which a meta-path construction module generates different feature matrices according to different meta-paths;
S4: different combinations of the feature matrices are used to train multiple kernel learning models;
S5: the multiple kernel learning models trained on the different feature matrix combinations are tested on a test data set, and a model whose accuracy reaches a threshold is selected as the final Android malicious program detection classifier.
In the present invention, the step S1 is implemented by the following specific steps:
S1-1: the APK files of all collected Android samples are decompressed to obtain dex files;
S1-2: the dex files are decompiled to obtain smali code files;
S1-3: the number of times each API is called is counted across all the smali files, and the N most frequently called sensitive APIs are selected. A relatively small number of sensitive APIs can be used here, because the length of the feature vector grows with the square of the number of sensitive APIs, and because sensitive APIs represent the behavior of Android malware better than ordinary APIs.
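The quadratic growth can be made explicit from the feature matrix indices used in step S3, where the row index is $N \cdot n + j$ and the column index is $N \cdot m + i$. As a sketch, write $N_s$ for the number of training (or test) APPs and $N_c$ for the number of control APPs; both symbols are introduced here for illustration and do not appear in the original text. Each index runs over N values per APP, so a feature matrix has

$$(N \cdot N_s) \times (N \cdot N_c) = N^2 \, N_s \, N_c$$

entries. Each pair of one sample APP and one control APP therefore contributes an $N \times N$ block, and doubling N quadruples the size of the matrix.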
In the present invention, the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and sensitive APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $APP_i$ contains sensitive $API_j$: $a_{ij} = 1$ means that $APP_i$ contains sensitive $API_j$, and $a_{ij} = 0$ means that it does not. This relationship is denoted R0;
B: the relationship matrix B represents a relationship between sensitive APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether sensitive $API_i$ and sensitive $API_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not. This relationship is denoted R1;
P: the relationship matrix P represents a relationship between sensitive APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether sensitive $API_i$ and sensitive $API_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not. This relationship is denoted R2;
I: the relationship matrix I represents a relationship between sensitive APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether sensitive $API_i$ and sensitive $API_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not. This relationship is denoted R3.
Here, $APP_i$ denotes the i-th Android sample application and $API_i$ denotes the i-th selected sensitive API.
In the invention, the Block refers to the smali code between a pair of ".method" and ".end method" directives in a smali file.
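The text does not define Package and Invoke-method further. The sketch below is one possible reading, under two explicit assumptions: the Package of an API is the directory part of its smali class descriptor, and the Invoke-method of an API is the smali opcode used to call it (for example `invoke-static` or `invoke-virtual`). All function names and sample values are hypothetical.

```python
def package_of(api):
    """'Landroid/util/Log;->d' -> 'android/util' (assumed meaning of Package)."""
    class_path = api.split(";->", 1)[0][1:]  # drop the leading 'L' of the descriptor
    return class_path.rpartition("/")[0]

def same_attribute_matrix(apis, attribute):
    """R[i][j] == 1 iff attribute(apis[i]) == attribute(apis[j])."""
    keys = [attribute(api) for api in apis]
    return [[1 if a == b else 0 for b in keys] for a in keys]

# Hypothetical selected APIs and the opcode each one is assumed to be called with.
apis = [
    "Landroid/util/Log;->d",
    "Landroid/telephony/TelephonyManager;->getDeviceId",
    "Landroid/telephony/SmsManager;->getDefault",
]
invoke_kind = {
    apis[0]: "invoke-static",
    apis[1]: "invoke-virtual",
    apis[2]: "invoke-static",
}

P = same_attribute_matrix(apis, package_of)
I = same_attribute_matrix(apis, invoke_kind.__getitem__)
```

Using `invoke_kind.__getitem__` rather than `.get` makes an API with no recorded opcode raise a `KeyError` instead of silently matching other unrecorded APIs.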
In the present invention, in order to give the feature matrix constructed from the training set and the feature matrix constructed from the test set the same dimension, and to ensure the consistency of the data, a fixed set of Android software samples is additionally selected as control samples. Step S2 is therefore implemented by the following specific steps:
S2-1: a part of the Android software samples is selected as control samples, and the relationship matrix A0 between the APPs and the sensitive APIs of the control samples is constructed;
S2-2: a part of the Android software samples is selected as training samples, and the relationship matrix A1 between the APPs and the sensitive APIs of the training samples is constructed;
S2-3: a part of the Android software samples is selected as test samples, and the relationship matrix A2 between the APPs and the sensitive APIs of the test samples is constructed;
S2-4: the relationship matrices B, P and I are constructed using all Android software samples.
In the present invention, the heterogeneous information network composed of different entities and relationships in step S3 refers to:
there are two types of entities, namely Android applications (APP), each represented by an APK file, and sensitive API calls (API); and there are four types of relationships, namely R0, R1, R2 and R3.
The two types of entities and the four relationships among them jointly form a heterogeneous information network G = <V, E>, where V is the node set, whose node types are limited to the two entity types APP and sensitive API, and E is the set of connections among the nodes, whose types are limited to the four relationships R0, R1, R2 and R3.
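One way to picture G = <V, E> is as a list of typed nodes plus a list of typed edges derived from the four relationship matrices. The representation below (tuples tagged with "APP", "API" and the relationship name) is an illustrative assumption, not a data structure prescribed by the patent.

```python
def build_hin(A, B, P, I):
    """Return (nodes, edges) of the heterogeneous network implied by A, B, P and I."""
    nodes = [("APP", k) for k in range(len(A))] + [("API", k) for k in range(len(B))]
    edges = []
    # R0 links an APP node to each API node it contains.
    for app, row in enumerate(A):
        edges += [("R0", ("APP", app), ("API", api)) for api, v in enumerate(row) if v == 1]
    # R1, R2 and R3 link pairs of API nodes.
    for name, R in (("R1", B), ("R2", P), ("R3", I)):
        for i, row in enumerate(R):
            edges += [(name, ("API", i), ("API", j)) for j, v in enumerate(row) if v == 1]
    return nodes, edges

# Toy matrices: one APP and two APIs.
nodes, edges = build_hin(
    A=[[1, 0]],
    B=[[1, 0], [0, 1]],
    P=[[1, 1], [1, 1]],
    I=[[1, 0], [0, 1]],
)
```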
In the present invention, the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:
Meta-path
[Figure GDA0003541995280000081: meta-path formula image]
This meta-path means that a sensitive API in one APP and a sensitive API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path.
For the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training samples (or test samples) and the i-th sensitive API in the m-th APP of the control samples appear in the same Block: 1 if they do, 0 otherwise.
Meta-path
[Figure GDA0003541995280000082: meta-path formula image]
This meta-path means that a sensitive API in one APP and a sensitive API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path.
For the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training samples (or test samples) and the i-th sensitive API in the m-th APP of the control samples belong to the same Package: 1 if they do, 0 otherwise.
Meta-path
[Figure GDA0003541995280000091: meta-path formula image]
This meta-path means that a sensitive API in one APP and a sensitive API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path.
For the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$. Each entry indicates whether the j-th sensitive API in the n-th APP of the training samples (or test samples) and the i-th sensitive API in the m-th APP of the control samples use the same Invoke-method: 1 if they do, 0 otherwise.
In the above expressions for $M_B$, $M_P$ and $M_I$, N denotes the number of sensitive APIs, and the two subscripts of the matrices A0, A1, A2, B, P and I denote the row and the column of the corresponding matrix, respectively.
Here, the $M_B$, $M_P$ and $M_I$ feature matrices of both the training data set and the test data set are constructed relative to the same control samples, so m ranges over the same control APPs in the training set and the test set, which keeps the dimension of the feature matrices unchanged.
In the present invention, the different feature matrix combinations in step S4 include:
a total of seven feature matrix combinations: $M_B$ alone, $M_P$ alone, $M_I$ alone, the combination of $M_B$ and $M_P$, the combination of $M_B$ and $M_I$, the combination of $M_P$ and $M_I$, and the combination of all three feature matrices $M_B$, $M_P$ and $M_I$.
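The patent does not specify which kernels the multiple kernel learning model uses or how they are weighted. As a rough sketch of the general idea only, the code below computes one linear kernel (Gram matrix) per feature matrix and merges them with fixed weights; a real multiple kernel learning method would learn the weights together with the classifier. It also assumes that each APP has already been mapped to one feature vector per feature matrix, a step the text does not detail. All names and values are illustrative.

```python
def linear_kernel(X):
    """Gram matrix of the row vectors in X under the dot product."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernel matrices."""
    size = len(kernels[0])
    return [
        [sum(w * K[r][c] for w, K in zip(weights, kernels)) for c in range(size)]
        for r in range(size)
    ]

# Hypothetical per-APP feature vectors derived from M_B and M_P (two APPs each).
X_B = [[1, 0], [0, 1]]
X_P = [[1, 1], [0, 1]]

K_B = linear_kernel(X_B)
K_P = linear_kernel(X_P)
K = combine_kernels([K_B, K_P], weights=[0.5, 0.5])  # fixed uniform weights (assumption)
```

The combined matrix K could then be handed to any kernel classifier that accepts a precomputed kernel.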
In the present invention, the step S5 is implemented by the following specific steps:
S5-1: the three feature matrices $M_B$, $M_P$ and $M_I$ constructed in step S3 from the test samples and the control samples are combined in the different ways described above, each combination is used as the input of the corresponding multiple kernel learning model trained in step S4, and the accuracy of that model is measured;
S5-2: the accuracies of the multiple kernel learning models obtained in step S5-1 are compared, and a model whose accuracy meets the threshold is selected as the final Android malicious program detection classifier.
The present embodiments are to be considered illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description. All technical solutions formed by transformation or equivalent substitution fall within the protection scope of the present invention.

Claims (8)

1. A method for detecting android malicious programs based on a heterogeneous information network is characterized by comprising the following steps:
S1: a preprocessing module decompiles all APK files to obtain smali files, counts the number of times each API is called across all the smali files, and selects the N most frequently called APIs;
S2: a relationship matrix construction module extracts API calls from all the smali files and constructs four relationship matrices according to the different API relationships;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, in which a meta-path construction module generates different feature matrices according to different meta-paths;
S4: different combinations of the feature matrices are used to train multiple kernel learning models;
S5: the multiple kernel learning models trained on the different feature matrix combinations are tested on a test data set, and a model whose accuracy reaches a threshold is selected as the final Android malicious program detection classifier;
the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and APIs, where $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $APP_i$ contains $API_j$: $a_{ij} = 1$ means that $APP_i$ contains $API_j$, and $a_{ij} = 0$ means that it does not; this relationship is denoted R0;
B: the relationship matrix B represents a relationship between APIs, where $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not; this relationship is denoted R1;
P: the relationship matrix P represents a relationship between APIs, where $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not; this relationship is denoted R2;
I: the relationship matrix I represents a relationship between APIs, where $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $API_i$ and $API_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not; this relationship is denoted R3;
wherein $APP_i$ denotes the i-th Android sample application and $API_i$ denotes the i-th selected sensitive API;
the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:
Meta-path
[Figure FDA0003556388980000021: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path;
for the training set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (B_{ij} = 1)$; each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples appear in the same Block: 1 if they do, 0 otherwise;
Meta-path
[Figure FDA0003556388980000022: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path;
for the training set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (P_{ij} = 1)$; each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples belong to the same Package: 1 if they do, 0 otherwise;
Meta-path
[Figure FDA0003556388980000023: meta-path formula image]
This meta-path means that an API in one APP and an API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path;
for the training set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A1_{nj} = 1) \land (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\ N \cdot m + i) = (A0_{mi} = 1) \land (A2_{nj} = 1) \land (I_{ij} = 1)$; each entry indicates whether the j-th API in the n-th APP of the training samples (or test samples) and the i-th API in the m-th APP of the control samples use the same Invoke-method: 1 if they do, 0 otherwise;
in the above expressions for M_B, M_P and M_I, N denotes the number of sensitive APIs, and the two subscripts of each of the relationship matrices A0, A1, A2, B, P and I denote the corresponding row and column of that matrix, respectively.
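The indexing rule for the meta-path feature matrices can be illustrated with a minimal Python sketch. It assumes zero-based indices and plain nested lists for the relationship matrices; the function name `build_feature_matrix` and the toy data are hypothetical and are not part of the claims.

```python
def build_feature_matrix(a_control, a_sample, rel):
    """Return M with M[N*n + j][N*m + i] = 1 exactly when
    a_control[m][i] == 1, a_sample[n][j] == 1 and rel[i][j] == 1.

    a_control plays the role of A0, a_sample of A1 (training) or A2 (test),
    and rel of B, P or I. Zero-based indexing is an assumption of this sketch.
    """
    num_apis = len(rel)                      # N, the number of sensitive APIs
    rows = num_apis * len(a_sample)
    cols = num_apis * len(a_control)
    matrix = [[0] * cols for _ in range(rows)]
    for n, sample_row in enumerate(a_sample):
        for j in range(num_apis):
            if sample_row[j] != 1:
                continue                     # sample APP n does not contain API j
            for m, control_row in enumerate(a_control):
                for i in range(num_apis):
                    if control_row[i] == 1 and rel[i][j] == 1:
                        matrix[num_apis * n + j][num_apis * m + i] = 1
    return matrix


# Toy data (illustrative only): N = 2 APIs, one control APP, two training APPs.
A0 = [[1, 1]]            # control APP 0 contains API 0 and API 1
A1 = [[1, 0], [0, 1]]    # training APP 0 contains API 0, training APP 1 contains API 1
B = [[1, 0], [0, 1]]     # each API shares a Block only with itself
M_B = build_feature_matrix(A0, A1, B)
```

Passing A2 instead of A1 would give the test-set matrix, and passing P or I instead of B would give M_P or M_I under the same assumptions.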
2. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S1 is implemented by the following specific steps:
S1-1: decompressing the APK files of all the collected android samples to obtain the dex files;
S1-2: decompiling the dex files to obtain the smali code files;
S1-3: counting, across all the smali files, the number of times each API is called, and selecting the N APIs with the most calls.
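Step S1-3 can be sketched in Python as follows. This is only an illustration: it assumes the smali text is already available as strings (decompression and decompilation are not shown), and the regular expression, the function name `top_n_apis` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches smali invoke instructions and captures "Lpackage/Class;->method".
# The pattern is an assumption of this sketch, not taken from the patent.
INVOKE_RE = re.compile(r"invoke-[\w/-]+\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def top_n_apis(smali_texts, n):
    """Count API calls over all smali texts and return the n most frequent."""
    counts = Counter()
    for text in smali_texts:
        counts.update(INVOKE_RE.findall(text))
    return [api for api, _ in counts.most_common(n)]


# Illustrative smali fragments from two hypothetical samples.
samples = [
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;\n"
    "invoke-static {v1, v2}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I\n",
    "invoke-virtual {v2}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;\n",
]
top = top_n_apis(samples, 1)
```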
3. The method for detecting the android malicious program based on the heterogeneous information network of claim 2, wherein the Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.
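A minimal Python sketch of splitting a smali file into Blocks is given below. It assumes that each method body is closed by the standard smali directive `.end method`; the function name `split_blocks` and the sample text are hypothetical.

```python
def split_blocks(smali_text):
    """Return one list of instruction lines per method body (Block)."""
    blocks = []
    current = None                          # None means "outside any method"
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = []                    # a new Block starts here
        elif stripped.startswith(".end method"):
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None and stripped:
            current.append(stripped)
    return blocks


# Illustrative smali text with two methods.
sample = """
.class public LFoo;
.method public a()V
    invoke-static {}, LBar;->x()V
    return-void
.end method
.method public b()V
    return-void
.end method
"""
blocks = split_blocks(sample)
```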
4. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S2 is implemented by the following specific steps:
S2-1: selecting a part of the android software samples as control samples, and constructing a relationship matrix A0 between the APPs and the APIs of the control samples;
S2-2: selecting a part of the android software samples as training samples, and constructing a relationship matrix A1 between the APPs and the APIs of the training samples;
S2-3: selecting a part of the android software samples as test samples, and constructing a relationship matrix A2 between the APPs and the APIs of the test samples;
S2-4: constructing the relationship matrices B, P and I using all the android software samples.
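The APP-to-API matrices A0, A1 and A2 can be sketched as 0/1 tables in Python. The sketch assumes that the set of APIs called by each APP has already been extracted; the function name `app_api_matrix`, the shortened API names and the sample sets are hypothetical.

```python
def app_api_matrix(app_api_sets, selected_apis):
    """Row a, column k is 1 when APP a calls the k-th selected API, else 0."""
    return [[1 if api in called else 0 for api in selected_apis]
            for called in app_api_sets]


# Illustrative data: three selected APIs and a hypothetical split of samples.
selected_apis = ["getDeviceId", "sendTextMessage", "openConnection"]
control_apps = [{"getDeviceId", "openConnection"}]
training_apps = [{"sendTextMessage"}, {"getDeviceId", "sendTextMessage"}]
test_apps = [set()]

A0 = app_api_matrix(control_apps, selected_apis)    # control samples
A1 = app_api_matrix(training_apps, selected_apis)   # training samples
A2 = app_api_matrix(test_apps, selected_apis)       # test samples
```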
5. The method for android malware detection based on heterogeneous information network of claim 3, wherein in the step S3, constructing the heterogeneous information network from the entities and the different relationships in the relationship matrices comprises:
there are two types of entities, namely the android application program APP represented by an APK file, and the API call; and there are four different relationships, namely R0, R1, R2 and R3;
the two types of entities and the four relationships between them jointly form a heterogeneous information network G = <V, E>, wherein V is the set of nodes, whose types are limited to the two entities APP and API; and E is the set of connections among the nodes, whose types are limited to the four relationships R0, R1, R2 and R3.
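One way to picture G = <V, E> is as a set of typed nodes and a set of typed edges. The Python sketch below assumes that R0 is read from the APP-to-API matrix A and that R1, R2 and R3 are read from the API-to-API matrices B, P and I; the function name `build_hin` and the toy matrices are hypothetical.

```python
def build_hin(A, B, P, I):
    """Return (nodes, edges) with typed nodes and relation-labelled edges."""
    nodes = {("APP", a) for a in range(len(A))}
    nodes |= {("API", k) for k in range(len(B))}
    edges = set()
    for a, row in enumerate(A):                      # R0: APP contains API
        for k, flag in enumerate(row):
            if flag == 1:
                edges.add((("APP", a), "R0", ("API", k)))
    for name, rel in (("R1", B), ("R2", P), ("R3", I)):
        for i, row in enumerate(rel):                # R1..R3: API to API
            for j, flag in enumerate(row):
                if flag == 1:
                    edges.add((("API", i), name, ("API", j)))
    return nodes, edges


# Toy matrices (illustrative only): one APP and two APIs.
A = [[1, 0]]
B = [[1, 0], [0, 1]]
P = [[1, 1], [1, 1]]
I = [[0, 0], [0, 0]]
nodes, edges = build_hin(A, B, P, I)
```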
6. The method for android malware detection based on heterogeneous information network of claim 1, wherein the different feature matrix combinations of step S4 include:
a total of seven feature matrix combinations: the M_B matrix alone, the M_P matrix alone, the M_I matrix alone, the combination of the M_B and M_P matrices, the combination of the M_B and M_I matrices, the combination of the M_P and M_I matrices, and the combination of the three feature matrices M_B, M_P and M_I.
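The seven combinations are exactly the non-empty subsets of the three feature matrices, which a short Python sketch can enumerate. The string labels stand in for the actual matrices and are only illustrative.

```python
from itertools import combinations

matrix_names = ["M_B", "M_P", "M_I"]

# All non-empty subsets of the three feature matrices, smallest first.
feature_combinations = [
    combo
    for size in range(1, len(matrix_names) + 1)
    for combo in combinations(matrix_names, size)
]
```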
7. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S5 is implemented by the following specific steps:
S5-1: combining, in the different feature matrix combinations, the three feature matrices M_B, M_P and M_I constructed in step S3 from the test samples and the control samples, using each combined feature matrix as the input of the corresponding multi-core learning model constructed in step S4, and testing the accuracy of that model;
S5-2: comparing the accuracies of the multi-core learning models obtained in step S5-1, and selecting a model whose accuracy meets the threshold as the final android malicious program detection classifier.
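The selection in step S5-2 can be sketched in Python as below. The claim only requires that the accuracy meets the threshold; picking the most accurate qualifying model is an assumption of this sketch, and the function name `select_classifier` and the accuracy values are invented for illustration and are not results from the patent.

```python
def select_classifier(accuracies, threshold):
    """Return the name of the most accurate model meeting the threshold, or None."""
    qualifying = {name: acc for name, acc in accuracies.items() if acc >= threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


# Hypothetical accuracies per feature matrix combination (illustrative only).
example_accuracies = {"M_B": 0.91, "M_P": 0.88, "M_B+M_P": 0.95}
chosen = select_classifier(example_accuracies, threshold=0.90)
```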
8. A system for android malicious program detection based on a heterogeneous information network, characterized by comprising:
a preprocessing module: responsible for decompiling the APK files to obtain smali files, and for counting, from all the smali files, the API calls to be used;
a relationship matrix construction module: extracting the API calls from the smali files of all android samples, and constructing four relationship matrices according to the different API relationships;
a meta path construction module: forming a heterogeneous information network from the two entities and the four relationships between them that are contained in the four relationship matrices, and constructing a feature matrix for each of three different meta paths in the heterogeneous information network;
a multi-core learning modeling module: using the different feature matrix combinations as the input of the multi-core learning model, and modeling a multi-core learning model for each combination;
a classifier: testing, with a test data set, the multi-core learning models built from the different feature matrix combinations, and selecting a model whose accuracy reaches a threshold as the final android malicious program detection classifier.
The four relationship matrices include:
A: the relationship matrix A represents the relationship between APPs and APIs; A_ij = a_ij ∈ {0,1} indicates whether APP_i contains API_j: if a_ij = 1, APP_i contains API_j, otherwise it does not; this relationship is denoted by R0;
B: the relationship matrix B represents a relationship between APIs; B_ij = b_ij ∈ {0,1} indicates whether API_i and API_j are in the same Block: if b_ij = 1, API_i and API_j are in the same Block, otherwise they are not; this relationship is denoted by R1;
P: the relationship matrix P represents a relationship between APIs; P_ij = p_ij ∈ {0,1} indicates whether API_i and API_j belong to the same Package: if p_ij = 1, API_i and API_j belong to the same Package, otherwise they do not; this relationship is denoted by R2;
I: the relationship matrix I represents a relationship between APIs; I_ij = i_ij ∈ {0,1} indicates whether API_i and API_j use the same Invoke-method: if i_ij = 1, API_i and API_j use the same Invoke-method, otherwise they do not; this relationship is denoted by R3;
wherein APP_i denotes the i-th android sample software, and API_i denotes the i-th selected sensitive API;
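Two of the API-to-API matrices can be sketched in Python under stated assumptions. The sketch assumes that each Block is given as a list of API names, and that the Package of an API is the directory part of its smali class descriptor; both conventions, the function names and the sample data are hypothetical.

```python
def block_matrix(blocks, apis):
    """B[i][j] = 1 when API i and API j occur together in at least one Block."""
    index = {api: k for k, api in enumerate(apis)}
    size = len(apis)
    b = [[0] * size for _ in range(size)]
    for block in blocks:
        present = [index[api] for api in block if api in index]
        for i in present:
            for j in present:
                b[i][j] = 1
    return b


def package_of(api):
    """'Landroid/telephony/SmsManager;->send' -> 'android/telephony' (assumed rule)."""
    class_name = api.split(";->")[0]
    return class_name[1:].rpartition("/")[0]


def package_matrix(apis):
    """P[i][j] = 1 when API i and API j have the same package."""
    packages = [package_of(api) for api in apis]
    return [[1 if pi == pj else 0 for pj in packages] for pi in packages]


apis = [
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/telephony/TelephonyManager;->getDeviceId",
    "Ljava/net/URL;->openConnection",
]
blocks = [[apis[0], apis[2]], [apis[1]]]   # illustrative Blocks
B = block_matrix(blocks, apis)
P = package_matrix(apis)
```

The matrix I could be built in the same way as P by grouping APIs on the invoke instruction they are called with, which is left out here.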
the method for generating the different feature matrices according to the different meta paths in the heterogeneous information network comprises:
the meta path shown in Figure FDA0003556388980000051 means that an API in one APP and an API in another APP appear in the same Block, and M_B denotes the feature matrix generated from this meta path;
M_B of the training set is M_B(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (B_ij == 1), and M_B of the test set is M_B(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (B_ij == 1); these entries respectively indicate whether the j-th API in the n-th APP of the training sample or the test sample and the i-th API in the m-th APP of the control sample are in the same Block, where a value of 1 means that they are in the same Block and any other value means that they are not;
the meta path shown in Figure FDA0003556388980000052 means that an API in one APP and an API in another APP belong to the same Package, and M_P denotes the feature matrix generated from this meta path;
M_P of the training set is M_P(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (P_ij == 1), and M_P of the test set is M_P(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (P_ij == 1); these entries respectively indicate whether the j-th API in the n-th APP of the training sample or the test sample and the i-th API in the m-th APP of the control sample belong to the same Package, where a value of 1 means that they belong to the same Package and any other value means that they do not;
the meta path shown in Figure FDA0003556388980000061 means that an API in one APP and an API in another APP use the same Invoke-method, and M_I denotes the feature matrix generated from this meta path;
M_I of the training set is M_I(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (I_ij == 1), and M_I of the test set is M_I(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (I_ij == 1); these entries respectively indicate whether the j-th API in the n-th APP of the training sample or the test sample and the i-th API in the m-th APP of the control sample use the same Invoke-method, where a value of 1 means that they use the same Invoke-method and any other value means that they do not;
in the above expressions for M_B, M_P and M_I, N denotes the number of sensitive APIs, and the two subscripts of each of the relationship matrices A0, A1, A2, B, P and I denote the corresponding row and column of that matrix, respectively.
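Because each feature matrix is addressed by the flattened positions N*n+j and N*m+i, an entry can be mapped back to the APPs and APIs it refers to. The Python sketch below assumes zero-based indices, so that j and i range over 0 to N-1; the function name `decode_entry` is hypothetical.

```python
def decode_entry(row, col, num_apis):
    """Map a (row, col) position back to (n, j) and (m, i).

    row = N*n + j identifies API j of sample APP n (training or test);
    col = N*m + i identifies API i of control APP m.
    """
    n, j = divmod(row, num_apis)
    m, i = divmod(col, num_apis)
    return {"sample_app": n, "sample_api": j, "control_app": m, "control_api": i}


# With N = 3 sensitive APIs, row 7 is API 1 of sample APP 2,
# and column 5 is API 2 of control APP 1.
entry = decode_entry(7, 5, num_apis=3)
```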
CN202011206884.8A 2020-11-02 2020-11-02 Android malicious program detection method and system based on heterogeneous information network Active CN112149124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011206884.8A CN112149124B (en) 2020-11-02 2020-11-02 Android malicious program detection method and system based on heterogeneous information network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011206884.8A CN112149124B (en) 2020-11-02 2020-11-02 Android malicious program detection method and system based on heterogeneous information network

Publications (2)

Publication Number Publication Date
CN112149124A CN112149124A (en) 2020-12-29
CN112149124B CN112149124B (en) 2022-04-29

Family

ID=73953779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011206884.8A Active CN112149124B (en) 2020-11-02 2020-11-02 Android malicious program detection method and system based on heterogeneous information network

Country Status (1)

Country Link
CN (1) CN112149124B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553446B (en) * 2021-07-28 2022-05-24 厦门国际银行股份有限公司 Financial anti-fraud method and device based on heterograph deconstruction
CN114491529A (en) * 2021-12-20 2022-05-13 西安电子科技大学 Android malicious application program identification method based on multi-modal neural network
CN114756860A (en) * 2022-02-22 2022-07-15 广州大学 Malicious software detection method based on meta-path
CN114662105B (en) * 2022-03-17 2023-03-31 电子科技大学 Method and system for identifying Android malicious software based on graph node relationship and graph compression
CN114722391B (en) * 2022-04-07 2023-03-28 电子科技大学 Method for detecting android malicious program

Citations (9)

Publication number Priority date Publication date Assignee Title
CN109670306A (en) * 2018-11-27 2019-04-23 国网山东省电力公司济宁供电公司 Electric power malicious code detecting method, server and system based on artificial intelligence
CN109711163A (en) * 2018-12-26 2019-05-03 西安电子科技大学 Android malware detection method based on API Calls sequence
CN110298173A (en) * 2018-03-23 2019-10-01 瞻博网络公司 The detection Malware hiding by the delay circulation of software program
CN110348214A (en) * 2019-07-16 2019-10-18 电子科技大学 To the method and system of Malicious Code Detection
CN110532776A (en) * 2019-09-05 2019-12-03 广西大学 Android malware efficient detection method, system and medium based on runtime data analysis
CN111163057A (en) * 2019-12-09 2020-05-15 中国科学院信息工程研究所 User identification system and method based on heterogeneous information network embedding algorithm
CN111316268A (en) * 2017-09-06 2020-06-19 分形工业有限公司 Advanced cyber-security threat mitigation for interbank financial transactions
KR20200076426A (en) * 2018-12-19 2020-06-29 건국대학교 산학협력단 Method and apparatus for malicious detection based on heterogeneous information network
CN111523117A (en) * 2020-04-10 2020-08-11 西安电子科技大学 Android malicious software detection and malicious code positioning system and method

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8571255B2 (en) * 2009-01-07 2013-10-29 Dolby Laboratories Licensing Corporation Scalable media fingerprint extraction
CN110069927A (en) * 2019-04-22 2019-07-30 中国民航大学 Malice APK detection method, system, data storage device and detection program
CN110598130B (en) * 2019-09-30 2022-06-24 重庆邮电大学 Movie recommendation method integrating heterogeneous information network and deep learning

Non-Patent Citations (1)

Title
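APT attack prediction method based on tree structure; Zhang Xiaosong et al.; Journal of University of Electronic Science and Technology of China; 20160730; Vol. 45, No. 4; pp. 582-588 *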

Also Published As

Publication number Publication date
CN112149124A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN112149124B (en) Android malicious program detection method and system based on heterogeneous information network
CN107392025B (en) Malicious android application program detection method based on deep learning
CN107256357B (en) Detection and analysis method for android malicious application based on deep learning
CN106055980B (en) A kind of rule-based JavaScript safety detecting method
CN109271788B (en) Android malicious software detection method based on deep learning
CN106570399B (en) A kind of detection method of across App inter-module privacy leakage
CN108200054A (en) A kind of malice domain name detection method and device based on dns resolution
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
CN108133139A (en) A kind of Android malicious application detecting system compared based on more running environment behaviors
CN109614795B (en) Event-aware android malicious software detection method
CN103679030B (en) Malicious code analysis and detection method based on dynamic semantic features
Li et al. Opcode sequence analysis of Android malware by a convolutional neural network
CN108712448A (en) A kind of injection attack detection model based on the analysis of dynamic stain
CN107944274A (en) A kind of Android platform malicious application off-line checking method based on width study
Daoudi et al. A deep dive inside drebin: An explorative analysis beyond android malware detection scores
CN109214178A (en) APP application malicious act detection method and device
CN109255241A (en) Android privilege-escalation leak detection method and system based on machine learning
CN113901465A (en) Heterogeneous network-based Android malicious software detection method
CN111049828B (en) Network attack detection and response method and system
CN113378167A (en) Malicious software detection method based on improved naive Bayes algorithm and gated loop unit mixing
CN110069927A (en) Malice APK detection method, system, data storage device and detection program
Congyi et al. Method for detecting Android malware based on ensemble learning
Waghmare et al. A review on malware detection methods
CN108427882A (en) The Android software dynamic analysis detection method of Behavior-based control feature extraction
CN112100621B (en) Android malicious application detection method based on sensitive permission and API

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant