CN112149124B - Android malicious program detection method and system based on heterogeneous information network - Google Patents
- Publication number
- CN112149124B, CN202011206884.8A
- Authority
- CN
- China
- Prior art keywords
- api
- app
- matrix
- relationship
- android
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a method and a system for detecting Android malicious programs based on a heterogeneous information network, and belongs to the field of software security detection. The main scheme is as follows: a preprocessing module decompiles all APK files to obtain smali files and selects the APIs to be used; a relationship matrix construction module constructs four relationship matrices from the smali files; the entities and the different relationships in the relationship matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths; different combinations of the feature matrices are used to train multiple kernel learning models; the trained multiple kernel learning models are tested, and the model with the best performance is selected as the final Android malicious program detection classifier. The invention analyzes the different relationships among APIs, incorporates more detailed API information into the meta-paths, aggregates the high-level semantics of the different meta-paths by means of multiple kernel learning, and thereby detects and identifies Android malicious programs effectively.
Description
Technical Field
The invention belongs to the field of software security detection and discloses a method and a system for detecting Android malicious programs based on a heterogeneous information network.
Background
Today, mobile devices such as smartphones are widely used in daily life. Owing to the popularity of Android devices and the openness of the Android OS, the amount of Android malware is increasing rapidly. Infection of a mobile device with malware may leak important private information, such as a user's account and password. In addition, the rise of malware that wastes users' time and defrauds them of money causes economic losses. There is therefore an urgent need to detect and defend against Android malware effectively.
The currently mainstream Android malware detection methods identify and classify malicious applications using machine learning, with different features (mainly API call features, assembly code features, and binary code features) as input. Current Android malware detection methods fall into two broad categories: signature-based methods and behavior-based methods. The first approach classifies malicious Android applications according to unique digital signatures of known malware types. Such signatures are generated by statically analyzing known malware samples to examine their code structure, including assembly code and binary-level features. However, signature-based methods are generally unable to detect unknown Android malware or more complex malware, such as polymorphic malware. The second approach executes a given Android malware sample in a sandbox and obtains its runtime behavior information, from which relevant API call features and assembly code features can be extracted. This behavior-based approach is called dynamic analysis. However, methods based on dynamic analysis inherently incur high runtime overhead and may miss hidden malicious behavior that is not triggered during monitoring. The performance of machine-learning-based Android malware detection depends on how well the extracted features capture the differences among different types of malicious Android applications and benign Android applications. The features used by existing Android malware detection methods are usually too simple and do not consider the rich relationships among the functions, so detection accuracy is limited.
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide a method and a system for detecting Android malicious programs based on a heterogeneous information network. The method analyzes the different relationships among APIs, incorporates more detailed API information into the meta-paths, and aggregates the high-level semantics of the different meta-paths by means of multiple kernel learning, so that Android malicious programs can be detected and identified effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for detecting Android malicious programs based on a heterogeneous information network comprises the following steps:
S1: in a preprocessing module, decompiling all APK files to obtain smali files, counting the number of times each API is called across all the smali files, and selecting the N most frequently called APIs;
S2: a relationship matrix construction module extracts API calls from all the smali files and constructs four relationship matrices according to the different API relationships;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths;
S4: using different combinations of the feature matrices to train multiple kernel learning models;
S5: testing the multiple kernel learning models trained with the different feature matrix combinations on a test data set, and selecting a model whose accuracy reaches a threshold as the final Android malicious program detection classifier.
In the above technical solution, the step S1 is implemented by the following specific steps:
S1-1: decompressing the APK files of all collected Android samples to obtain dex files;
S1-2: decompiling the dex files to obtain smali code files;
S1-3: counting the number of times each API is called across all the smali files, and selecting the N most frequently called APIs.
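Step S1-3 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the smali text has already been produced by steps S1-1 and S1-2, and the regular expression `INVOKE_RE` and the function names `count_api_calls` and `top_n_apis` are hypothetical.

```python
import re
from collections import Counter

# Matches a smali invoke instruction and captures "Lpkg/Class;->method".
# The pattern is an illustrative assumption, not taken from the patent.
INVOKE_RE = re.compile(
    r"^\s*invoke-[a-z]+(?:/range)?\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)"
)

def count_api_calls(smali_texts):
    """Count how often each API is invoked across all smali texts."""
    counts = Counter()
    for text in smali_texts:
        for line in text.splitlines():
            match = INVOKE_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def top_n_apis(smali_texts, n):
    """Return the n most frequently called APIs (ties broken by name)."""
    counts = count_api_calls(smali_texts)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [api for api, _ in ranked[:n]]

# Toy smali fragment, made up for illustration only.
SAMPLE = """
.method public run()V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    invoke-static {v1, v2}, Landroid/util/Log;->d(Ljava/lang/String;Ljava/lang/String;)I
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
.end method
"""

print(top_n_apis([SAMPLE], 1))
```

The sketch ranks every invoked API. Restricting the ranking to a list of sensitive APIs, as the detailed description does, would need an additional filter that the source does not spell out.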
In the above technical solution, the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains $\mathrm{API}_j$: $a_{ij} = 1$ means that $\mathrm{APP}_i$ contains $\mathrm{API}_j$, and $a_{ij} = 0$ means that it does not. This relationship is denoted R0;
B: the relationship matrix B represents a relationship between APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not. This relationship is denoted R1;
P: the relationship matrix P represents a relationship between APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not. This relationship is denoted R2;
I: the relationship matrix I represents a relationship between APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not. This relationship is denoted R3.
Here, $\mathrm{APP}_i$ denotes the i-th Android sample application and $\mathrm{API}_i$ denotes the i-th selected sensitive API.
In the above technical solution, a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.
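The Block definition and the matrix B can be sketched as follows. This is a hedged illustration that assumes a Block is delimited by `.method` and `.end method` lines, that the selected APIs are given as `Lpkg/Class;->method` strings, and that an API co-occurs with itself (so the diagonal of B is set when the API appears at all). The names `split_blocks` and `build_B` are hypothetical.

```python
import re

# Captures "Lpkg/Class;->method" from a smali invoke instruction (assumed pattern).
INVOKE_RE = re.compile(
    r"^\s*invoke-[a-z]+(?:/range)?\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)"
)

def split_blocks(smali_text):
    """Return the lines of each Block, i.e. each .method/.end method body."""
    blocks, current = [], None
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = []
        elif stripped.startswith(".end method"):
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None:
            current.append(stripped)
    return blocks

def build_B(smali_texts, apis):
    """B[i][j] = 1 if apis[i] and apis[j] are invoked inside the same Block."""
    index = {api: k for k, api in enumerate(apis)}
    size = len(apis)
    B = [[0] * size for _ in range(size)]
    for text in smali_texts:
        for block in split_blocks(text):
            present = set()
            for line in block:
                match = INVOKE_RE.match(line)
                if match and match.group(1) in index:
                    present.add(index[match.group(1)])
            for i in present:
                for j in present:
                    B[i][j] = 1
    return B

# Toy input, made up for illustration only.
TEXT = """
.method public a()V
    invoke-static {v0}, La/X;->f(I)V
    invoke-static {v0}, La/Y;->g(I)V
.end method
.method public b()V
    invoke-virtual {v1}, Lb/Z;->h()V
.end method
"""
APIS = ["La/X;->f", "La/Y;->g", "Lb/Z;->h"]
B = build_B([TEXT], APIS)
```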
In the above technical solution, the step S2 is implemented by the following specific steps:
S2-1: selecting a part of the Android software samples as control samples, and constructing the relationship matrix A0 between the APPs and the APIs of the control samples;
S2-2: selecting a part of the Android software samples as training samples, and constructing the relationship matrix A1 between the APPs and the APIs of the training samples;
S2-3: selecting a part of the Android software samples as test samples, and constructing the relationship matrix A2 between the APPs and the APIs of the test samples;
S2-4: constructing the relationship matrices B, P, and I using all Android software samples.
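Steps S2-1 to S2-3 can be sketched as below, assuming each APP has already been reduced to the set of selected APIs it invokes. How the samples are partitioned is not specified in the source; splitting by list position, and the names `build_A` and `split_samples`, are assumptions made for illustration.

```python
def build_A(app_api_sets, apis):
    """A[i][j] = 1 if the i-th APP invokes the j-th selected API, else 0."""
    return [[1 if api in used else 0 for api in apis] for used in app_api_sets]

def split_samples(apps, n_control, n_train):
    """Partition the samples by position (an assumed, simplistic strategy)."""
    control = apps[:n_control]
    train = apps[n_control:n_control + n_train]
    test = apps[n_control + n_train:]
    return control, train, test

# Toy data, made up for illustration only.
APIS = ["api_0", "api_1", "api_2"]
APPS = [{"api_0"}, {"api_1", "api_2"}, {"api_0", "api_2"}, set(), {"api_1"}]

control, train, test = split_samples(APPS, n_control=2, n_train=2)
A0 = build_A(control, APIS)   # control samples
A1 = build_A(train, APIS)     # training samples
A2 = build_A(test, APIS)      # test samples
```

All three matrices share the same N columns, one per selected API, which is what lets the same API indices i and j be used across A0, A1, and A2.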
In the above technical solution, the heterogeneous information network composed of different entities and relationships in step S3 means:
There are two types of entities, namely Android applications (APPs), represented by APK files, and API calls; there are four different relationships, namely R0, R1, R2, and R3.
The two entity types and the four relationships between them together form a heterogeneous information network $G = \langle V, E \rangle$, where V is the set of nodes, whose types are limited to the two entity types APP and API, and E is the set of connections between nodes, whose types are limited to the four relationships R0, R1, R2, and R3.
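One way to picture $G = \langle V, E \rangle$ is as a list of typed nodes plus a list of typed edges derived from the four relationship matrices. The representation below is an assumption chosen for readability (the source does not prescribe a data structure), and `build_hin` is a hypothetical name. Self-loops between an API and itself are skipped here by choice.

```python
def build_hin(A, B, P, I):
    """Build typed nodes and edges from relationship matrices A, B, P, and I."""
    n_apps, n_apis = len(A), len(B)
    nodes = [("APP", a) for a in range(n_apps)] + [("API", k) for k in range(n_apis)]
    edges = []
    # R0: APP contains API.
    for a in range(n_apps):
        for k in range(n_apis):
            if A[a][k] == 1:
                edges.append((("APP", a), "R0", ("API", k)))
    # R1 (same Block), R2 (same Package), R3 (same Invoke-method): API to API.
    for relation, matrix in (("R1", B), ("R2", P), ("R3", I)):
        for i in range(n_apis):
            for j in range(n_apis):
                if i != j and matrix[i][j] == 1:
                    edges.append((("API", i), relation, ("API", j)))
    return nodes, edges

# Toy matrices, made up for illustration only.
A = [[1, 0], [1, 1]]
B = [[1, 1], [1, 1]]
P = [[1, 0], [0, 1]]
I = [[1, 1], [1, 1]]
nodes, edges = build_hin(A, B, P, I)
```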
In the above technical solution, the method for generating different feature matrices according to different meta-paths in the heterogeneous information network in step S3 is as follows:
The first meta-path means that an API in one APP and an API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path;
For the training set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (B_{ij} = 1)$. These entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples appear in the same Block: 1 indicates that they do, and 0 that they do not.
The second meta-path means that an API in one APP and an API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path;
For the training set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (P_{ij} = 1)$. These entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples belong to the same Package: 1 indicates that they do, and 0 that they do not;
The third meta-path means that an API in one APP and an API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path;
For the training set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (I_{ij} = 1)$. These entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples use the same Invoke-method: 1 indicates that they do, and 0 that they do not.
In the above expressions for $M_B$, $M_P$, and $M_I$, N denotes the number of sensitive APIs, and the two subscripts of the relationship matrices A0, A1, A2, B, P, and I denote the corresponding row and column of the relationship matrix, respectively.
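The three formulas share one shape, so a single routine can fill any of the feature matrices. The sketch below assumes zero-based indices and matrices stored as lists of lists; `meta_path_features` is a hypothetical name. Passing A1 produces the training-set matrix and passing A2 the test-set matrix, while R stands for B, P, or I.

```python
def meta_path_features(A0, Ax, R):
    """Fill M(N*n + j, N*m + i) = (A0[m][i] == 1) and (Ax[n][j] == 1) and (R[i][j] == 1).

    A0: control-sample APP/API matrix; Ax: A1 (training) or A2 (test);
    R: one of the API/API relationship matrices B, P, or I.
    """
    N = len(R)                       # number of selected APIs
    rows, cols = len(Ax) * N, len(A0) * N
    M = [[0] * cols for _ in range(rows)]
    for n, sample_row in enumerate(Ax):
        for j in range(N):
            if sample_row[j] != 1:
                continue
            for m, control_row in enumerate(A0):
                for i in range(N):
                    if control_row[i] == 1 and R[i][j] == 1:
                        M[N * n + j][N * m + i] = 1
    return M

# Toy matrices, made up for illustration only.
A0 = [[1, 0]]            # one control APP that uses API 0
A1 = [[0, 1], [1, 1]]    # two training APPs
B = [[0, 1], [1, 0]]     # API 0 and API 1 share a Block
M_B = meta_path_features(A0, A1, B)
```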
In the above technical solution, the combination of different feature matrices in step S4 includes:
A total of seven feature matrix combinations: $M_B$ alone, $M_P$ alone, $M_I$ alone, the combination of $M_B$ and $M_P$, the combination of $M_B$ and $M_I$, the combination of $M_P$ and $M_I$, and the combination of all three feature matrices $M_B$, $M_P$, and $M_I$.
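The seven combinations are simply the non-empty subsets of the three feature matrices, which can be enumerated as follows (a small sketch; the string labels and the function name are illustrative only).

```python
from itertools import combinations

FEATURE_NAMES = ("M_B", "M_P", "M_I")

def feature_matrix_combinations(names=FEATURE_NAMES):
    """Enumerate every non-empty subset of the feature matrices."""
    return [
        subset
        for size in range(1, len(names) + 1)
        for subset in combinations(names, size)
    ]

for combo in feature_matrix_combinations():
    print(" + ".join(combo))
```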
In the above technical solution, the step S5 is implemented by the following specific steps:
S5-1: the three feature matrices $M_B$, $M_P$, and $M_I$, constructed in step S3 from the test samples and the control samples, are combined in the different ways, and each combined feature matrix is used as input to the corresponding multiple kernel learning model built in step S4 in order to test the accuracy of that model;
S5-2: the accuracies of the multiple kernel learning models obtained in step S5-1 are compared, and a model whose accuracy meets the threshold is selected as the final Android malicious program detection classifier.
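Step S5-2 can be sketched as below. The source says only that a model meeting the threshold is selected; picking the most accurate of the eligible models is an assumption, and the labels, predictions, and threshold are toy values rather than results from the patent.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def select_classifier(accuracies, threshold):
    """Return the name of the most accurate model that meets the threshold, or None."""
    eligible = {name: acc for name, acc in accuracies.items() if acc >= threshold}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

# Toy labels and predictions, made up for illustration only.
ACTUAL = [1, 1, 0, 0]
PREDICTIONS = {
    "M_B": [1, 0, 0, 0],
    "M_B+M_P": [1, 1, 0, 0],
    "M_I": [0, 0, 1, 1],
}
ACCURACIES = {name: accuracy(pred, ACTUAL) for name, pred in PREDICTIONS.items()}
chosen = select_classifier(ACCURACIES, threshold=0.7)
```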
A system for detecting Android malicious programs based on a heterogeneous information network comprises:
a preprocessing module: decompiles the APK files to obtain smali files, and counts, across all the smali files, the API calls to be used;
a relationship matrix construction module: extracts API calls from the smali files of all Android samples and constructs four relationship matrices according to the different API relationships;
a meta-path construction module: the four relationship matrices contain two entity types and four relationships between them, which together form a heterogeneous information network; a feature matrix is constructed for each of the three different meta-paths in the heterogeneous information network;
a multiple kernel learning modeling module: uses the different feature matrix combinations as input and trains a multiple kernel learning model for each combination;
a classifier: tests the multiple kernel learning models trained with the different feature matrix combinations on a test data set, and selects a model whose accuracy reaches a threshold as the final Android malicious program detection classifier.
Compared with the prior art, the invention has the beneficial effects that:
First, the APIs used in the present invention are sensitive APIs. Compared with ordinary APIs, sensitive APIs are used more often by malware to perform sensitive or malicious operations. Sensitive APIs are therefore more representative of the malicious behavior of software, and they may appear less often in benign Android applications. Thanks to this property, the present invention can characterize the behavior of malware with a smaller number of APIs;
Second, the meta-path information used by the invention describes the API relationships among different APPs at a finer granularity. A traditional meta-path can only show how many APIs of different APPs appear in the same Block, whereas the feature matrix constructed from a meta-path in the present invention describes exactly which APIs of different APPs appear in the same Block. Similarly, it describes which APIs of different APPs belong to the same Package and which use the same Invoke-method. This fine-grained feature information describes the API relationships among different APPs in detail, so the learning model can capture more of the differences between Android malware and benign software and among different Android malware families, which improves the accuracy of the Android malicious program detection classifier;
Third, resistance to code obfuscation is stronger.
Drawings
FIG. 1 is a block diagram of the system architecture of the present invention;
FIG. 2 is a schematic flow chart of the method for obtaining the Android malicious program detection classification model.
Detailed Description
The invention is further illustrated by the following specific examples.
The invention provides a method for detecting Android malicious programs based on a heterogeneous information network, comprising the following steps:
S1: in a preprocessing module, decompiling all APK files to obtain smali files, counting the number of times each API is called across all the smali files, and selecting the N most frequently called sensitive APIs;
S2: a relationship matrix construction module extracts sensitive API calls from all the smali files and constructs four relationship matrices according to the different relationships among sensitive APIs;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths;
S4: using different combinations of the feature matrices to train multiple kernel learning models;
S5: testing the multiple kernel learning models trained with the different feature matrix combinations on a test data set, and selecting a model whose accuracy reaches a threshold as the final Android malicious program detection classifier.
In the present invention, the step S1 is implemented by the following specific steps:
S1-1: decompressing the APK files of all collected Android samples to obtain dex files;
S1-2: decompiling the dex files to obtain smali code files;
S1-3: counting the number of times each API is called across all the smali files, and selecting the N most frequently called sensitive APIs. A small number of sensitive APIs is used here because the length of the feature vector grows with the square of the number of sensitive APIs; at the same time, sensitive APIs represent the behavior of Android malware better than ordinary APIs.
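The quadratic growth can be made concrete with a rough count. The following assumes, which the source does not state explicitly, that the N rows belonging to one APP in a feature matrix indexed as $(N \cdot n + j,\; N \cdot m + i)$ are concatenated into a single feature vector. $K$ is a hypothetical symbol for the number of control samples, and the numbers are illustrative only.

```latex
\[
\text{length per APP}
  = \underbrace{N}_{\text{rows } j}
    \times
    \underbrace{N \cdot K}_{\text{columns } (m,\, i)}
  = K N^{2},
\qquad
\text{e.g. } N = 200,\; K = 50
  \;\Rightarrow\; 50 \cdot 200^{2} = 2\,000\,000 .
\]
```

Under these assumptions, halving N reduces the vector length by a factor of four, which is the motivation for keeping the set of sensitive APIs small.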
In the present invention, the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and sensitive APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains sensitive $\mathrm{API}_j$: $a_{ij} = 1$ means that $\mathrm{APP}_i$ contains sensitive $\mathrm{API}_j$, and $a_{ij} = 0$ means that it does not. This relationship is denoted R0;
B: the relationship matrix B represents a relationship between sensitive APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not. This relationship is denoted R1;
P: the relationship matrix P represents a relationship between sensitive APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not. This relationship is denoted R2;
I: the relationship matrix I represents a relationship between sensitive APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether sensitive $\mathrm{API}_i$ and sensitive $\mathrm{API}_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not. This relationship is denoted R3.
Here, $\mathrm{APP}_i$ denotes the i-th Android sample application and $\mathrm{API}_i$ denotes the i-th selected sensitive API.
In the invention, a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.
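The matrices P and I can be sketched as follows. Two interpretations are assumed here and are not confirmed by the source: the Package of an API is taken to be the package part of its smali class descriptor, and "Invoke-method" is taken to be the kind of invoke instruction (for example `invoke-static` or `invoke-virtual`) with which the API is called. The function names and the toy data are hypothetical.

```python
def package_of(api):
    """Package part of 'Lpkg/sub/Class;->method', e.g. 'pkg/sub'."""
    class_descriptor = api.split(";->", 1)[0]        # 'Lpkg/sub/Class'
    return class_descriptor[1:].rpartition("/")[0]   # drop 'L' and the class name

def build_P(apis):
    """P[i][j] = 1 if apis[i] and apis[j] belong to the same Package."""
    packages = [package_of(api) for api in apis]
    return [[1 if p == q else 0 for q in packages] for p in packages]

def build_I(apis, invoke_kinds):
    """I[i][j] = 1 if apis[i] and apis[j] were called with a common invoke kind.

    invoke_kinds maps each API to the set of invoke instructions observed for it.
    """
    return [
        [1 if invoke_kinds[a] & invoke_kinds[b] else 0 for b in apis]
        for a in apis
    ]

# Toy data, made up for illustration only.
APIS = [
    "Landroid/util/Log;->d",
    "Landroid/util/Log;->e",
    "Landroid/telephony/SmsManager;->sendTextMessage",
]
KINDS = {
    APIS[0]: {"invoke-static"},
    APIS[1]: {"invoke-static"},
    APIS[2]: {"invoke-virtual"},
}
P = build_P(APIS)
I = build_I(APIS, KINDS)
```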
In the present invention, in order to keep the feature matrices constructed from the training set and from the test set at the same dimension, and also to ensure the consistency of the data, a fixed set of Android software samples is additionally selected as control samples. Step S2 is therefore implemented by the following specific steps:
S2-1: selecting a part of the Android software samples as control samples, and constructing the relationship matrix A0 between the APPs and the sensitive APIs of the control samples;
S2-2: selecting a part of the Android software samples as training samples, and constructing the relationship matrix A1 between the APPs and the sensitive APIs of the training samples;
S2-3: selecting a part of the Android software samples as test samples, and constructing the relationship matrix A2 between the APPs and the sensitive APIs of the test samples;
S2-4: constructing the relationship matrices B, P, and I using all Android software samples.
In the present invention, the heterogeneous information network composed of different entities and relationships in step S3 refers to:
There are two types of entities, namely Android applications (APPs), represented by APK files, and sensitive API calls; there are four different relationships, namely R0, R1, R2, and R3.
The two entity types and the four relationships between them together form a heterogeneous information network $G = \langle V, E \rangle$, where V is the set of nodes, whose types are limited to the two entity types APP and sensitive API, and E is the set of connections between nodes, whose types are limited to the four relationships R0, R1, R2, and R3.
In the present invention, the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:
The first meta-path means that a sensitive API in one APP and a sensitive API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path;
For the training set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (B_{ij} = 1)$. These entries indicate whether the j-th sensitive API in the n-th APP of the training or test samples and the i-th sensitive API in the m-th APP of the control samples appear in the same Block: 1 indicates that they do, and 0 that they do not.
The second meta-path means that a sensitive API in one APP and a sensitive API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path;
For the training set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (P_{ij} = 1)$. These entries indicate whether the j-th sensitive API in the n-th APP of the training or test samples and the i-th sensitive API in the m-th APP of the control samples belong to the same Package: 1 indicates that they do, and 0 that they do not;
The third meta-path means that a sensitive API in one APP and a sensitive API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path;
For the training set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (I_{ij} = 1)$. These entries indicate whether the j-th sensitive API in the n-th APP of the training or test samples and the i-th sensitive API in the m-th APP of the control samples use the same Invoke-method: 1 indicates that they do, and 0 that they do not.
In the above expressions for $M_B$, $M_P$, and $M_I$, N denotes the number of sensitive APIs, and the two subscripts of the matrices A0, A1, A2, B, P, and I denote the corresponding row and column of the matrix, respectively.
Here, the feature matrices $M_B$, $M_P$, and $M_I$ of both the training and the test data sets are constructed relative to the control samples, so the range of m is the same for the training set and the test set, which keeps the dimension of the feature matrices unchanged.
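The role of the control samples in fixing the dimension can be checked with a small sketch. It assumes zero-based indices and, as a further assumption not stated in the source, that the N rows belonging to one APP are concatenated into that APP's feature vector. The function names and toy matrices are hypothetical.

```python
def meta_path_features(A0, Ax, R):
    """M(N*n + j, N*m + i) = (A0[m][i] == 1) and (Ax[n][j] == 1) and (R[i][j] == 1)."""
    N = len(R)
    M = [[0] * (len(A0) * N) for _ in range(len(Ax) * N)]
    for n, sample_row in enumerate(Ax):
        for j in range(N):
            for m, control_row in enumerate(A0):
                for i in range(N):
                    if control_row[i] == 1 and sample_row[j] == 1 and R[i][j] == 1:
                        M[N * n + j][N * m + i] = 1
    return M

def per_app_vectors(M, N):
    """Concatenate the N rows of each APP into one feature vector (assumed layout)."""
    return [
        [value for row in M[n * N:(n + 1) * N] for value in row]
        for n in range(len(M) // N)
    ]

# Toy matrices, made up for illustration only.
A0 = [[1, 0], [1, 1]]            # 2 control APPs
A1 = [[0, 1], [1, 1], [1, 0]]    # 3 training APPs
A2 = [[1, 1]]                    # 1 test APP
B = [[1, 1], [1, 1]]
N = len(B)

train_vectors = per_app_vectors(meta_path_features(A0, A1, B), N)
test_vectors = per_app_vectors(meta_path_features(A0, A2, B), N)
```

Although the training and test sets contain different numbers of APPs, every vector has the same length, because the columns are indexed only by the control samples and the selected APIs.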
In the present invention, the different feature matrix combinations in step S4 include:
A total of seven feature matrix combinations: $M_B$ alone, $M_P$ alone, $M_I$ alone, the combination of $M_B$ and $M_P$, the combination of $M_B$ and $M_I$, the combination of $M_P$ and $M_I$, and the combination of all three feature matrices $M_B$, $M_P$, and $M_I$.
In the present invention, the step S5 is implemented by the following specific steps:
S5-1: the three feature matrices $M_B$, $M_P$, and $M_I$, constructed in step S3 from the test samples and the control samples, are combined in the different ways, and each combined feature matrix is used as input to the corresponding multiple kernel learning model built in step S4 in order to test the accuracy of that model;
S5-2: the accuracies of the multiple kernel learning models obtained in step S5-1 are compared, and a model whose accuracy meets the threshold is selected as the final Android malicious program detection classifier.
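The source does not name a specific multiple kernel learning algorithm, so the following is only a simplified stand-in for steps S4 and S5-1: one linear kernel per feature matrix, combined with fixed uniform weights (a real multiple kernel learning solver would learn the weights), and a kernel perceptron in place of whatever classifier the inventors used. All names and data are hypothetical.

```python
def linear_kernel(X, Y):
    """Gram matrix of dot products between the rows of X and the rows of Y."""
    return [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices with identical shapes."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(w * K[r][c] for w, K in zip(weights, kernels)) for c in range(cols)]
        for r in range(rows)
    ]

def train_kernel_perceptron(K, y, epochs=10):
    """Learn mistake counts alpha on a precomputed training kernel K."""
    alpha = [0] * len(y)
    for _ in range(epochs):
        for t in range(len(y)):
            score = sum(alpha[s] * y[s] * K[s][t] for s in range(len(y)))
            if y[t] * score <= 0:
                alpha[t] += 1
    return alpha

def predict(alpha, y_train, K_test_train):
    """Classify each test row from its kernel values against the training set."""
    return [
        1 if sum(a * label * k for a, label, k in zip(alpha, y_train, row)) > 0 else -1
        for row in K_test_train
    ]

# Toy per-APP feature vectors for two views (e.g. derived from M_B and M_P).
XB_TRAIN = [[1, 0], [1, 0], [0, 1], [0, 1]]
XP_TRAIN = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
Y_TRAIN = [1, 1, -1, -1]          # +1 malicious, -1 benign (toy labels)
XB_TEST = [[1, 0], [0, 1]]
XP_TEST = [[1, 0, 0], [0, 0, 1]]
WEIGHTS = [0.5, 0.5]              # fixed uniform weights, an assumption

K_train = combine_kernels(
    [linear_kernel(XB_TRAIN, XB_TRAIN), linear_kernel(XP_TRAIN, XP_TRAIN)], WEIGHTS
)
K_test = combine_kernels(
    [linear_kernel(XB_TEST, XB_TRAIN), linear_kernel(XP_TEST, XP_TRAIN)], WEIGHTS
)
alpha = train_kernel_perceptron(K_train, Y_TRAIN)
predictions = predict(alpha, Y_TRAIN, K_test)
```

Repeating this for each of the seven feature matrix combinations and measuring accuracy on the test samples gives the figures that step S5-2 compares.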
The present embodiments are to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description. All the technical solutions formed by the transformation or the equivalent substitution fall within the protection scope of the present invention.
Claims (8)
1. A method for detecting android malicious programs based on a heterogeneous information network is characterized by comprising the following steps:
S1: in a preprocessing module, decompiling all APK files to obtain smali files, counting the number of times each API is called across all the smali files, and selecting the N most frequently called APIs;
S2: a relationship matrix construction module extracts API calls from all the smali files and constructs four relationship matrices according to the different API relationships;
S3: the entities and the different relationships in the relationship matrices form a heterogeneous information network, and a meta-path construction module generates different feature matrices in the heterogeneous information network according to different meta-paths;
S4: using different combinations of the feature matrices to train multiple kernel learning models;
S5: testing the multiple kernel learning models trained with the different feature matrix combinations on a test data set, and selecting a model whose accuracy reaches a threshold as the final Android malicious program detection classifier;
the four relationship matrices in step S2 include:
A: the relationship matrix A represents the relationship between APPs and APIs. $A_{ij} = a_{ij} \in \{0,1\}$ indicates whether $\mathrm{APP}_i$ contains $\mathrm{API}_j$: $a_{ij} = 1$ means that $\mathrm{APP}_i$ contains $\mathrm{API}_j$, and $a_{ij} = 0$ means that it does not. This relationship is denoted R0;
B: the relationship matrix B represents a relationship between APIs. $B_{ij} = b_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ appear in the same Block: $b_{ij} = 1$ means that they appear in the same Block, and $b_{ij} = 0$ means that they do not. This relationship is denoted R1;
P: the relationship matrix P represents a relationship between APIs. $P_{ij} = p_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ belong to the same Package: $p_{ij} = 1$ means that they belong to the same Package, and $p_{ij} = 0$ means that they do not. This relationship is denoted R2;
I: the relationship matrix I represents a relationship between APIs. $I_{ij} = i_{ij} \in \{0,1\}$ indicates whether $\mathrm{API}_i$ and $\mathrm{API}_j$ use the same Invoke-method: $i_{ij} = 1$ means that they use the same Invoke-method, and $i_{ij} = 0$ means that they do not. This relationship is denoted R3;
wherein $\mathrm{APP}_i$ denotes the i-th Android sample application and $\mathrm{API}_i$ denotes the i-th selected sensitive API;
the method for generating different feature matrices according to different meta paths in the heterogeneous information network in step S3 includes:
the first meta-path means that an API in one APP and an API in another APP appear in the same Block; $M_B$ denotes the feature matrix generated from this meta-path;
for the training set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (B_{ij} = 1)$; for the test set, $M_B(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (B_{ij} = 1)$; these entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples appear in the same Block, where 1 indicates that they do and 0 that they do not;
the second meta-path means that an API in one APP and an API in another APP belong to the same Package; $M_P$ denotes the feature matrix generated from this meta-path;
for the training set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (P_{ij} = 1)$; for the test set, $M_P(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (P_{ij} = 1)$; these entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples belong to the same Package, where 1 indicates that they do and 0 that they do not;
the third meta-path means that an API in one APP and an API in another APP use the same Invoke-method; $M_I$ denotes the feature matrix generated from this meta-path;
for the training set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A1_{nj} = 1) \wedge (I_{ij} = 1)$; for the test set, $M_I(N \cdot n + j,\; N \cdot m + i) = (A0_{mi} = 1) \wedge (A2_{nj} = 1) \wedge (I_{ij} = 1)$; these entries indicate whether the j-th API in the n-th APP of the training or test samples and the i-th API in the m-th APP of the control samples use the same Invoke-method, where 1 indicates that they do and 0 that they do not;
in the above expressions for $M_B$, $M_P$, and $M_I$, N denotes the number of sensitive APIs, and the two subscripts of the relationship matrices A0, A1, A2, B, P, and I denote the corresponding row and column of the relationship matrix, respectively.
2. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S1 is implemented by the following specific steps:
S1-1: decompressing the APK files of all collected Android samples to obtain dex files;
S1-2: decompiling the dex files to obtain smali code files;
S1-3: counting the number of times each API is called across all the smali files, and selecting the N most frequently called APIs.
3. The method for detecting the android malicious program based on the heterogeneous information network of claim 2, wherein a Block refers to the smali code between a matching pair of ".method" and ".end method" directives in a smali file.
4. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S2 is implemented by the following specific steps:
S2-1: selecting a part of the android software samples as control samples, and constructing a relationship matrix A0 between the APPs and the APIs of the control samples;
S2-2: selecting a part of the android software samples as training samples, and constructing a relationship matrix A1 between the APPs and the APIs of the training samples;
S2-3: selecting a part of the android software samples as test samples, and constructing a relationship matrix A2 between the APPs and the APIs of the test samples;
S2-4: constructing the relationship matrices B, P and I using all the android software samples.
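The APP-API matrices of steps S2-1 to S2-3 can be sketched as 0/1 rows over a fixed API ordering. The following Python sketch assumes each APP has already been reduced to the set of selected APIs it calls; the names `build_app_api_matrix`, `API_INDEX` and the three toy APPs are hypothetical, and how the samples are split into the three groups is not specified here.

```python
from typing import Dict, List, Set


def build_app_api_matrix(
    apps: List[str],
    apis_used: Dict[str, Set[str]],
    api_index: List[str],
) -> List[List[int]]:
    """Row k is apps[k]; column i is api_index[i]; 1 means the APP calls the API."""
    return [
        [1 if api in apis_used[app] else 0 for api in api_index]
        for app in apps
    ]


# Hypothetical data: three selected APIs and one APP per sample group.
API_INDEX = ["api_a", "api_b", "api_c"]
APIS_USED = {
    "ctrl_app": {"api_a", "api_b"},
    "train_app": {"api_b"},
    "test_app": {"api_a", "api_c"},
}
A0 = build_app_api_matrix(["ctrl_app"], APIS_USED, API_INDEX)   # control samples
A1 = build_app_api_matrix(["train_app"], APIS_USED, API_INDEX)  # training samples
A2 = build_app_api_matrix(["test_app"], APIS_USED, API_INDEX)   # test samples
```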
5. The method for android malware detection based on heterogeneous information network of claim 3, wherein in step S3 the heterogeneous information network is constructed from the entities and the different relationships in the relationship matrices as follows:
the entities are of two types, namely the android application program APP, represented by an APK file, and the API call; there are four different relationships, R0, R1, R2 and R3;
the two entity types and the four relationships between them jointly form a heterogeneous information network G = <V, E>, wherein V is the node set, whose node types are limited to the two entities APP and API; and E is the set of connections among the nodes, whose relationship types are limited to R0, R1, R2 and R3.
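A network with typed nodes and typed edges of this kind can be represented with a small data structure, as in the Python sketch below. The class name `HeterogeneousNetwork`, the tuple encoding of nodes and the endpoint table are assumptions for illustration; the table only encodes that R0 links an APP to an API while R1, R2 and R3 link two APIs.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Node = Tuple[str, str]  # (node type, identifier), node type is "APP" or "API"

# Allowed (source type, target type) for each relationship.
RELATION_ENDPOINTS = {
    "R0": ("APP", "API"),
    "R1": ("API", "API"),
    "R2": ("API", "API"),
    "R3": ("API", "API"),
}


@dataclass
class HeterogeneousNetwork:
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Tuple[Node, str, Node]] = field(default_factory=set)

    def add_edge(self, src: Node, relation: str, dst: Node) -> None:
        """Add a typed edge, rejecting relations or endpoints outside the schema."""
        if relation not in RELATION_ENDPOINTS:
            raise ValueError(f"unknown relation {relation!r}")
        if (src[0], dst[0]) != RELATION_ENDPOINTS[relation]:
            raise ValueError(f"{relation} cannot connect {src[0]} to {dst[0]}")
        self.nodes.update((src, dst))
        self.edges.add((src, relation, dst))


# Hypothetical toy network: one APP and two APIs.
G = HeterogeneousNetwork()
G.add_edge(("APP", "app_1"), "R0", ("API", "api_a"))
G.add_edge(("API", "api_a"), "R1", ("API", "api_b"))
```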
6. The method for android malware detection based on heterogeneous information network of claim 1, wherein the different feature matrix combinations of step S4 include:
there are seven feature matrix combinations in total: M_B alone, M_P alone, M_I alone, the combination of M_B and M_P, the combination of M_B and M_I, the combination of M_P and M_I, and the combination of all three feature matrices M_B, M_P and M_I.
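The count of seven follows from taking every non-empty subset of the three feature matrices (3 + 3 + 1). A short Python sketch, with the matrix names used as plain labels, enumerates them:

```python
from itertools import combinations

NAMES = ("M_B", "M_P", "M_I")

# All non-empty subsets: sizes 1, 2 and 3.
COMBOS = [combo for k in range(1, len(NAMES) + 1) for combo in combinations(NAMES, k)]
```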
7. The method for detecting the android malicious program based on the heterogeneous information network of claim 1, wherein the step S5 is implemented by the following specific steps:
S5-1: the three feature matrices M_B, M_P and M_I, constructed in step S3 from the test samples and the control samples, are combined into the different feature matrix combinations; each combination is used as the input of the corresponding multi-core learning model built in step S4, and the accuracy of each model is tested;
S5-2: comparing the accuracies of the multi-core learning models obtained in step S5-1, and selecting a model whose accuracy meets the threshold as the final android malicious program detection classifier.
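The comparison in step S5-2 can be sketched as follows in Python. The sketch does not train the multi-core learning models (the term presumably corresponds to multiple kernel learning); it only scores given predictions and applies a threshold. Choosing the most accurate model among those that meet the threshold is an assumption of this example, and the labels, predictions and threshold are made-up toy values, not results.

```python
from typing import Dict, List, Optional, Tuple

Combo = Tuple[str, ...]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of test samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def select_classifier(scores: Dict[Combo, float], threshold: float) -> Optional[Combo]:
    """Return the best-scoring combination that meets the threshold, or None."""
    qualifying = {combo: acc for combo, acc in scores.items() if acc >= threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


# Toy labels and predictions (hypothetical, for illustration only).
Y_TRUE = [1, 0, 1, 1]
PREDICTIONS = {
    ("M_B",): [1, 0, 0, 0],
    ("M_B", "M_P"): [1, 0, 0, 1],
    ("M_B", "M_P", "M_I"): [1, 0, 1, 1],
}
SCORES = {combo: accuracy(Y_TRUE, pred) for combo, pred in PREDICTIONS.items()}
BEST = select_classifier(SCORES, threshold=0.7)
```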
8. A system for android malicious program detection based on a heterogeneous information network is characterized by comprising,
a preprocessing module: responsible for decompiling the APK files to obtain smali files, and for determining, from all the smali files, the API calls to be used;
a relationship matrix construction module: extracting the API calls from the smali files of all android samples, and constructing four relationship matrices according to the different API relationships;
a meta path construction module: the four relationship matrices contain two entities and four relationships between the entities, which together form a heterogeneous information network, and a feature matrix is constructed for each of three different meta paths in the heterogeneous information network;
a multi-core learning modeling module: using the different feature matrix combinations as the inputs of multi-core learning models, and building one multi-core learning model for each combination;
a classifier: testing, with a test data set, the multi-core learning models built from the different feature matrix combinations, and selecting a model whose accuracy reaches a threshold as the final android malicious program detection classifier.
The four relationship matrices include:
a: the relationship matrix A represents the relationship between APPs and APIs; A_ij = a_ij ∈ {0,1} indicates whether APP_i contains API_j: if a_ij = 1, APP_i contains API_j, otherwise it does not; this relationship is denoted R0;
b: the relationship matrix B represents a relationship between APIs; B_ij = b_ij ∈ {0,1} indicates whether API_i and API_j are in the same Block: if b_ij = 1, API_i and API_j are in the same Block, otherwise they are not; this relationship is denoted R1;
p: the relationship matrix P represents a relationship between APIs; P_ij = p_ij ∈ {0,1} indicates whether API_i and API_j belong to the same Package: if p_ij = 1, API_i and API_j belong to the same Package, otherwise they do not; this relationship is denoted R2;
i: the relationship matrix I represents a relationship between APIs; I_ij = i_ij ∈ {0,1} indicates whether API_i and API_j use the same Invoke-method: if i_ij = 1, API_i and API_j use the same Invoke-method, otherwise they do not; this relationship is denoted R3;
wherein APP_i denotes the ith android sample software and API_i denotes the ith selected sensitive API;
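The three API-API matrices share one pattern: two APIs are related when they fall into a common group (a Block, a Package, or an invoke kind). The Python sketch below builds all three from that pattern. It rests on several assumptions that are not stated in the claims: APIs are written as smali descriptors such as `Lpkg/Class;->method`, the Package is read from the class descriptor, "Invoke-method" is interpreted as the invoke instruction kind (for example `invoke-static`), and each API is treated as related to itself. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def cooccurrence_matrix(api_index: List[str], groups: Iterable[Set[str]]) -> List[List[int]]:
    """rel[i][j] = 1 when API i and API j share at least one group."""
    pos = {api: k for k, api in enumerate(api_index)}
    size = len(api_index)
    rel = [[0] * size for _ in range(size)]
    for group in groups:
        members = [pos[api] for api in group if api in pos]
        for i in members:
            for j in members:
                rel[i][j] = 1
    return rel


def package_of(api: str) -> str:
    """'Landroid/telephony/SmsManager;->getDefault' -> 'android/telephony'."""
    class_name = api.split(";->")[0][1:]  # drop the leading 'L'
    return class_name.rsplit("/", 1)[0]


def groups_from_labels(labels: Dict[str, str]) -> List[Set[str]]:
    """Group APIs that carry the same label (package name or invoke kind)."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for api, label in labels.items():
        grouped[label].add(api)
    return list(grouped.values())


API_INDEX = [
    "Landroid/telephony/SmsManager;->getDefault",
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Ljava/lang/Runtime;->exec",
]
BLOCK_GROUPS = [{API_INDEX[0], API_INDEX[2]}, {API_INDEX[1]}]
INVOKE_KIND = {
    API_INDEX[0]: "invoke-static",
    API_INDEX[1]: "invoke-virtual",
    API_INDEX[2]: "invoke-virtual",
}

REL_B = cooccurrence_matrix(API_INDEX, BLOCK_GROUPS)
REL_P = cooccurrence_matrix(API_INDEX, groups_from_labels({a: package_of(a) for a in API_INDEX}))
REL_I = cooccurrence_matrix(API_INDEX, groups_from_labels(INVOKE_KIND))
```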
the method for generating different feature matrices in the heterogeneous information network according to the different meta paths comprises the following steps:
the meta path for the Block relationship means that an API in one APP and an API in another APP appear in the same Block; M_B denotes the feature matrix generated from this meta path;
the M_B of the training set is M_B(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (B_ij == 1), and the M_B of the test set is M_B(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (B_ij == 1), respectively indicating whether the jth API in the nth APP of the training sample or the test sample and the ith API in the mth APP of the control sample are in the same Block: a value of 1 indicates that they are in the same Block, otherwise they are not;
the meta path for the Package relationship means that an API in one APP and an API in another APP belong to the same Package; M_P denotes the feature matrix generated from this meta path;
the M_P of the training set is M_P(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (P_ij == 1), and the M_P of the test set is M_P(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (P_ij == 1), respectively indicating whether the jth API in the nth APP of the training sample or the test sample and the ith API in the mth APP of the control sample belong to the same Package: a value of 1 indicates that they belong to the same Package, otherwise they do not;
the meta path for the Invoke-method relationship means that an API in one APP and an API in another APP use the same Invoke-method; M_I denotes the feature matrix generated from this meta path;
the M_I of the training set is M_I(N*n+j, N*m+i) = (A0_mi == 1) && (A1_nj == 1) && (I_ij == 1), and the M_I of the test set is M_I(N*n+j, N*m+i) = (A0_mi == 1) && (A2_nj == 1) && (I_ij == 1), respectively indicating whether the jth API in the nth APP of the training sample or the test sample and the ith API in the mth APP of the control sample use the same Invoke-method: a value of 1 indicates that they use the same Invoke-method, otherwise they do not;
in the above expressions for M_B, M_P and M_I, N denotes the number of sensitive APIs, and the two subscripts of the relationship matrices A0, A1, A2, B, P and I denote the corresponding row and column of the relationship matrix, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011206884.8A CN112149124B (en) | 2020-11-02 | 2020-11-02 | Android malicious program detection method and system based on heterogeneous information network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112149124A (en) | 2020-12-29 |
CN112149124B (en) | 2022-04-29 |
Family
ID=73953779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011206884.8A (Active; CN112149124B (en)) | Android malicious program detection method and system based on heterogeneous information network | 2020-11-02 | 2020-11-02 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112149124B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113553446B (en) * | 2021-07-28 | 2022-05-24 | 厦门国际银行股份有限公司 | Financial anti-fraud method and device based on heterograph deconstruction |
CN114491529A (en) * | 2021-12-20 | 2022-05-13 | 西安电子科技大学 | Android malicious application program identification method based on multi-modal neural network |
CN114756860A (en) * | 2022-02-22 | 2022-07-15 | 广州大学 | Malicious software detection method based on meta-path |
CN114662105B (en) * | 2022-03-17 | 2023-03-31 | 电子科技大学 | Method and system for identifying Android malicious software based on graph node relationship and graph compression |
CN114722391B (en) * | 2022-04-07 | 2023-03-28 | 电子科技大学 | Method for detecting android malicious program |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670306A (en) * | 2018-11-27 | 2019-04-23 | 国网山东省电力公司济宁供电公司 | Electric power malicious code detecting method, server and system based on artificial intelligence |
CN109711163A (en) * | 2018-12-26 | 2019-05-03 | 西安电子科技大学 | Android malware detection method based on API Calls sequence |
CN110298173A (en) * | 2018-03-23 | 2019-10-01 | 瞻博网络公司 | The detection Malware hiding by the delay circulation of software program |
CN110348214A (en) * | 2019-07-16 | 2019-10-18 | 电子科技大学 | To the method and system of Malicious Code Detection |
CN110532776A (en) * | 2019-09-05 | 2019-12-03 | 广西大学 | Android malware efficient detection method, system and medium based on runtime data analysis |
CN111163057A (en) * | 2019-12-09 | 2020-05-15 | 中国科学院信息工程研究所 | User identification system and method based on heterogeneous information network embedding algorithm |
CN111316268A (en) * | 2017-09-06 | 2020-06-19 | 分形工业有限公司 | Advanced cyber-security threat mitigation for interbank financial transactions |
KR20200076426A (en) * | 2018-12-19 | 2020-06-29 | 건국대학교 산학협력단 | Method and apparatus for malicious detection based on heterogeneous information network |
CN111523117A (en) * | 2020-04-10 | 2020-08-11 | 西安电子科技大学 | Android malicious software detection and malicious code positioning system and method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8571255B2 (en) * | 2009-01-07 | 2013-10-29 | Dolby Laboratories Licensing Corporation | Scalable media fingerprint extraction |
CN110069927A (en) * | 2019-04-22 | 2019-07-30 | 中国民航大学 | Malice APK detection method, system, data storage device and detection program |
CN110598130B (en) * | 2019-09-30 | 2022-06-24 | 重庆邮电大学 | Movie recommendation method integrating heterogeneous information network and deep learning |
Non-Patent Citations (1)
Title |
---|
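APT attack prediction method based on a tree structure; Zhang Xiaosong et al.; Journal of University of Electronic Science and Technology of China; 2016-07-30; Vol. 45, No. 4; pp. 582-588 * |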
Also Published As
Publication number | Publication date |
---|---|
CN112149124A (en) | 2020-12-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |