CN114491530A

CN114491530A - Android application program classification method based on abstract flow graph and graph neural network

Info

Publication number: CN114491530A
Application number: CN202111566330.3A
Authority: CN
Inventors: 孙聪; 史鉴; 王培丞; 伍亚飞; 马建峰
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-05-13
Anticipated expiration: 2041-12-20
Also published as: CN114491530B

Abstract

The invention discloses an Android application classification method based on an abstract flow graph and a graph neural network, which mainly addresses two problems of the prior art: weak ability to identify unknown malicious applications and inability to accurately identify malicious program behaviors. The implementation scheme is as follows: downloading malicious and benign Android application samples from relevant sample libraries and mainstream domestic and foreign application markets; constructing a list of key application program interfaces (APIs) related to malicious behaviors and vulnerabilities of Android applications; generating a call tracing graph of each Android application, and constructing an abstract flow graph of the application from the call tracing graph and the API list; adding labels to the abstract flow graphs and training a graph neural network (GNN) with the labeled abstract flow graphs; and using the trained GNN to classify Android applications of unknown security as benign or malicious. The invention offers strong generalization capability and high discrimination between the behaviors of benign and malicious applications, and can be used for detecting malicious applications.

Description

Android application program classification method based on abstract flow graph and graph neural network

Technical Field

The invention belongs to the technical field of network security, and particularly relates to an android application program classification method which can be used for detecting malicious application programs.

Background

The Android operating system is a Linux-based operating system used mainly on mobile devices such as smartphones and tablet computers, and is currently the most widely used mobile operating system. In 2021, the Android operating system occupied more than 80% of the smartphone market. Because of this high market share and the open-source nature of the Android system, Android applications have become a key target of malicious attacks by hackers. Malware on Android smartphones has therefore become a major security problem in daily life, seriously threatening the personal information security of users and even national information security.

At present, detection techniques for malicious Android applications fall mainly into static detection and dynamic detection. Static detection extracts features of an application, such as required permissions, Intent information, and sensitive application program interface (API) calls, from the program without running the Android application.

Dynamic detection executes the software in a real environment and records its runtime behavior. Although dynamic detection is accurate on known malware, it cannot identify novel malicious applications and is therefore poorly suited to detecting malware that has not been recorded; it is also time-consuming and complex, which makes it impractical, so static analysis is generally adopted.

Both conventional techniques, static detection and dynamic detection, require security experts to manually define the features to be extracted, which is labor-intensive and error-prone during extraction. In recent years, therefore, more and more Android malware detection approaches have adopted deep learning for automatic detection; owing to its high detection accuracy, deep learning is regarded as a powerful and effective tool in the security field.

In the paper "Permission-Based Android Malware Application Detection Using Multi-Layer Perceptron" by O. S. Jannath Nisha and S. Mary Saira Bhanu (DOI: 10.1007/978-3-030-16660-1_36), the researchers train a multi-layer perceptron using the permissions requested by Android applications as features in order to identify malicious programs. This method works well on samples that differ widely from one another, but lacks generalization ability for complex real-world samples and unknown malicious applications.

Patent application CN201811024430.1 of the Civil Aviation University of China provides an Android malicious application detection method based on a two-channel convolutional neural network. The method first decompiles the installation package (APK) file, extracts the opcode sequence and the instruction function sequence as features, inputs the features into a convolutional neural network for training, and then detects the application under test with the trained network. Because the extracted opcode and instruction function sequences contain much redundant information unrelated to malicious code and vulnerabilities, the method discriminates poorly between benign and malicious application categories.

Disclosure of Invention

The invention aims to provide an android application program classification method based on an abstract flow graph and a graph neural network aiming at the defects of the prior art, so as to effectively distinguish benign and malicious application programs, capture malicious behaviors of unknown malicious programs and improve generalization capability.

The technical idea of the invention is as follows: features that fully reflect the sensitive API call relationships of an Android application and the interaction of sensitive information between its components are extracted from the application, and a graph neural network is trained on these features, which embody the semantics and structure of the program, to identify malicious applications. The implementation scheme comprises the following steps:

(1) downloading malicious and benign Android application samples from relevant sample libraries and mainstream domestic and foreign application markets, wherein the ratio of benign samples to malicious samples is 1:1, the number of samples of each type is not less than A, and A is a positive integer greater than or equal to 1;

(2) constructing a key application program interface API list related to malicious behaviors and vulnerabilities of the Android application program;

(3) generating a call tracing graph of all the software samples in (1);

(3a) decompiling the Android application package (APK) file and using the decompiled result as the input of a modified IntelliDroid tool, so that every call path originating at an application entry point is output as a subgraph of the call tracing graph;

(3b) extracting inter-component interaction ICC information of the Android application program through an ic3 tool, acquiring a function for inter-component interaction ICC through Intent and a component interacting with the function, and accordingly connecting nodes for component communication through Intent in different subgraphs to generate a call tracing graph of the whole Android application program;

(4) generating abstract flow diagrams of all software samples;

(4a) segmenting the nodes of the call tracing graph to generate the nodes of the abstract flow graph;

(4b) performing depth-first traversal on the call tracing graph to obtain a call path, filtering the call path through the key API list in the step (2) to obtain a key edge, extracting an Intent sending edge, an adjacent edge, an ICC edge and an implicit adjacent edge from the call tracing graph, acquiring reverse edges of the 5 edges, adding the 10 edges into the abstract flow graph to be used as an edge set of the abstract flow graph, and deleting isolated nodes which are not connected with the edges in the abstract flow graph;

(4c) adding labels to all nodes and edges to form an abstract flow diagram;

(5) generating an abstract flow diagram of the software sample in the step (1) through the steps (3) to (4), taking the abstract flow diagram as the input of the graph neural network GNN, and training the GNN by using a back propagation algorithm and a gradient descent method to obtain a trained graph neural network GNN;

(6) for an Android application of unknown security, generating the abstract flow graph of the application to be detected by the same method as in steps (3) to (4), inputting the abstract flow graph into the trained graph neural network GNN, outputting the benign and malicious probabilities of the application to be classified, and taking the class with the maximum probability as the final judgment type.

Compared with the prior art, the invention has the following advantages:

Firstly, by constructing the key API list, the invention can extract the sensitive API calling behavior of an Android application more accurately, thereby identifying more malicious behavior characteristics of Android malware and enabling the detection of unknown malicious Android applications.

Secondly, the method and the device can extract and abstract the sensitive API calling behavior of the application program and the inter-component interaction ICC information of the program by generating the abstract flow diagram of the Android application program, can more accurately describe the behavior characteristics of the malicious program, and have high distinguishability on benign application programs and malicious application programs.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a sub-flowchart of generating the call tracing graph in the present invention;

FIG. 3 is a sub-flowchart of generating the abstract flow graph in the present invention;

FIG. 4 is a schematic diagram of an abstract flow graph generated in the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to fig. 1, the implementation steps of the invention are as follows:

Step 1, downloading and collecting benign and malicious Android application samples.

Malicious and benign Android application samples are downloaded from relevant sample libraries and mainstream domestic and foreign application markets, with a benign-to-malicious sample ratio of 1:1.

In this embodiment, 9873 benign Android applications are collected from public websites, and 9873 malicious Android applications are collected from the CICInvesAndMal dataset, the Drebin dataset, the DroidAnalytics dataset, and the VirusShare sample library.

Step 2, constructing a key API list related to malicious behaviors and vulnerabilities of Android applications.

2.1) using web crawler technology, crawling sentences describing Android vulnerabilities from the official CVE website and the Exploit Database website, collecting vulnerable Android code samples from the Stack Overflow website, and establishing a sentence library describing Android vulnerabilities: $\text{malicious\_lib} = \{V_1, V_2, \ldots, V_i, \ldots, V_n\}$, where $V_i$ ($1 \le i \le n$) is the description sentence for the i-th Android vulnerability, and n is the total number of sentences describing Android vulnerabilities;

2.2) performing text mining on malicious_lib, extracting keywords from it, and calculating the term frequency-inverse document frequency (TF-IDF) value P of each keyword:

TF-IDF evaluates the importance of a word to one document in a document set or corpus, and is calculated as follows:

calculating the term frequency TF of a specific keyword in $V_i$:

$$TF = \frac{h}{s}$$

where h is the number of times the keyword occurs in the description sentence $V_i$, and s is the total number of words contained in $V_i$;

calculating the inverse document frequency IDF of the specific keyword, in its standard form:

$$IDF = \log\frac{n}{b}$$

where n is the total number of sentences in malicious_lib and b is the total number of sentences containing the specific keyword;

calculating the TF-IDF value P of the specific keyword in $V_i$ from the term frequency TF and the inverse document frequency IDF:

$$P = TF \times IDF$$

2.3) calculating, by the process of 2.2), the TF-IDF value P of every keyword other than Java keywords, built-in types, and variable names, and sorting the keywords by P in descending order; 10782 keywords are collected and ranked in total;

2.4) selecting the top 150 keywords, which discriminate well between Android application categories, searching the API lists in official online documentation and existing tools with these keywords, and finding all APIs that contain the keywords, finally obtaining a key API list related to Android vulnerabilities and malicious behaviors that contains 632 key APIs.
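The ranking and filtering in step 2 can be illustrated with a minimal sketch. It is not the inventors' implementation: the function names, the toy sentences, the excluded-word set, and the choice to keep each keyword's highest per-sentence score are all assumptions made for illustration, and the IDF is taken in the common form log(n / b), which the text does not spell out.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the Java keywords, built-in types and variable
# names that step 2.3 excludes from the ranking.
EXCLUDED = {"int", "void", "public", "string"}


def tf_idf_scores(sentences):
    """Return the highest per-sentence TF-IDF value of each keyword.

    TF = h / s   (occurrences of the keyword in a sentence / words in it)
    IDF = log(n / b)  (n sentences in total, b of them contain the keyword);
    this IDF form and the max-aggregation are assumptions of the sketch.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    containing = Counter(w for words in tokenized for w in set(words))
    best = {}
    for words in tokenized:
        for word, h in Counter(words).items():
            if word in EXCLUDED:
                continue
            p = (h / len(words)) * math.log(n / containing[word])
            best[word] = max(best.get(word, 0.0), p)
    return best


def top_keywords(scores, k):
    """Sort keywords by score in descending order and keep the first k."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def filter_apis(api_names, keywords):
    """Keep every API whose name contains at least one keyword."""
    return [a for a in api_names if any(k in a.lower() for k in keywords)]


# Toy data, invented for the example.
sentences = [
    "sendTextMessage leaks device id through sms",
    "sms premium number abuse via sendTextMessage",
    "webview loads untrusted url",
]
apis = ["SmsManager.sendTextMessage", "WebView.loadUrl", "Math.abs"]

scores = tf_idf_scores(sentences)
keywords = top_keywords(scores, 3)
key_api_list = filter_apis(apis, keywords)
```

In the embodiment the same idea would be applied with the top 150 of 10782 keywords; the substring match used here is only one plausible reading of "APIs containing the keywords".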

Step 3, generating the call tracing graphs of all the software samples in step 1.

Referring to fig. 2, the specific implementation of this step is as follows:

3.1) decompiling the Android application package (APK) file and using the decompiled result as the input of a modified IntelliDroid tool. The file output by the unmodified IntelliDroid tool contains only the entry point of each call path and the final sensitive API call; the tool is modified to output every complete call path originating at an application entry point, each serving as a subgraph of the call tracing graph;

3.2) extracting inter-component interaction ICC information of the Android application program through an ic3 tool, acquiring a function for inter-component interaction ICC through Intent and a component for interaction with the function, taking the function for inter-component interaction ICC as a starting point of an ICC interaction edge, taking an onCreate function executed by the component when the component is started as an end point of the ICC interaction edge, and connecting nodes for component communication through Intent in different subgraphs through the ICC interaction edge to generate a call tracing graph of the whole Android application program.
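As a rough illustration of step 3.2, the sketch below merges per-entry-point subgraphs and bridges them with ICC interaction edges. The data layout (edge lists of function names, and ICC results reduced to pairs of sending function and target component) is a hypothetical simplification; it does not reflect the actual output format of IntelliDroid or ic3.

```python
def build_call_tracing_graph(subgraphs, icc_links):
    """Merge call-path subgraphs and connect them with ICC interaction edges.

    subgraphs: list of edge lists [(caller, callee), ...], one per entry point.
    icc_links: list of (sending_function, target_component) pairs; this shape
        is an assumption standing in for real ICC analysis results.
    Returns (all_edges, icc_edges).
    """
    edges = set()
    for sub in subgraphs:
        edges.update(sub)

    icc_edges = set()
    for sender, component in icc_links:
        # The edge starts at the function performing ICC and ends at the
        # onCreate function executed when the target component starts.
        icc_edges.add((sender, f"{component}.onCreate"))

    return edges | icc_edges, icc_edges


# Toy application with two components, invented for the example.
subgraphs = [
    [("MainActivity.onCreate", "MainActivity.sendData")],
    [("UploadService.onCreate", "UploadService.upload")],
]
icc_links = [("MainActivity.sendData", "UploadService")]

graph, icc = build_call_tracing_graph(subgraphs, icc_links)
```

After the merge, the two previously disconnected subgraphs are reachable from `MainActivity.onCreate` through the single ICC interaction edge.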

Step 4, generating the abstract flow graphs of all the software samples.

Referring to fig. 3, the specific implementation of this step is as follows:

4.1) segmenting the nodes of the call tracing graph to generate the nodes of the abstract flow graph;

4.1a) decompiling the Android application by means of an Android library to obtain the operation codes (opcodes) of each function in the application, and thus the opcode sequence corresponding to each node of the call tracing graph;

4.1b) determining whether, inside a node of the call tracing graph, an opcode calls a user-defined function or an API that sends an Intent:

if so, splitting the opcode sequence of the function at the call site, and taking each opcode subsequence obtained by the split as a node of the abstract flow graph;

if not, performing no split;
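The splitting rule of 4.1b can be sketched as follows. The representation of a function as a list of (opcode, call target) pairs, the two example sets, and the choice to cut immediately after the call are assumptions; the node names follow the "function name_V_number" format used for FIG. 4.

```python
def split_function(name, opcodes, is_split_point):
    """Cut one function's opcode sequence after every call that triggers a split.

    opcodes: list of (mnemonic, call_target_or_None) pairs (assumed layout).
    is_split_point: predicate deciding whether a call ends the current node.
    Returns a dict mapping node names to opcode subsequences.
    """
    nodes, current = [], []
    for op, target in opcodes:
        current.append(op)
        if is_split_point(op, target):
            nodes.append(current)  # the node ends with the call itself
            current = []
    if current:
        nodes.append(current)
    return {f"{name}_V_{i}": seq for i, seq in enumerate(nodes, start=1)}


# Hypothetical identifiers used only for this example.
USER_FUNCS = {"Lcom/example/Main;->helper()V"}
INTENT_APIS = {"Landroid/content/Context;->startService"}


def is_split_point(op, target):
    """Split on calls to user-defined functions or Intent-sending APIs."""
    return op.startswith("invoke") and (target in USER_FUNCS or target in INTENT_APIS)


opcodes = [
    ("const/4", None),
    ("invoke-virtual", "Lcom/example/Main;->helper()V"),
    ("move-result", None),
    ("invoke-virtual", "Landroid/content/Context;->startService"),
    ("return-void", None),
]
nodes = split_function("Main.onCreate", opcodes, is_split_point)
```

A function containing no such call yields a single node, which corresponds to the "no split" branch.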

4.2) obtaining an edge set of the abstract flow graph according to the key API list and the call tracing graph:

4.2a) performing a depth-first traversal of the call tracing graph to obtain the call paths, and filtering each call path through the key API list to find functions in the path that belong to the list; for each such function, the first node of the entry point of its call path and the last node of the function are taken as the start point and end point of a key edge, thereby generating the key edge;
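A minimal sketch of the key-edge extraction in 4.2a is given below, assuming the call tracing graph is stored as an adjacency dictionary of function names. The helpers `first_node` and `last_node`, which map a function to its abstract flow graph nodes, are hypothetical, as are the example names.

```python
def call_paths(graph, node, path=None):
    """Yield every call path from `node` to a leaf by depth-first traversal."""
    path = (path or []) + [node]
    # Skip callees already on the path so that recursion cannot loop forever.
    callees = [c for c in graph.get(node, []) if c not in path]
    if not callees:
        yield path
        return
    for callee in callees:
        yield from call_paths(graph, callee, path)


def key_edges(graph, entry_points, key_apis, first_node, last_node):
    """Create one key edge per key API found on a call path of an entry point."""
    edges = set()
    for entry in entry_points:
        for path in call_paths(graph, entry):
            for func in path:
                if func in key_apis:
                    edges.add((first_node(entry), last_node(func)))
    return edges


# Toy call tracing graph, invented for the example.
graph = {
    "Main.onCreate": ["Main.collect", "Main.render"],
    "Main.collect": ["TelephonyManager.getDeviceId"],
}
key_apis = {"TelephonyManager.getDeviceId"}
node_count = {"Main.onCreate": 3}  # assumed number of nodes per function

edges = key_edges(
    graph,
    ["Main.onCreate"],
    key_apis,
    first_node=lambda f: f"{f}_V_1",
    last_node=lambda f: f"{f}_V_{node_count.get(f, 1)}",
)
```

Paths that never reach a key API, such as the one ending in `Main.render`, contribute no key edge, which is how the filter discards behavior unrelated to the key API list.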

4.2b) extracting the Intent sending edges, adjacent edges, ICC edges, and implicit adjacent edges from the call tracing graph;

performing a depth-first traversal starting from the first node of an entry point of the call tracing graph, with that first node as the start point of an Intent sending edge; if a node ending with a call to an Intent-sending API is found during the traversal, the traversal stops and that node is taken as the end point, thereby generating the Intent sending edge;

determining whether nodes of the abstract flow graph were obtained by segmenting the same node of the call tracing graph:

if so, connecting the sequentially adjacent nodes of the abstract flow graph with directed edges, which are called adjacent edges;

if not, making no connection;

using the Android application package (APK) file as the input of ic3 and obtaining its output, then extracting each function that performs inter-component communication (ICC) through an Intent together with the component it interacts with; the node in that function ending with a call to an Intent-sending API is taken as the start point of an ICC edge in the abstract flow graph, and the first node of the onCreate function executed when the component starts is taken as the end point, thereby generating the ICC edge;

for each generated Intent sending edge, its start point is taken as the start point of an implicit adjacent edge; if a node in the abstract flow graph simultaneously satisfies the following three conditions:

it is in the same component as the start point of the Intent sending edge;

it is executed after the start point of the Intent sending edge;

it calls an API for receiving an Intent;

then that node is taken as the end point of the implicit adjacent edge, thereby generating the implicit adjacent edge;
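The three conditions for an implicit adjacent edge can be checked as in the sketch below. The `Node` record and in particular its `order` field, which stands in for "executed after", are assumptions; the text does not say how execution order is determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    component: str
    order: int             # hypothetical execution-order index within a component
    receives_intent: bool  # True if the node calls an Intent-receiving API


def implicit_adjacent_edges(intent_send_edges, nodes):
    """Link the start of each Intent sending edge to later receivers in its component."""
    by_name = {n.name: n for n in nodes}
    edges = set()
    for start_name, _end_name in intent_send_edges:
        start = by_name[start_name]
        for n in nodes:
            same_component = n.component == start.component
            runs_later = n.order > start.order
            if same_component and runs_later and n.receives_intent:
                edges.add((start.name, n.name))
    return edges


# Toy nodes, invented for the example.
nodes = [
    Node("A.onCreate_V_1", "A", 0, False),
    Node("A.onCreate_V_2", "A", 1, False),
    Node("A.onActivityResult_V_1", "A", 2, True),
    Node("B.onCreate_V_1", "B", 0, True),
]
intent_send_edges = [("A.onCreate_V_1", "A.onCreate_V_2")]
edges = implicit_adjacent_edges(intent_send_edges, nodes)
```

The node in component B is rejected by the first condition even though it receives an Intent, so only the receiver inside component A is linked.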

FIG. 4 is the abstract flow graph of an example program containing the five kinds of edges, where each node is named in the format "function name_V_number", denoting the node with that number in a function of the application; the abstract flow graph contains 5 kinds of edges: the edge marked 'a' is a key edge, the edge marked 'b' is an Intent sending edge, the edge marked 'e' is an adjacent edge, the edge marked 'c' is an ICC edge, and the edge marked 'd' is an implicit adjacent edge;

4.2c) obtaining the reverse edges corresponding to the 5 kinds of edges, taking the 10 kinds of edges as an edge set of the abstract flow graph, and deleting the isolated nodes without edge connection in all the nodes;
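Step 4.2c amounts to two simple graph operations, sketched here under the assumption that edges are stored as (source, target, type) triples; the type names are illustrative labels, not terms from the source.

```python
def finalize_edges(nodes, typed_edges):
    """Add a reverse edge for every edge, then drop nodes that touch no edge.

    nodes: list of node names.
    typed_edges: set of (src, dst, kind) triples over the 5 forward edge kinds.
    Returns (kept_nodes, all_edges), where all_edges has up to 10 kinds.
    """
    all_edges = set(typed_edges)
    for src, dst, kind in typed_edges:
        all_edges.add((dst, src, f"{kind}_reverse"))

    connected = {n for src, dst, _ in all_edges for n in (src, dst)}
    kept_nodes = [n for n in nodes if n in connected]
    return kept_nodes, all_edges


# Toy input, invented for the example: g_V_1 has no edge and is removed.
nodes = ["f_V_1", "f_V_2", "g_V_1"]
typed_edges = {("f_V_1", "f_V_2", "adjacent")}
kept, edges = finalize_edges(nodes, typed_edges)
```

Adding reverse edges lets information flow in both directions during message passing, while the distinct `_reverse` kinds keep the original direction recoverable from the edge label.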

4.3) adding labels to all nodes and edges after the isolated nodes are deleted to form an abstract flow diagram;

4.3a) obtaining the set of all operation codes (opcodes) from the official Android website, the set comprising 232 opcodes, and encoding them sequentially: the first opcode in the set is encoded as 1, the second as 2, and so on, so that every opcode is encoded as a number;

4.3b) encoding the opcode sequence of each node according to the encoded opcode set, and taking the encoded values of all opcodes in the node as the node's label. Because nodes contain different numbers of opcodes, the generated labels differ in length; labels longer than D are truncated and labels shorter than D are padded with 0, which carries no information, so that all node labels have the same length;

4.3c) generating labels of the edges by one-hot coding the edge types, numbering the 10 edge types from 1 to 10, wherein the labels of the edges are vectors with the length of 10, only the bits corresponding to the edge types in the vectors are 1, and the rest bits are 0.
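The labeling in 4.3a to 4.3c can be sketched as follows. The five-entry opcode list is a toy stand-in for the 232-opcode set, and the function names and the value chosen for the label length D are assumptions.

```python
def build_opcode_index(opcode_set):
    """Encode opcodes sequentially starting at 1, leaving 0 free for padding."""
    return {op: i for i, op in enumerate(opcode_set, start=1)}


def encode_node(opcodes, opcode_index, length):
    """Encode a node's opcode sequence, then truncate or zero-pad to `length`."""
    codes = [opcode_index[op] for op in opcodes][:length]
    return codes + [0] * (length - len(codes))


def one_hot_edge(edge_type, num_types=10):
    """Return a one-hot vector for an edge type numbered from 1 to num_types."""
    if not 1 <= edge_type <= num_types:
        raise ValueError("edge type out of range")
    vec = [0] * num_types
    vec[edge_type - 1] = 1
    return vec


# Toy opcode set, invented for the example.
index = build_opcode_index(["nop", "move", "const/4", "invoke-virtual", "return-void"])
node_ops = ["const/4", "invoke-virtual", "return-void"]

padded = encode_node(node_ops, index, length=5)
truncated = encode_node(node_ops, index, length=2)
edge_label = one_hot_edge(3)
```

Starting the encoding at 1 is what makes 0 usable as a padding value that cannot be confused with a real opcode.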

Step 5, training the graph neural network GNN on the abstract flow graphs of all the software samples using the back-propagation algorithm and gradient descent to obtain the trained GNN.

(5.1) marking the abstract flow graphs of benign Android applications as benign and the abstract flow graphs of malicious Android applications as malicious to obtain labeled abstract flow graphs;

(5.2) setting the maximum number of training rounds E of the graph neural network GNN to 65, randomly initializing the GNN parameters, and inputting the labeled abstract flow graphs into the GNN;

(5.3) the GNN outputs the benign and malicious probabilities of a software sample, and the class with the higher probability is taken as the GNN's prediction for the sample; the loss function is calculated from the prediction and the label of the corresponding abstract flow graph, the gradients of all parameters in the network are calculated from the deep layers to the shallow layers, and the index F1 for evaluating the performance of the GNN is calculated:

classifying the GNN model's predictions for the software samples:

if the GNN model predicts malicious software as malicious, the sample is marked as a true positive (TP);

if the GNN model predicts benign software as malicious, the sample is marked as a false positive (FP);

if the GNN model predicts malicious software as benign, the sample is marked as a false negative (FN);

calculating the precision P from TP and FP:

$$P = \frac{TP}{TP + FP}$$

calculating the recall R from TP and FN:

$$R = \frac{TP}{TP + FN}$$

calculating the index F1 for evaluating the performance of the GNN from the precision P and the recall R:

$$F1 = \frac{2 \times P \times R}{P + R}$$

(5.4) carrying out iterative updating on the parameters in the network along the opposite direction of the parameter gradient in the network so as to gradually reduce the loss function;

(5.5) repeating steps (5.3) to (5.4) until the maximum number of training rounds is reached, and selecting, among the E training rounds, the network model with the best evaluation index F1 as the trained GNN model.
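The evaluation and model selection in steps (5.3) and (5.5) can be sketched without any GNN library. Here malicious is encoded as 1 and benign as 0, and the per-round predictions are invented toy data; the training step itself is deliberately left out, since the network architecture is not specified here.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute F1 as the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_round(predictions_per_round, y_true):
    """Return (index, F1) of the training round with the highest F1."""
    scores = [f1_score(y_true, preds) for preds in predictions_per_round]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy labels and predictions for three training rounds.
y_true = [1, 1, 0, 0]
predictions_per_round = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
selected, selected_f1 = best_round(predictions_per_round, y_true)
```

In a real training loop, the model parameters of each round would be checkpointed so that the round selected by `best_round` can be restored afterwards.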

Step 6, classifying Android applications with the trained GNN.

For an Android application of unknown security, the abstract flow graph of the application to be detected is generated through steps 3 and 4 and input into the graph neural network GNN trained in step 5, which outputs the benign and malicious probabilities of the application to be classified; the class with the maximum probability is taken as the final judgment type.

The foregoing description is only an example of the present invention and is not intended to limit the invention, so that it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. An android application program classification method based on an abstract flow graph and a graph neural network is characterized by comprising the following steps:

(1) malicious and benign Android application software samples are downloaded from a related sample library and domestic and foreign mainstream application markets, wherein the proportion of the benign samples to the malicious samples is 1: 1, the number of each type of samples is not less than A, and A is a positive integer greater than or equal to 1;

(3) generating a call tracing graph of all the software samples in (1);

(4) generating abstract flow diagrams of all software samples;

(4c) adding labels to all nodes and edges to form an abstract flow diagram;

(6) for an Android application of unknown security, generating the abstract flow graph of the application to be detected by the same method as in steps (3) to (4), inputting the abstract flow graph into the trained graph neural network GNN, outputting the benign and malicious probabilities of the application to be classified, and taking the class with the higher probability as the final judgment type.

2. The method according to claim 1, wherein a list of key Application Program Interfaces (APIs) related to malicious behaviors and vulnerabilities of the Android application is constructed in (2) and is implemented as follows:

(2a) using a web crawler technology, crawling sentences describing Android vulnerabilities from a vulnerability repository, collecting vulnerable Android code samples from a code repository, establishing a sentence library for describing the Android vulnerabilities and malicious behaviors, performing text mining on the sentence library, and extracting M keywords from the sentence library, wherein M is a positive integer greater than or equal to 1;

(2b) sorting other keywords except Java keywords, built-in types and variable names in the keywords extracted in the step (2a) by using a word frequency-inverse document frequency TF-IDF method;

(2c) selecting the top N keywords, where 1 ≤ N < M, and filtering the API lists in official online documentation and some common analysis tools through these top-ranked keywords to finally obtain a key API list associated with Android vulnerabilities and malicious behaviors.

3. The method according to claim 1, wherein connecting the nodes that perform component communication through Intent in different subgraphs in (3b) comprises taking the function that performs inter-component interaction (ICC) as the start point of an ICC interaction edge, taking the onCreate function executed by the component at startup as the end point of the ICC interaction edge, and bridging the subgraphs of the call tracing graph through the ICC interaction edges to obtain the call tracing graph of the whole Android application.

4. The method of claim 1, wherein the nodes of the call tracing graph are segmented in (4a) to generate abstract flow graph nodes, and the method is implemented as follows:

(4a1) decompiling the application package (APK) by means of an Android library to obtain the operation codes (opcodes) corresponding to each node of the call tracing graph;

(4a2) determining whether, inside a node of the call tracing graph, an opcode calls a user-defined function or an API that sends an Intent:

if not, then no segmentation is performed.

5. The method of claim 1, wherein the filtering of the call path through the key API list in (4b) to obtain the key edge is to find a function belonging to the key API list in the call path, abstract a first node in an entry point of the call path and a last node in the function as a start point and an end point of the key edge if found, and add the key edge to a key edge set of the abstract flow graph.

6. The method of claim 1, wherein the extracting of Intent sending edge, neighboring edge, ICC edge, implicit neighboring edge from the call trace graph in (4b) is implemented as follows:

(4b1) performing a depth-first traversal starting from the first node of an entry point of the call tracing graph, with that first node as the start point of an Intent sending edge; if a node ending with a call to an Intent-sending API is found during the traversal, stopping the traversal, taking that node as the end point of the Intent sending edge, and adding the Intent sending edge to the Intent sending edge set of the abstract flow graph;

(4b2) determining whether the generated abstract flow graph nodes were obtained by segmenting the same node of the call tracing graph:

if so, connecting the nodes of the abstract flow graph which are adjacent in sequence by using a directed edge, and adding the directed edge to an adjacent edge set of the abstract flow graph;

if not, not performing connection;

(4b3) according to the function for performing inter-component communication ICC through Intent and the component interacting with the function obtained in the step (3b), taking a node in the function, which takes an API for calling and sending Intent as an end, as a starting point of an ICC edge in the abstract flow graph, taking a first node of an onCreate function executed when the component is started as an end point of the ICC edge, and adding the ICC edge into an ICC edge set of the abstract flow graph;

(4b4) regarding any one of the generated Intent sending edges in (4b1), taking the starting point of the Intent sending edge as the starting point of the implicit adjacent edge, and if one node in the abstract flow graph simultaneously meets the following three conditions:

within the same component as the origin of the Intent send edge;

executed after the origin of the Intent send edge;

calling an application program interface API for receiving Intent;

such a node is taken as the end point of the implicit neighboring edge and the implicit neighboring edge is added to the set of implicit neighboring edges of the abstract flow graph.

7. The method of claim 1, wherein (4c) labels all nodes and edges of the abstract flow graph as follows:

(4c1) acquiring a set of all operation codes opcode from an Android official network, taking the set as a global dictionary of the operation codes opcode, and encoding the opcode into a number;

(4c2) coding an opcode sequence of each node of the abstract flow graph according to the opcode dictionary, and taking the coding values of all opcodes in the nodes as labels of the nodes;

(4c3) and performing one-hot coding on the type of the edge to obtain the label of the edge.

8. The method of claim 1, wherein the GNN is trained using a back propagation algorithm and a gradient descent method in (5) as follows:

(5a) marking the abstract flow graph of the benign Android application program as benign, and marking the abstract flow graph of the malicious Android application program as malicious to obtain the abstract flow graph with the label;

(5b) setting the maximum training times E of the GNN network, randomly initializing GNN network parameters, and inputting the abstract flow graph with labels into the GNN;

(5c) calculating the loss function from the output of the GNN and the label of the corresponding abstract flow graph, calculating the gradients of all parameters in the network from the deep layers to the shallow layers, and calculating the index for evaluating the performance of the GNN, namely the harmonic mean F1 of the precision P and the recall R;

(5d) iteratively updating the parameters in the network along the opposite direction of the parameter gradient in the network to gradually reduce the loss function;

(5e) repeating steps (5c) to (5d) until the maximum number of training rounds is reached, and selecting, among the E training rounds, the network model with the best evaluation index F1 as the trained GNN model.