CN114491530B

CN114491530B - Android application classification method based on abstract flow graph and graph neural network

Info

Publication number: CN114491530B
Application number: CN202111566330.3A
Authority: CN
Inventors: 孙聪; 史鉴; 王培丞; 伍亚飞; 马建峰
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2024-05-17
Anticipated expiration: 2041-12-20
Also published as: CN114491530A

Abstract

The invention discloses an Android application classification method based on an abstract flow graph and a graph neural network. It mainly addresses two shortcomings of the prior art: weak ability to identify unknown malicious applications, and inability to accurately identify malicious program behavior. The implementation scheme is as follows: download malicious and benign Android application samples from relevant sample libraries and mainstream application markets at home and abroad; construct a list of key application program interfaces (APIs) related to malicious behaviors and vulnerabilities of Android applications; generate a call trace graph of each Android application, and construct its abstract flow graph from the call trace graph and the key API list; add labels to the abstract flow graphs and train a graph neural network (GNN) on the labeled abstract flow graphs; and classify Android applications of unknown security as benign or malicious using the trained GNN. The method generalizes well, discriminates strongly between the behaviors of benign and malicious applications, and can be used to detect malicious applications.

Description

Android application classification method based on abstract flow graph and graph neural network

Technical Field

The invention belongs to the technical field of network security, and particularly relates to an Android application classification method that can be used to detect malicious applications.

Background

The Android operating system is a Linux-based operating system used mainly on mobile devices such as smartphones and tablet computers, and is currently the most widely used mobile operating system. In 2021 Android held more than 80% of the smartphone market. Because of this high market share and the open-source nature of the Android system, Android applications have become a prime target of malicious attacks. Malware on Android smartphones has therefore become a major security problem in daily life, seriously threatening users' personal information security and even national information security.

At present, detection technologies for malicious Android applications fall mainly into static detection and dynamic detection. Static detection extracts features of the application, such as requested permissions, Intent information and sensitive application program interface (API) calls, from the program without running the Android application.

Dynamic detection executes the software in a real environment and records its runtime behavior. It achieves high accuracy on existing malware but cannot identify novel malicious applications, which makes it unsuitable for detecting previously unrecorded malware; it is also time-consuming, complex and impractical for real-world use, so static analysis is generally adopted.

Conventional static and dynamic detection techniques both require security experts to manually define the features to extract, which is labor-intensive and error-prone. In recent years, therefore, more and more Android malware detection approaches have turned to deep learning for automatic detection; owing to its high detection accuracy, deep learning is increasingly regarded as a powerful and effective tool in the security field.

In the paper "Permission-Based Android Malware Application Detection Using Multi-Layer Perceptron" (O.S. Jannath Nisha, S. Mary Saira Bhanu, DOI:10.1007/978-3-030-16660-1_36), the researchers use the permissions requested by an Android application as features and train a multi-layer perceptron to identify malicious programs.

In the patent document with application number CN 201811024430.1, Civil Aviation University of China proposes an Android malicious application detection method based on a dual-channel convolutional neural network. The method decompiles the APK installation package, extracts the opcode sequence and the instruction function sequence as features, feeds them to a convolutional neural network for training, and then uses the trained network to examine the application under test. Because the opcode sequence and instruction function sequence contain a great deal of redundant information unrelated to malicious code and vulnerabilities, these features discriminate poorly between benign and malicious applications.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides an Android application classification method based on an abstract flow graph and a graph neural network, so as to effectively distinguish benign from malicious applications, capture the malicious behaviors of unknown malicious programs, and improve generalization capability.

The technical idea of the invention is as follows: features which can fully reflect sensitive API call relations of the Android application program and interaction behaviors of sensitive information among components are extracted from the Android application program, and the graph neural network is trained through the features which embody program semantics and structures, so that malicious application programs are identified. The implementation scheme comprises the following steps:

(1) Downloading malicious and benign Android application samples from relevant sample libraries and mainstream application markets at home and abroad, wherein the ratio of benign to malicious samples is 1:1, the number of samples of each type is not less than A, and A is a positive integer greater than or equal to 1;

(2) Constructing a key application program interface API list related to malicious behaviors and vulnerabilities of the Android application program;

(3) Generating a call trace graph of all the software samples in (1);

(3a) Decompiling the Android application package (APK) file and taking the decompiled result as the input of a modified IntelliDroid tool, so that the tool outputs all call paths that originate from an application entry point, these paths serving as a subgraph of the call trace graph;

(3b) Extracting the inter-component communication (ICC) information of the Android application with the ic3 tool, obtaining the functions that perform ICC through an Intent and the components they interact with, connecting accordingly the nodes in different subgraphs that communicate through an Intent, and generating the call trace graph of the whole Android application;

(4) Generating abstract flow graphs of all software samples;

(4a) Dividing the nodes of the call trace graph to generate nodes of the abstract flow graph;

(4b) Performing a depth-first traversal of the call trace graph to obtain call paths, filtering the call paths through the key API list from step (2) to obtain key edges, extracting Intent sending edges, adjacent edges, ICC edges and implicit adjacent edges from the call trace graph, obtaining the reverse edges of these 5 edge types, adding the resulting 10 edge types to the abstract flow graph as its edge set, and deleting isolated nodes that have no edge connection;

(4c) Adding labels to all nodes and edges to form an abstract flow graph;

(5) Generating the abstract flow graphs of the software samples in step (1) through steps (3)-(4), taking them as the input of the graph neural network (GNN), and training the GNN with the back propagation algorithm and gradient descent to obtain a trained GNN;

(6) For an Android application of unknown security, generating its abstract flow graph by the same method as in (3)-(4), inputting the abstract flow graph into the trained GNN, outputting the benign and malicious probabilities of the application, and taking the class with the highest probability as the final classification.

Compared with the prior art, the invention has the following advantages:

Firstly, by constructing the key application program interface API list, the sensitive API calling behavior of the Android application program can be extracted more accurately, so that more malicious behavior characteristics of the Android malicious software are identified, and the detection of unknown malicious Android application software is realized.

Secondly, by generating the abstract flow graph of the Android application, the invention extracts and abstracts both the application's sensitive API calling behavior and its inter-component communication (ICC) information, so the behavioral characteristics of malicious programs are described more accurately and benign and malicious applications are clearly distinguished.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a sub-flowchart of generating a call trace graph in accordance with the present invention;

FIG. 3 is a sub-flowchart of generating an abstract flow graph in accordance with the present invention;

FIG. 4 is a schematic diagram of an abstract flow graph generated in the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to FIG. 1, the implementation steps of the present invention are as follows:

Step 1. Download and collect benign and malicious Android application samples.

Malicious and benign Android application samples are downloaded from relevant sample libraries and mainstream application markets at home and abroad, with a 1:1 ratio of benign to malicious samples.

In this embodiment, 9873 benign Android applications are collected from public websites, and 9873 malicious Android applications are collected from the CICInvesAndMal, Drebin and DroidAnalytics data sets and the VirusShare sample library.

Step 2. Construct a key API list related to malicious behaviors and vulnerabilities of Android applications.

2.1 Use web crawler technology to crawl sentences describing Android vulnerabilities from the official CVE website and the Exploit Database website, collect vulnerable Android code samples from the Stack Overflow website, and build a sentence library describing Android vulnerabilities: $malicious\_lib = \{V_1, V_2, ..., V_i, ..., V_n\}$, where $V_i$ is the sentence describing the i-th Android vulnerability, $1 \le i \le n$, and n is the total number of sentences describing Android vulnerabilities;

2.2 Perform text mining on malicious_lib, extract keywords from it, and calculate the term frequency-inverse document frequency (TF-IDF) value P of each keyword.

TF-IDF evaluates the importance of a word to one document within a corpus. It is calculated as follows:

Calculate the term frequency of a given keyword in $V_i$: $TF = h / s$, where h is the number of times the keyword appears in the sentence $V_i$ and s is the total number of words in $V_i$;

Calculate the inverse document frequency of the keyword: $IDF = \log(n / B)$, where n is the total number of sentences and B is the total number of sentences containing the keyword;

Calculate the TF-IDF value P of the keyword in $V_i$ from the term frequency TF and the inverse document frequency IDF:

$P = TF \times IDF$, where $\times$ denotes multiplication;

2.3 Following 2.2, calculate the TF-IDF value P of every keyword other than Java keywords, built-in types and variable names, and sort the keywords by P in descending order; in total 10782 keywords are collected and ranked;

2.4 Select the top 150 keywords, which have good power to discriminate between Android application categories, and use them to search the API lists in the official online documentation and in existing tools for all APIs containing these keywords. The result is a list of key APIs associated with Android vulnerabilities and malicious behaviors, containing 632 key APIs.
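The keyword ranking and API filtering of step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the whitespace tokenizer, the natural-log IDF form $\log(n / B)$, the choice to rank each keyword by its highest P over all sentences, and the function names `tfidf_scores`, `top_keywords` and `filter_apis` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(sentences, excluded=frozenset()):
    """Return {keyword: highest TF-IDF value P over all sentences}.

    Assumptions: whitespace tokenization, IDF = ln(n / B), and ranking
    a keyword by its best per-sentence score.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # B for each keyword: number of sentences that contain it.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    best = {}
    for words in tokenized:
        s = len(words)  # total number of words in this sentence
        for word, h in Counter(words).items():
            if word in excluded:  # e.g. Java keywords, built-in types
                continue
            p = (h / s) * math.log(n / doc_freq[word])
            best[word] = max(best.get(word, 0.0), p)
    return best


def top_keywords(sentences, k, excluded=frozenset()):
    """Sort keywords by P in descending order and keep the top k."""
    scores = tfidf_scores(sentences, excluded)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def filter_apis(api_names, keywords):
    """Keep every API whose name contains at least one keyword."""
    return [a for a in api_names if any(k in a.lower() for k in keywords)]


sentences = [
    "sendTextMessage leaks sms data",
    "camera leaks data",
    "sms sent without consent",
]
keywords = top_keywords(sentences, 1)
key_apis = filter_apis(
    ["android.hardware.Camera.open", "java.lang.String.length"], keywords
)
```

In the embodiment the same procedure would run with k = 150 over the crawled sentence library; the three sentences here are made-up inputs.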

Step 3. Generate the call trace graphs of all the software samples from step 1.

Referring to FIG. 2, this step is implemented as follows:

3.1 Decompile the Android application package (APK) file and use the decompiled result as the input of a modified IntelliDroid tool. The file output by the unmodified IntelliDroid tool contains only the entry point of a call path and the final sensitive API call; the tool is modified so that it outputs all complete call paths originating from an application entry point, which serve as a subgraph of the call trace graph;

3.2 Extract the inter-component communication (ICC) information of the Android application with the ic3 tool, obtaining each function that performs ICC through an Intent and the component it interacts with. The function performing ICC is the starting point of an ICC interaction edge, and the onCreate function executed when the component starts is its end point. The nodes of different subgraphs that communicate through an Intent are connected by these ICC interaction edges, producing the call trace graph of the whole Android application.
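How the per-entry-point subgraphs are bridged by ICC interaction edges can be illustrated with a small sketch. It does not invoke IntelliDroid or ic3; the input shapes (`subgraphs` as lists of caller/callee pairs, `icc_links` as sender/component pairs), the `Component.onCreate` naming scheme and the function name `build_call_trace_graph` are hypothetical stand-ins for the parsed tool outputs.

```python
def build_call_trace_graph(subgraphs, icc_links):
    """Merge per-entry-point call subgraphs and bridge them with ICC edges.

    subgraphs: {entry_point: [(caller, callee), ...]}  (assumed shape)
    icc_links: [(sending_function, target_component), ...]  (assumed shape)
    Returns {function: sorted list of (successor, edge_kind)}.
    """
    graph = {}
    for edges in subgraphs.values():
        for caller, callee in edges:
            graph.setdefault(caller, set()).add((callee, "call"))
            graph.setdefault(callee, set())
    for sender, component in icc_links:
        # The ICC interaction edge ends at the onCreate function that the
        # target component executes when it starts.
        target = f"{component}.onCreate"
        graph.setdefault(sender, set()).add((target, "icc"))
        graph.setdefault(target, set())
    return {node: sorted(out) for node, out in graph.items()}


subgraphs = {
    "MainActivity.onCreate": [("MainActivity.onCreate", "MainActivity.sendData")],
    "UploadService.onCreate": [("UploadService.onCreate", "UploadService.upload")],
}
icc_links = [("MainActivity.sendData", "UploadService")]
graph = build_call_trace_graph(subgraphs, icc_links)
```

After the merge, the two subgraphs that were previously disconnected are reachable from `MainActivity.onCreate` through the single ICC interaction edge.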

Step 4. Generate the abstract flow graphs of all the software samples.

Referring to FIG. 3, this step is implemented as follows:

4.1 Dividing the nodes of the call trace graph to generate nodes of an abstract flow graph;

4.1 a) Decompile the Android application with the Androguard library to obtain the opcodes of each function in the application, and thereby the opcodes corresponding to each call trace graph node;

4.1 b) Determine whether the opcodes in a call trace graph node call a user-defined function or an Intent-sending API:

If so, split the function's opcodes at the call site and take each resulting opcode sequence as a node of the abstract flow graph;

If not, do not split;
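The node segmentation of step 4.1 can be sketched as below. The sketch assumes each function is already available as a list of `(opcode, target)` pairs (in practice these would come from the decompiler), that the cut is placed immediately after the calling instruction, and that nodes are numbered from 1 in the "function name_V_number" format; `split_function` is a hypothetical helper name.

```python
def split_function(name, instructions, user_functions, intent_send_apis):
    """Split one function's opcode sequence into abstract flow graph nodes.

    instructions: list of (opcode, target), where target is the invoked
    method for invoke-* opcodes and None otherwise (assumed shape).
    A cut is made right after every call to a user-defined function or
    to an Intent-sending API; other calls do not split the sequence.
    """
    nodes, current = [], []
    for opcode, target in instructions:
        current.append(opcode)
        is_call = opcode.startswith("invoke")
        if is_call and (target in user_functions or target in intent_send_apis):
            nodes.append(current)
            current = []
    if current:
        nodes.append(current)
    return {f"{name}_V_{i}": ops for i, ops in enumerate(nodes, start=1)}


instructions = [
    ("const-string", None),
    ("invoke-virtual", "Lcom/app/Main;->collect()V"),
    ("new-instance", None),
    ("invoke-virtual", "Landroid/content/Context;->startService"),
    ("return-void", None),
]
nodes = split_function(
    "onCreate",
    instructions,
    user_functions={"Lcom/app/Main;->collect()V"},
    intent_send_apis={"Landroid/content/Context;->startService"},
)
```

Here the function is split into three nodes, two of which end with a call; a function containing no such calls stays a single node.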

4.2 Obtain the edge set of the abstract flow graph from the key API list and the call trace graph:

4.2 a) Perform a depth-first traversal of the call trace graph to obtain the call paths, and filter them through the key API list to find functions in each call path that belong to the list. For each function found, take the first node of the call path's entry point as the starting point and the last node of the found function as the end point, and generate a key edge;

4.2 b) Extract the Intent sending edges, adjacent edges, ICC edges and implicit adjacent edges from the call trace graph;

Perform a depth-first traversal starting from the first node of an entry point of the call trace graph, taking that first node as the starting point of an Intent sending edge. If the traversal finds a node that ends with a call to an Intent-sending API, stop the traversal, take that node as the end point, and generate the Intent sending edge;

Determine whether abstract flow graph nodes were split from the same call trace graph node:

If so, connect the sequentially adjacent abstract flow graph nodes with directed edges, called adjacent edges;

If not, do not connect them;

Take the Android application package (APK) file as the input of ic3 and, from its output, extract each function that performs inter-component communication (ICC) through an Intent and the component it interacts with. The node of that function that ends with a call to an Intent-sending API is the starting point of an ICC edge in the abstract flow graph, and the first node of the onCreate function executed when the component starts is its end point; generate the ICC edge;

For each generated Intent sending edge, take its starting point as the starting point of an implicit adjacent edge. If a node in the abstract flow graph simultaneously satisfies the following three conditions:

it is in the same component as the starting point of the Intent sending edge;

it executes after the starting point of the Intent sending edge;

it calls an API for receiving an Intent;

then that node is taken as the end point, and the implicit adjacent edge is generated;

FIG. 4 shows the abstract flow graph of an example program, which contains the five edge types described above. Each node is named in the format "function name_V_number", where the number identifies the node within that function of the application. The five edge types are marked as follows: edges marked 'a' are key edges, edges marked 'b' are Intent sending edges, edges marked 'e' are adjacent edges, edges marked 'c' are ICC edges, and edges marked 'd' are implicit adjacent edges;

4.2 c) obtaining the reverse edges corresponding to the 5 edges, taking the 10 edges as an edge set of the abstract flow graph, and deleting isolated nodes without edge connection in all nodes;
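Part of step 4.2 can be sketched in a few lines: key-edge extraction by depth-first traversal, followed by the addition of reverse edges and the removal of isolated nodes. The sketch covers only key edges (the other four edge types would be added to the same set before finalizing). The `segments` mapping from each function to its ordered abstract nodes, the `_rev` suffix for reverse edge types and the function names are assumptions for illustration.

```python
def call_paths(graph, entry):
    """Enumerate call paths from entry by iterative depth-first traversal.

    A function already on the current path is not revisited, so
    recursive call cycles terminate (an assumption of this sketch).
    """
    paths, stack = [], [(entry, (entry,))]
    while stack:
        node, path = stack.pop()
        successors = [n for n in graph.get(node, ()) if n not in path]
        if not successors:
            paths.append(path)
        for nxt in successors:
            stack.append((nxt, path + (nxt,)))
    return paths


def key_edges(graph, entry, key_apis, segments):
    """Key edge: first node of the entry point -> last node of a key API."""
    edges = set()
    for path in call_paths(graph, entry):
        for func in path:
            if func in key_apis:
                edges.add((segments[entry][0], segments[func][-1], "key"))
    return edges


def finalize(nodes, edges):
    """Add one reverse edge per edge, then drop nodes with no edges."""
    all_edges = set(edges) | {(dst, src, kind + "_rev") for src, dst, kind in edges}
    connected = {n for src, dst, _ in all_edges for n in (src, dst)}
    return [n for n in nodes if n in connected], all_edges


graph = {
    "Main.onCreate": ["Main.collect", "Main.log"],
    "Main.collect": ["SmsManager.sendTextMessage"],
    "Main.log": [],
}
segments = {
    "Main.onCreate": ["Main.onCreate_V_1", "Main.onCreate_V_2"],
    "Main.collect": ["Main.collect_V_1"],
    "Main.log": ["Main.log_V_1"],
    "SmsManager.sendTextMessage": ["SmsManager.sendTextMessage_V_1"],
}
edges = key_edges(graph, "Main.onCreate", {"SmsManager.sendTextMessage"}, segments)
all_nodes = [n for seg in segments.values() for n in seg]
kept_nodes, all_edges = finalize(all_nodes, edges)
```

Because this toy input produces only a key edge, every node not touched by it is removed as isolated; with adjacent edges included, nodes such as `Main.onCreate_V_2` would remain connected.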

4.3 Tagging all nodes and edges after the isolated nodes are deleted to form an abstract flow graph;

4.3 a) Obtain the set of all opcodes from the official Android website, 232 opcodes in total, and encode them sequentially: the first opcode in the set is encoded as 1, the second as 2, and so on, so that every opcode is encoded as a number;

4.3 b) Encode the opcode sequence of each node using the encoded opcode set, and take the code values of all opcodes in the node as the node's label. Because nodes contain different numbers of opcodes, the generated labels differ in length: labels longer than D are truncated and labels shorter than D are padded. Since 0 carries no information, labels are padded with 0 so that all node labels have the same length;

4.3 c) Generate edge labels by one-hot encoding the edge types: number the 10 edge types from 1 to 10, so that each edge label is a vector of length 10 in which only the position corresponding to the edge type is 1 and all other positions are 0.
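The labeling rules of step 4.3 translate directly into code. The sketch below assumes right-side zero padding, a caller-supplied length D (the text does not give its value), an arbitrary ordering of the 10 edge types, and a three-opcode toy dictionary in place of the full 232-opcode set; all names are illustrative.

```python
def build_opcode_dictionary(opcodes):
    """Encode opcodes sequentially: first -> 1, second -> 2, and so on."""
    return {op: i for i, op in enumerate(opcodes, start=1)}


def node_label(opcode_seq, dictionary, length):
    """Encode a node's opcodes, truncate to `length`, pad with 0 on the right."""
    codes = [dictionary[op] for op in opcode_seq][:length]
    return codes + [0] * (length - len(codes))


# Five forward edge types plus their reverses; the ordering is an assumption.
EDGE_TYPES = ["key", "intent_send", "adjacent", "icc", "implicit_adjacent"]
EDGE_TYPES += [t + "_rev" for t in EDGE_TYPES]


def edge_label(edge_type):
    """One-hot vector of length 10 with a single 1 at the edge type's position."""
    vec = [0] * len(EDGE_TYPES)
    vec[EDGE_TYPES.index(edge_type)] = 1
    return vec


dictionary = build_opcode_dictionary(["nop", "move", "return-void"])
short_label = node_label(["move", "nop"], dictionary, 4)
long_label = node_label(["move", "nop", "return-void", "move", "nop"], dictionary, 4)
icc_label = edge_label("icc")
```

Reserving 0 for padding is what allows the sequential codes to start at 1, as the step describes.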

Step 5. Train the graph neural network (GNN) on the abstract flow graphs of all the software samples using the back propagation algorithm and gradient descent to obtain a trained GNN.

(5.1) Label the abstract flow graphs of benign Android applications as benign and those of malicious Android applications as malicious, obtaining labeled abstract flow graphs;

(5.2) Set the maximum number of training epochs of the GNN to E = 65, randomly initialize the parameters of the GNN, and input the labeled abstract flow graphs into the GNN;

(5.3) The GNN outputs the benign and malicious probabilities of a software sample, and the class with the larger probability is taken as the GNN's prediction for that sample. Calculate the loss function from the predictions and the labels of the corresponding abstract flow graphs, calculate the gradients of all network parameters from the deep layers to the shallow layers, and calculate the index F1 used to evaluate the performance of the GNN:

Categorize the GNN model's predictions for the sample software:

if the GNN model predicts malicious software as malicious, record a true positive (TP);

if the GNN model predicts benign software as malicious, record a false positive (FP);

if the GNN model predicts malicious software as benign, record a false negative (FN);

Calculate the precision from TP and FP: $P = \frac{TP}{TP + FP}$

Calculate the recall from TP and FN: $R = \frac{TP}{TP + FN}$

Calculate the evaluation index of the GNN from the precision P and the recall R: $F1 = \frac{2 \times P \times R}{P + R}$

(5.4) Iteratively update the network parameters in the direction opposite to their gradients so that the loss function gradually decreases;

(5.5) Repeat steps (5.3)-(5.4) until the maximum number of training epochs is reached, and select the network model with the best evaluation index F1 among the E epochs as the trained GNN model.
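The evaluation index and the model selection of steps (5.3) and (5.5) can be sketched independently of any particular GNN library. In the sketch below, F1 is the harmonic mean of precision and recall with "malicious" as the positive class. The `train_epoch` and `predict` callables are hypothetical placeholders for one epoch of back propagation plus gradient descent and for GNN inference, and the text does not state on which data F1 is evaluated, so the evaluation set is left to the caller.

```python
def f1_score(predictions, labels, positive="malicious"):
    """F1 = 2PR / (P + R), with P = TP/(TP+FP) and R = TP/(TP+FN)."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def train_and_select(train_epoch, predict, graphs, labels, max_epochs):
    """Run max_epochs epochs and keep the model state with the best F1.

    train_epoch() -> model state after one more epoch (hypothetical hook)
    predict(state, graphs) -> list of predicted classes (hypothetical hook)
    """
    best_f1, best_state = -1.0, None
    for _ in range(max_epochs):
        state = train_epoch()
        f1 = f1_score(predict(state, graphs), labels)
        if f1 > best_f1:
            best_f1, best_state = f1, state
    return best_state, best_f1


# Toy stand-ins: the "model state" is just a counter that improves each epoch.
states = iter([0, 1, 2])
graphs = [None, None]
labels = ["malicious", "malicious"]
best_state, best_f1 = train_and_select(
    train_epoch=lambda: next(states),
    predict=lambda s, g: ["malicious"] * s + ["benign"] * (len(g) - s),
    graphs=graphs,
    labels=labels,
    max_epochs=3,
)
```

With E = 65 as in the embodiment, `max_epochs` would be 65 and the hooks would wrap the actual GNN training and inference code.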

Step 6. Classify Android applications with the trained GNN.

For an Android application of unknown security, generate its abstract flow graph through steps 3 and 4, input it into the GNN trained in step 5, output the benign and malicious probabilities of the application, and take the class with the highest probability as the final classification.

The above description is only one specific example of the invention and does not constitute any limitation of the invention, and it will be apparent to those skilled in the art that various modifications and changes in form and details may be made without departing from the principle and construction of the invention, but these modifications and changes based on the idea of the invention remain within the scope of the claims of the invention.

Claims

1. An Android application classification method based on an abstract flow graph and a graph neural network, characterized by comprising the following steps:

(1) Downloading malicious and benign Android application software samples from a sample library and an application market, wherein the proportion of the benign samples to the malicious samples is 1:1, the number of samples of each type is not less than A, and A is a positive integer greater than or equal to 1;

(3) Generating a call trace graph of all the software samples in (1);

(4) Generating abstract flow graphs of all software samples;

(4c) Adding labels to all nodes and edges to form an abstract flow graph;

(6) For an Android application of unknown security, generating its abstract flow graph by the same method as in (3)-(4), inputting the abstract flow graph into the trained graph neural network (GNN), outputting the benign and malicious probabilities of the application, and taking the class with the larger probability as the final classification.

2. The method of claim 1, wherein constructing a list of key application program interfaces API related to malicious behavior and vulnerabilities of the Android application in (2) is accomplished by:

(2a) Using web crawler technology to crawl sentences describing Android vulnerabilities from vulnerability repositories, collecting vulnerable Android code samples from code repositories, building a sentence library describing Android vulnerabilities and malicious behaviors, performing text mining on the sentence library, and extracting M keywords from it, where M is a positive integer greater than or equal to 1;

(2b) Ordering other keywords except Java keywords, built-in types and variable names in the keywords extracted in the step (2 a) by using a word frequency-inverse document frequency TF-IDF method;

(2c) Selecting the top N ranked keywords, where N is a positive integer greater than or equal to 1 and N < M, and filtering the API lists in the official online documentation and analysis tools by these top N keywords, finally obtaining a key API list associated with Android vulnerabilities and malicious behaviors.

3. The method of claim 1, wherein connecting the nodes in different subgraphs that communicate through an Intent in (3b) takes the function performing inter-component communication (ICC) as the starting point of an ICC interaction edge and the onCreate function executed when the component starts as its end point, and bridges the subgraphs of the call trace graph through the ICC interaction edges to obtain the call trace graph of the whole Android application.

4. The method of claim 1, wherein partitioning the call trace graph nodes in (4a) to generate the abstract flow graph nodes is accomplished by:

(4a1) Decompiling the application package (APK) with the Androguard library to obtain the opcodes corresponding to each call trace graph node;

(4a2) Determining whether the opcodes in a call trace graph node call a user-defined function or an Intent-sending API:

If not, no segmentation is performed.

5. The method of claim 1, wherein filtering the call path through the key API list to obtain the key edge in (4 b) is to find a function belonging to the key API list in the call path, and if found, abstract a first node in an entry point of the call path and a last node in the function as a start point and an end point of the key edge, and add the key edge to a set of key edges in the abstract flow graph.

6. The method of claim 1, wherein the Intent sending edges, adjacent edges, ICC edges and implicit adjacent edges are extracted from the call trace graph in (4b) as follows:

(4b1) Performing a depth-first traversal from the first node of an entry point of the call trace graph, taking that first node as the starting point of an Intent sending edge; if a node ending with a call to an Intent-sending API is found during the traversal, stopping the traversal, taking that node as the end point of the Intent sending edge, and adding the edge to the Intent sending edge set of the abstract flow graph;

(4b2) Determining whether the abstract flow graph nodes generated in (4a) were split from the same call trace graph node:

If so, connecting the sequentially adjacent abstract flow graph nodes with directed edges, and adding the directed edges to the adjacent edge set of the abstract flow graph;

If not, not connecting them;

(4b3) Based on the functions performing inter-component communication (ICC) through an Intent and the components interacting with them obtained in step (3b), taking the node of such a function that ends with a call to an Intent-sending API as the starting point of an ICC edge in the abstract flow graph, taking the first node of the onCreate function executed when the component starts as the end point, and adding the ICC edge to the ICC edge set of the abstract flow graph;

(4b4) For any Intent sending edge generated in step (4b1), taking its starting point as the starting point of an implicit adjacent edge; if a node in the abstract flow graph simultaneously satisfies the following three conditions:

it is in the same component as the starting point of the Intent sending edge;

it executes after the starting point of the Intent sending edge;

it calls an API for receiving an Intent;

then that node is taken as the end point of the implicit adjacent edge, and the edge is added to the implicit adjacent edge set of the abstract flow graph.

7. The method of claim 1, wherein (4c) labels all nodes and edges of the abstract flow graph as follows:

(4c1) Obtaining the set of all opcodes from the official Android website, taking this set as the global opcode dictionary, and encoding the opcodes as numbers;

(4c2) Encoding the opcode sequence of each node of the abstract flow graph against the opcode dictionary, and taking the code values of all opcodes in the node as the label of the node;

(4c3) One-hot encoding the edge types to obtain the labels of the edges.

8. The method of claim 1, wherein training the GNN in (5) using a back propagation algorithm and a gradient descent method is accomplished by:

(5a) Labeling the abstract flow graph of the benign Android application program as benign, labeling the abstract flow graph of the malicious Android application program as malicious, and obtaining the abstract flow graph with the label;

(5b) Setting the maximum number of training epochs E of the GNN, randomly initializing the parameters of the GNN, and inputting the labeled abstract flow graphs into the GNN;

(5c) Calculating a loss function through the output of the GNN network and the label corresponding to the abstract flow graph, calculating gradient values of all parameters in the network from deep to shallow, and calculating an index for evaluating the GNN network performance, namely a harmonic mean F1 of the precision rate P and the recall rate R;

(5d) Along the opposite direction of the parameter gradient in the network, the parameter in the network is iteratively updated, so that the loss function is gradually reduced;

(5e) Repeating (5c)-(5d) until the maximum number of training epochs is reached, and selecting the network model with the best evaluation index F1 among the E epochs as the trained GNN model.