CN114595451A

CN114595451A - Graph convolution-based android malicious application classification method

Info

Publication number: CN114595451A
Application number: CN202210158644.8A
Authority: CN
Inventors: 林飞; 尹修恒; 易永波; 古元; 毛华阳; 华仲峰
Original assignee: Beijing Act Technology Development Co ltd
Current assignee: Beijing Act Technology Development Co ltd
Priority date: 2022-02-22
Filing date: 2022-02-22
Publication date: 2022-06-07

Abstract

A graph convolution-based Android malicious application classification method, relating to the technical field of information, comprises the following steps: 1) decompiling the APK file; 2) extracting the calling relation of the malicious code program blocks and the instruction distribution characteristics of the malicious code program blocks; 3) selecting key program blocks according to the importance degree associated with the malicious code program blocks; 4) performing dimensionality reduction and nonlinear transformation on the instruction distribution characteristics of the malicious code program blocks; 5) fusing the embedded characteristics of the calling relation graph of the malicious code program blocks with the transformed instruction distribution characteristics; 6) establishing a graph convolutional neural network model; 7) sorting and screening nodes; 8) classifying the malware into families with a convolutional neural network. The method constructs a graph from the calling relation and the program block characteristics, and establishes a graph neural network model that makes full use of the graph characteristics to identify and classify malicious applications, which effectively improves identification accuracy.

Description

Graph convolution-based android malicious application classification method

Technical Field

The invention relates to the technical field of information.

Background

Android is a free, open-source operating system whose development is led by Google and the Open Handset Alliance; it is mainly used on mobile devices such as smartphones and tablet computers. With the rapid development of the mobile internet, the number of Android terminal users has grown greatly, and malicious applications have gradually proliferated, including the well-known Trojans, viruses, ransomware, adware, spyware, and the like. Mobile malware detection has become a hot topic in the field of network security.

At present, malware detection mainly adopts static and dynamic methods. Static analysis mainly extracts features such as the permissions, API sequences, and code of an Android program, whereas dynamic analysis needs to simulate running the program on a real device and capture the behavior of the software as the basis for analysis. Compared with dynamic analysis, static analysis is faster and occupies fewer resources; it mainly comprises feature code-based detection methods and machine learning-based detection methods.

The feature code-based malicious software detection method mainly comprises the steps of extracting feature codes of target software to be detected, matching the feature codes with a known malicious software feature code recognition library, defining the target software as malicious software if matching is successful, and defining the target software as normal software if matching is not successful. Common feature codes mainly include digital signatures of android applications, common API functions and sensitive permissions of malicious software, and the like.

The machine learning-based detection method analyzes and extracts features of a program along different dimensions, represents each application as a multi-dimensional vector, and finally trains a machine learning classification algorithm on a training set, so that a classifier is constructed to predict whether an unknown sample is malicious. Such classification algorithms include support vector machines, random forests, neural networks, and the like. Common feature dimensions include permissions, components, APIs, and APP presentation information, among others.

The feature code-based identification method identifies malware by comparison against a malware feature code database. It is fast, accurate, and highly interpretable, but it depends strongly on the feature code database. As novel malware continuously emerges, the database must be continuously updated, which consumes a large amount of labor and time. Moreover, feature codes are easily changed by techniques such as obfuscation, allowing malware to evade detection.

The algorithm based on machine learning represents malicious software in a multi-dimensional vector mode through feature extraction, and then trains a classifier to recognize. The method can quickly find the variant application of known malicious families, and can deeply analyze the characteristics and screen important characteristics. However, the method only analyzes the static information of the programs and ignores the calling relation among the software programs.

The invention provides a graph convolution-based malicious application identification method: Dalvik bytecode is obtained by decompiling the APK file; program blocks are divided according to the Dalvik instruction execution sequence, each program block containing different instructions and calling relations existing among the program blocks; a graph convolutional neural network model is then established, and malicious applications are classified.

Prior Art

Dalvik is a virtual machine designed by Google specifically for the Android operating system and is deeply optimized. Dalvik bytecode has only two kinds of types: primitive types and reference types. Objects and arrays are both reference types, and type descriptors in Dalvik follow the same rules as descriptors in the JVM.

Dalvik bytecode has its own instruction set, similar to assembly language. A Dalvik instruction consists of an operation code and operands. The Dalvik instructions in a function can be divided into basic blocks according to their execution order, each basic block consisting of several Dalvik instructions; Dalvik bytecode has 244 different instructions. Smali files store Dalvik bytecode; Smali supports annotations, debugging information, and line number information, supports the basic features of Java, and is commonly used for reverse engineering Android programs.

The article "Android native code control flow graph extraction method based on symbolic execution", published in the Journal of Network and Information Security, Vol. 3, No. 7, July 2017, presents a method that executes Dalvik bytecode using a symbol substitution method in order to extract the program call graph and the instruction distribution characteristics within each program block.

Disclosure of Invention

In view of the defects of the prior art, the graph convolution-based android malicious application classification method provided by the invention comprises the following steps: 1) decompiling the APK file; 2) extracting the calling relation of the malicious code program block and the instruction distribution characteristics of the malicious code program block; 3) selecting a key program block according to the importance degree associated with the malicious code program block; 4) performing dimensionality reduction and nonlinear transformation on instruction distribution characteristics of the malicious code program block; 5) the embedded characteristic of the calling relation graph of the malicious code program block is fused with the instruction distribution characteristic of the transformed malicious code program block; 6) establishing a graph convolution neural network model; 7) sorting and screening nodes; 8) carrying out family classification on the malicious software by the convolutional neural network;

the graph convolution-based android malicious application classification method comprises the following specific implementation steps:

1) decompiling APK files

Decompiling the android application program by using an apktool tool to obtain Smali intermediate code;
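The decompilation step can be sketched as follows. This is a minimal illustration, assuming that the `apktool` command-line tool is installed and on the PATH; the function names `build_apktool_command` and `decompile_apk` are hypothetical and are not part of the described method.

```python
import subprocess
from pathlib import Path


def build_apktool_command(apk_path, out_dir):
    """Return the argument list for `apktool d`, which decodes an APK into Smali files.

    -o selects the output directory, -f overwrites it if it already exists.
    """
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def decompile_apk(apk_path, out_dir):
    """Run apktool (assumed to be installed) and list the Smali files it produced."""
    subprocess.run(build_apktool_command(apk_path, out_dir), check=True)
    return sorted(Path(out_dir).rglob("*.smali"))
```

Calling `decompile_apk("sample.apk", "sample_out")` would return the paths of the generated `.smali` files, which are the input to the following step.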

2) extracting calling relation of malicious code program block and instruction distribution characteristic of malicious code program block

The Smali file stores Dalvik byte codes, and a known malicious code library is utilized to compare Smali intermediate codes to find out malicious codes; executing a Dalvik byte code by using a symbol substitution method, thereby extracting a program calling relationship, defining the calling relationship between a malicious code program block and a non-malicious code program block, marking and numbering the malicious code program block, and marking and numbering the non-malicious code program block which has a direct calling relationship with a malicious code; marking and numbering non-malicious code program blocks with indirect calling relation with malicious codes; 244 different instructions exist in the Dalvik byte code, and the instruction distribution characteristics of the malicious code program block are determined according to the instructions in the malicious code program block;
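As an illustration of the instruction distribution characteristic, the sketch below counts opcode occurrences in the Smali lines of one program block. It assumes that the distribution is a vector of raw counts over a fixed opcode vocabulary; `OPCODES` is a hypothetical, truncated vocabulary (the method uses all 244 Dalvik instructions), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical, truncated vocabulary; the described method uses all 244 opcodes.
OPCODES = [
    "const-string", "invoke-virtual", "invoke-static",
    "move-result-object", "if-eqz", "goto", "return-void",
]


def opcode_of(line):
    """Return the opcode mnemonic of a Smali line, or None for
    blank lines, directives (.), labels (:) and comments (#)."""
    stripped = line.strip()
    if not stripped or stripped[0] in ".:#":
        return None
    return stripped.split()[0]


def instruction_distribution(block_lines, vocabulary=OPCODES):
    """Count how often each vocabulary opcode occurs in one program block."""
    counts = Counter(op for op in map(opcode_of, block_lines) if op is not None)
    return [counts.get(op, 0) for op in vocabulary]


block = [
    ".line 12",
    'const-string v0, "http://example"',
    "invoke-static {v0}, Lcom/x/Net;->send(Ljava/lang/String;)V",
    ":cond_0",
    "return-void",
]
print(instruction_distribution(block))
```

With the full vocabulary, the returned list would have 244 entries, one per Dalvik instruction.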

3) selecting a key program block according to the importance degree associated with the malicious code program block;

the importance of each program block to the malicious code family is calculated:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$；

$$IDF_i = \log\frac{M}{M_i}$$；

TF-IDF=TF*IDF；

wherein

$n_{i,j}$ is the number of occurrences of program block i in malicious family j,

$\sum_k n_{k,j}$ is the sum of the number of occurrences of all program blocks in the files of malicious family j,

$M$ is the total number of malicious applications,

$M_i$ is the number of malicious applications containing program block i;

after the importance of each program block to the malicious code family is calculated, the program blocks are ranked by importance; using this ranking as the threshold, the program blocks ranked outside the first three quarters are eliminated, and the remaining program blocks form the malicious code program block calling relation graph;

4) dimensionality reduction and nonlinear transformation of malicious code program block instruction distribution characteristics
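The importance calculation and the three-quarters cut can be sketched as follows. This is a minimal sketch under stated assumptions: the inverse document frequency uses the natural logarithm of (total applications / applications containing the block), every block is assumed to occur in at least one application, and the cut-off is rounded up. The function and variable names are hypothetical.

```python
import math


def tf_idf_scores(family_block_counts, app_blocks):
    """family_block_counts: dict block id -> occurrences within one malicious family.
    app_blocks: list of sets, one per malicious application, holding its block ids.
    Assumes every block in family_block_counts occurs in at least one application."""
    total = sum(family_block_counts.values())
    n_apps = len(app_blocks)
    scores = {}
    for block, count in family_block_counts.items():
        containing = sum(1 for blocks in app_blocks if block in blocks)
        tf = count / total
        idf = math.log(n_apps / containing)
        scores[block] = tf * idf
    return scores


def keep_top_three_quarters(scores):
    """Rank blocks by score (highest first) and keep the first three quarters."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = math.ceil(len(ranked) * 3 / 4)  # rounding up is an assumption
    return ranked[:keep]


family_counts = {"b1": 6, "b2": 3, "b3": 2, "b4": 1}
apps = [{"b1", "b2", "b4"}, {"b1", "b3", "b4"}, {"b1", "b4"}, {"b2", "b4"}]
scores = tf_idf_scores(family_counts, apps)
print(keep_top_three_quarters(scores))
```

In this toy data, block `b4` occurs in every application, so its inverse document frequency is zero and it is the block that gets eliminated.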

The Dalvik bytecode has 244 different instructions, and the obtained instruction distribution characteristics of the malicious code program block are 244-dimensional vector e_i；

Dimensionality reduction and nonlinear transformation are performed on the instruction distribution characteristics of the malicious code program block:

$$E_i = g(W e_i)$$；

wherein E_i is the transformed instruction distribution characteristic of the malicious code program block, with a vector dimension of 100; W is the transformation weight parameter matrix, obtained by model training; and g is an activation function that applies a nonlinear transformation to the characteristics to enhance the expressive capability of the model. The ReLU activation function is used, with the function expression: g(x) = max(0, x);
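The transformation from a 244-dimensional instruction vector to a 100-dimensional feature can be sketched as a matrix-vector product followed by ReLU. The sketch assumes no bias term, since the text mentions only the weight matrix W, and fills W with random values as a stand-in for trained parameters; the function names are hypothetical.

```python
import random


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def transform(W, e):
    """Compute ReLU(W e) for a weight matrix W (list of rows) and a vector e."""
    return [relu(sum(w * x for w, x in zip(row, e))) for row in W]


random.seed(0)
IN_DIM, OUT_DIM = 244, 100
# Random stand-in for the trained transformation weight matrix (100 x 244).
W = [[random.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

# Toy instruction distribution: a few opcodes occur, the rest are zero.
e_i = [0.0] * IN_DIM
e_i[3], e_i[17], e_i[110] = 5.0, 2.0, 1.0

E_i = transform(W, e_i)
print(len(E_i))
```

In a trained model, W would be learned jointly with the rest of the network rather than sampled at random.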

5) fusion of embedded characteristics of calling relation graph of malicious code program block and instruction distribution characteristics of transformed malicious code program block

The embedding of each node in the calling relation graph of the malicious code program blocks is represented by a multidimensional vector G_i, the graph embedding feature, whose vector dimension is the same as that of the instruction distribution transformation feature E_i; the fused feature is denoted by I_i,

wherein

，

the weight parameters are obtained through model training; the fused vectors I_i form the feature fusion matrix I, whose dimension is N x d, where N is the number of program block nodes and d is the node embedding dimension;
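For illustration only, the sketch below assumes that the fusion is a weighted element-wise sum of the graph embedding and the transformed instruction feature, with two scalar weights. The weighted-sum form and the names `alpha` and `beta` are assumptions, chosen because they keep the fused matrix at N x d when both inputs have dimension d; the source may use a different fusion rule.

```python
def fuse(G, E, alpha, beta):
    """Row-wise weighted sum: fused[i] = alpha * G[i] + beta * E[i].

    G and E are N x d matrices (lists of rows); alpha and beta stand in
    for trained weight parameters (hypothetical scalar form)."""
    if len(G) != len(E):
        raise ValueError("G and E must have the same number of nodes")
    return [
        [alpha * g + beta * e for g, e in zip(g_row, e_row)]
        for g_row, e_row in zip(G, E)
    ]


G = [[1.0, 2.0], [0.0, 4.0]]   # toy graph embeddings, N = 2, d = 2
E = [[3.0, 0.0], [2.0, 2.0]]   # toy transformed instruction features
I = fuse(G, E, alpha=0.5, beta=0.5)
print(I)
```

The result has the same shape as its inputs, which is consistent with the stated N x d dimension of the feature fusion matrix.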

6) establishing a graph convolution neural network model

Let the adjacency matrix of the calling relation graph of the malicious code program blocks be A, where A_ij = 1 indicates that a calling relationship exists between program block i and program block j, and A_ij = 0 indicates that no calling relationship exists; the degree matrix of the nodes of the calling relation graph is defined as D, where the degree of a node is the number of connections associated with that node;

for feature extraction of the graph, a multi-layer neural network structure is used; each layer is calculated using the mapping function

$$H^{(l+1)} = f(H^{(l)}, A)$$

wherein $H^{(l)}$ is the graph embedding representation of layer l, and $H^{(0)}$ is the initial node representation matrix of layer 0, namely the feature fusion matrix I; f is a nonlinear function; the graph convolution network can be regarded as a recursive computation of multiple graph convolutions, and the nonlinear function of each layer can be expressed as:

$$f(H^{(l)}, A) = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

wherein

$\hat{A} = A + E_N$, E_N is the N-dimensional identity matrix, which is added to the adjacency matrix to introduce self-connections;

$\hat{D}$ is the degree matrix (in-degree and out-degree) of all the program block nodes, $W^{(l)}$ is the weight matrix of layer l,

$\sigma$ is an activation function, the function expression is:

；
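One graph convolution layer can be sketched in plain Python as follows. The sketch assumes the commonly used symmetric normalization of the adjacency matrix with self-connections, a per-layer weight matrix `W`, and ReLU as the activation; these choices and all function names are assumptions for illustration, not a statement of the exact formula used in the method.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def normalized_adjacency(A):
    """Add self-connections, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(A, H, W):
    """One propagation step: ReLU(normalized adjacency * H * W)."""
    Z = matmul(matmul(normalized_adjacency(A), H), W)
    return [[max(0.0, z) for z in row] for row in Z]


A = [[0, 1], [1, 0]]              # two program blocks that call each other
H0 = [[1.0, 0.0], [0.0, 1.0]]     # toy initial node features
W0 = [[1.0, -1.0], [1.0, 1.0]]    # stand-in for trained layer weights
H1 = gcn_layer(A, H0, W0)
print(H1)
```

Stacking several such layers, each with its own weight matrix, gives the recursive computation described above; the output of one layer is the input of the next.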

7) sorting and screening nodes

The nodes are sorted according to the inner product values of their embedding vectors, and the first 50 nodes with similar values are selected;
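The text does not state which vector each node embedding is multiplied with. The sketch below assumes, purely for illustration, that each node is scored by the inner product of its embedding with the mean embedding of the graph, and that the k highest-scoring nodes are kept. The reference vector, the descending order, and the name `screen_nodes` are all hypothetical choices.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def screen_nodes(H, k=50):
    """Rank nodes by inner product with the mean embedding (an assumption)
    and return the indices of the first k nodes."""
    n = len(H)
    mean = [sum(col) / n for col in zip(*H)]
    order = sorted(range(n), key=lambda i: dot(H[i], mean), reverse=True)
    return order[:k]


H = [[1.0, 0.0], [3.0, 1.0], [0.0, 2.0], [2.0, 2.0]]  # toy node embeddings
selected = screen_nodes(H, k=2)
print(selected)
```

Keeping a fixed number of nodes gives every application a node embedding matrix of the same size, which is what a subsequent convolutional network needs as input.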

8) carrying out family classification on the malicious software by the convolutional neural network;

a convolutional neural network model is established; the embeddings of the 50 screened nodes are arranged into a matrix for convolution calculation, followed by a fully connected layer that classifies the malicious applications.
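A toy version of this classification stage is sketched below: one-dimensional convolution kernels slide over the rows of the node embedding matrix, each kernel output is max-pooled, and a fully connected layer with softmax yields family probabilities. The kernel shape, the pooling step, the softmax output, and all names and weights are assumptions for illustration; the text specifies only a convolution followed by a fully connected layer.

```python
import math


def conv1d(M, kernel, bias=0.0):
    """Slide a (width x d) kernel over the rows of M (k x d); ReLU on each output."""
    width = len(kernel)
    out = []
    for start in range(len(M) - width + 1):
        s = bias
        for r in range(width):
            s += sum(m * w for m, w in zip(M[start + r], kernel[r]))
        out.append(max(0.0, s))
    return out


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def classify(M, kernels, fc_weights):
    """kernels: list of (width x d) kernels; fc_weights: one row per family."""
    pooled = [max(conv1d(M, k)) for k in kernels]  # global max pooling per kernel
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in fc_weights]
    return softmax(logits)


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 3 screened nodes, d = 2 (toy size)
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
fc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]          # 3 hypothetical families
probs = classify(M, kernels, fc)
print(probs)
```

In the described method the input matrix would have 50 rows, one per screened node, and the kernels and fully connected weights would be obtained by training.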

Advantageous effects

Applications in the same malicious family always have many common features and behave similarly, with the vast majority of malware being derivative variants of known malicious families. According to the method, the program similarity of the malicious family is considered, the program block of the malicious code is extracted, the graph is constructed by the calling relation and the program block characteristics, the graph neural network model is established by fully utilizing the graph characteristics to identify the malicious application and classify the malicious application, and the identification accuracy is effectively improved.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Example one

Referring to fig. 1, the graph convolution-based android malicious application classification method provided by the invention comprises the following steps: s01 decompiling APK files; s02 extracting the calling relation of the malicious code program block and the instruction distribution characteristics of the malicious code program block; s03 selecting a key program block according to the importance degree associated with the malicious code program block; s04, performing dimensionality reduction and nonlinear transformation on the instruction distribution characteristics of the malicious code program block; s05 the embedded characteristic of the calling relation graph of the malicious code program block is fused with the instruction distribution characteristic of the transformed malicious code program block; s06, establishing a graph convolution neural network model; s07 sorting the screening nodes; s08, carrying out family classification on the malicious software by the convolutional neural network;

s01) decompiling APK files

Decompiling the android application program by using the apktool tool to obtain Smali intermediate code;

s02) extracting the calling relation of the malicious code program block and the instruction distribution characteristics of the malicious code program block

The Smali file stores Dalvik byte codes, and a known malicious code library is utilized to compare Smali intermediate codes to find out malicious codes; using a symbol substitution method to execute a Dalvik byte code, thereby extracting a program calling relationship, determining the calling relationship between a malicious code program block and a non-malicious code program block, marking and numbering the malicious code program block, and marking and numbering the non-malicious code program block which has a direct calling relationship with the malicious code; marking and numbering non-malicious code program blocks with indirect calling relation with malicious codes; 244 different instructions exist in the Dalvik byte code, and the instruction distribution characteristics of the malicious code program block are determined according to the instructions in the malicious code program block;

s03) selecting a key program block according to the importance degree associated with the malicious code program block;

the importance of each program block to the malicious code family is calculated:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$；

$$IDF_i = \log\frac{M}{M_i}$$；

TF-IDF=TF*IDF；

wherein

$n_{i,j}$ is the number of occurrences of program block i in malicious family j,

$\sum_k n_{k,j}$ is the sum of the number of occurrences of all program blocks in the files of malicious family j,

$M$ is the total number of malicious applications,

$M_i$ is the number of malicious applications containing program block i;

s04) performing reduced dimension and nonlinear transformation on instruction distribution characteristics of malicious code program blocks

The Dalvik byte code has 244 different instructions, and the obtained instruction distribution characteristic of the malicious code program block is 244-dimensional vector e_i；

$$E_i = g(W e_i)$$；

wherein E_i is the transformed instruction distribution characteristic of the malicious code program block, with a vector dimension of 100; W is the transformation weight parameter matrix, obtained by model training; and g is an activation function that applies a nonlinear transformation to the characteristics to enhance the expressive capability of the model. The ReLU activation function is used, with the function expression: g(x) = max(0, x);

s05) the embedded characteristic of the calling relation graph of the malicious code program block is fused with the instruction distribution characteristic of the transformed malicious code program block

wherein

，

s06) establishing a graph convolution neural network model

wherein

is the degree matrix (in-degree and out-degree) of all the program block nodes,

is an activation function, the function expression is:

；

s07) sorting and screening node

s08) carrying out family classification on the malicious software by the convolutional neural network;

a convolutional neural network model is established; the embeddings of the 50 screened nodes are arranged into a matrix for convolution calculation, followed by a fully connected layer that classifies the malicious applications.

Claims

1. The graph convolution-based android malicious application classification method is characterized by comprising the following steps: 1) decompiling the APK file; 2) extracting the calling relation of the malicious code program block and the instruction distribution characteristics of the malicious code program block; 3) selecting a key program block according to the importance degree associated with the malicious code program block; 4) performing dimensionality reduction and nonlinear transformation on the instruction distribution characteristics of the malicious code program block; 5) the embedded characteristic of the calling relation graph of the malicious code program block is fused with the instruction distribution characteristic of the transformed malicious code program block; 6) establishing a graph convolution neural network model; 7) sorting and screening nodes; 8) carrying out family classification on the malicious software by the convolutional neural network;

1) decompiling APK files

The Smali file stores the Dalvik byte code, and a known malicious code library is used for comparing the Smali intermediate code to find a malicious code; executing a Dalvik byte code by using a symbol substitution method, thereby extracting a program calling relationship, defining the calling relationship between a malicious code program block and a non-malicious code program block, marking and numbering the malicious code program block, and marking and numbering the non-malicious code program block which has a direct calling relationship with a malicious code; marking and numbering non-malicious code program blocks with indirect calling relation with malicious codes; 244 different instructions exist in the Dalvik byte code, and the instruction distribution characteristics of the malicious code program block are determined according to the instructions in the malicious code program block;

the importance of each program block to the malicious code family is calculated:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$；

$$IDF_i = \log\frac{M}{M_i}$$；

TF-IDF=TF*IDF；

wherein

$n_{i,j}$ is the number of occurrences of program block i in malicious family j,

$\sum_k n_{k,j}$ is the sum of the number of occurrences of all program blocks in the files of malicious family j,

$M$ is the total number of malicious applications,

$M_i$ is the number of malicious applications containing program block i;

；

in which

，

6) establishing a graph convolution neural network model

Let the adjacency matrix of the calling relation graph of the malicious code program blocks be A, where A_ij = 1 indicates that a calling relationship exists between program block i and program block j, and A_ij = 0 indicates that no calling relationship exists; the degree matrix of the nodes of the calling relation graph is defined as D, where the degree of a node is the number of connections associated with that node;

is the degree matrix (in-degree and out-degree) of all the program block nodes,

is an activation function, the function expression is:

；

7) sorting and screening nodes