CN113821799A - Multi-label classification method for malicious software based on graph convolution neural network - Google Patents

Multi-label classification method for malicious software based on graph convolution neural network

Info

Publication number
CN113821799A
Authority
CN
China
Prior art keywords: label, graph, function call, extracting, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111042100.7A
Other languages
Chinese (zh)
Other versions
CN113821799B (en)
Inventor
宋玉蓉
白敬华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111042100.7A priority Critical patent/CN113821799B/en
Publication of CN113821799A publication Critical patent/CN113821799A/en
Application granted granted Critical
Publication of CN113821799B publication Critical patent/CN113821799B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 - Detecting local intrusion or implementing counter-measures
    • G06F21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 - Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-label classification method for malicious software based on a graph convolution neural network. The classification model comprises the following steps: S100: extracting the features of the function call graph, disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample; S200: extracting the features of the multi-label relationship, and constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier; S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result; S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label. Compared with the prior art, the method achieves a good multi-label classification effect on malware carrying multiple labels.

Description

Multi-label classification method for malicious software based on graph convolution neural network
Technical Field
The invention relates to the technical field of malicious software detection, in particular to a multi-label classification method for malicious software based on a graph convolution neural network.
Background
As malware protection technology and malware continue to contend with each other, malware in the current network environment is no longer limited to a single type of attack behavior. For example, WannaCry, which broke out in 2017, is commonly classified as ransomware because it encrypts data to extort payment; yet beyond encrypting files for ransom, it also replicates and spreads across the network like a worm and disguises itself like a Trojan horse. In recent years, with the development of Graph Neural Networks (GNNs), remarkable performance has been achieved in extracting the relationships between entities, and many fields have begun to introduce graph neural networks into their research; the field of malware detection has likewise started to take the Control Flow Graph (CFG) and Function Call Graph (FCG) of binary files as entry points.
The Graph Convolutional Network (GCN) is a graph representation learning method. It is a natural generalization of the convolutional neural network to graph data, and it can perform end-to-end learning over node attribute information and topological structure information simultaneously.
In view of the above, it is necessary to provide a new multi-label malware classification method based on a graph convolution neural network to solve the above problems.
Disclosure of Invention
The object of the invention is to provide a multi-label classification method for malicious software based on a graph convolution neural network that achieves a good classification effect on malware carrying multiple labels.
To achieve the above object, the present invention provides a multi-label malware classification method based on a graph convolution neural network, which comprises the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
As a further improvement of the present invention, step S100 specifically includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through GCN training.
As a further improvement of the present invention, step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the operation codes (opcodes) in them as words, where each opcode corresponds to a word in a Natural Language Processing (NLP) task and each statement block corresponds to a sentence; the statement blocks are trained with a word vectorization (Word2Vec) model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r) is the matrix formed by stacking the statement-block vectors;
S13: the obtained function call graph is processed by the GCN model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism (SAGPool) also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the result of the graph convolution is globally pooled with SAGPool, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is then obtained through a masking operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
As a further improvement of the present invention, step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
As a further improvement of the present invention, step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B);
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
S23: semi-supervised learning is performed on the obtained label relationship graph G2 using the GCN, and the relationships between different nodes are mapped into vectors, wherein the target classifier to be learned is W ∈ R^(C×D) and the convolution formula is:
W^(l+1) = h(Â W^(l) Θ^(l))
where W^(l+1) is the multi-label classifier of layer l (with W^(0) = Z), Â is the normalized label correlation matrix, Θ^(l) is the learnable weight matrix of layer l, and h is a nonlinear activation function.
As a further improvement of the present invention, step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
As a further improvement of the present invention, the optimization objective of step S400 is:
the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
The beneficial effects of the invention are as follows: through analysis of the function call graph constructed from the assembly code of the binary file, the invention extracts the semantic information of each block in the function call graph with Word2Vec and extracts the structural information of the function call graph with a GCN, obtaining a semantic and graph-structure embedding of each binary file that effectively reflects the operations the binary performs at run time. Second, a multi-label classifier is constructed through multi-label classification by establishing the relationships among different labels, which fully accounts for the fact that malware may exhibit multiple types of behaviors, analyzes malware behavior comprehensively, and yields a good classification effect on malware carrying multiple labels.
Drawings
FIG. 1 is a flowchart of the multi-label malware classification method based on the graph convolution neural network of the present invention.
FIG. 2 is a flowchart of the function call graph feature extraction in FIG. 1.
FIG. 3 is a flowchart of obtaining the multi-label classifier in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure.
Referring to FIGS. 1-3, the multi-label malware classification method based on a graph convolution neural network of the present invention comprises the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
Specifically, step S100 includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through GCN training.
Further, step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the opcodes in them as words, where each opcode corresponds to a word in an NLP task and each statement block corresponds to a sentence; the statement blocks are trained with a Word2Vec model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r) is the matrix formed by stacking the statement-block vectors;
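A minimal sketch of step S12 follows, assuming the gensim implementation of Word2Vec and the block_opcodes mapping from the previous sketch; the vector size of 64 and the mean pooling of opcode vectors into one block vector are illustrative choices rather than values fixed by the patent.

    # Hypothetical sketch of step S12: embed opcode "sentences" with Word2Vec.
    import numpy as np
    from gensim.models import Word2Vec

    # Each statement block is a sentence whose words are opcodes, e.g. ["push", "mov", "call"].
    sentences = list(block_opcodes.values())
    w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, workers=4)

    # One r-dimensional vector per statement block (here the mean of its opcode vectors),
    # stacked into the node feature matrix H^(0) of shape (n, r).
    def block_vector(opcodes, model, dim=64):
        vecs = [model.wv[op] for op in opcodes if op in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    H0 = np.stack([block_vector(ops, w2v) for ops in block_opcodes.values()])
    print(H0.shape)   # (number of statement blocks n, embedding dimension r)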
S13: the obtained function call graph is processed by the GCN model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism (SAGPool) also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the result of the graph convolution is globally pooled with SAGPool, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is then obtained through a masking operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
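The sketch below illustrates steps S13 and S14 with PyTorch Geometric: two Kipf graph-convolution layers, SAGPool node selection, and a readout that yields the graph embedding vector. The layer sizes, the pooling ratio of 0.5, and the concatenated mean/max readout are assumptions made for illustration and are not fixed by the patent.

    # Hypothetical sketch of steps S13-S14: GCN + SAGPool + readout for the function call graph.
    import torch
    from torch_geometric.nn import GCNConv, SAGPooling, global_mean_pool, global_max_pool

    class FCGEncoder(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, embed_dim, ratio=0.5):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)        # formula (2): Kipf graph convolution
            self.conv2 = GCNConv(hidden_dim, hidden_dim)
            self.pool = SAGPooling(hidden_dim, ratio=ratio) # self-attention pooling keeps the top-scoring nodes
            self.lin = torch.nn.Linear(2 * hidden_dim, embed_dim)

        def forward(self, x, edge_index, batch):
            h = torch.relu(self.conv1(x, edge_index))
            h = torch.relu(self.conv2(h, edge_index))
            h, edge_index, _, batch, _, _ = self.pool(h, edge_index, batch=batch)
            # Readout: concatenate mean and max pooling over the retained nodes.
            g = torch.cat([global_mean_pool(h, batch), global_max_pool(h, batch)], dim=1)
            return self.lin(g)                              # graph embedding vector X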
In the present application, step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
Further, step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B).
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding.
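A minimal sketch of steps S21 and S22 follows; the toy label matrix, the example label names in the comments, and the guard against division by zero are illustrative assumptions.

    # Hypothetical sketch of steps S21-S22: build A_ij = p(L_i | L_j) from multi-hot labels.
    import numpy as np

    # labels: shape (num_samples, C); labels[s, i] = 1 if sample s carries label i.
    labels = np.array([
        [1, 1, 0],   # e.g. ransomware + worm
        [1, 0, 1],   # e.g. ransomware + trojan
        [0, 1, 1],
        [1, 1, 1],
    ])

    count_joint = labels.T @ labels                 # count_joint[i, j]: samples carrying both label i and label j
    count_single = np.diag(count_joint)             # count_single[j]: samples carrying label j
    A = count_joint / np.maximum(count_single, 1)   # A[i, j] = p(label i | label j)

    Z = np.eye(labels.shape[1])                     # one-hot node features of the label relationship graph
    print(np.round(A, 2))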
S23: label relationship graph obtained using GCN pairs
Figure BDA0003249704780000085
Semi-supervised learning is carried out, different node relations are mapped into one vector,
wherein the object classifier of the learned function is
Figure BDA0003249704780000086
The convolution formula is:
Figure BDA0003249704780000087
wherein the content of the first and second substances,
Figure BDA0003249704780000088
namely a multi-label classifier of the layer l.
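The following sketch illustrates step S23: two graph-convolution layers over the label relationship graph turn the one-hot label nodes Z into a classifier matrix W of shape (C, D), where D matches the dimension of the graph embedding vector. The two-layer depth, the LeakyReLU activation, and the hidden size are assumptions for illustration.

    # Hypothetical sketch of step S23: W^(l+1) = h(A_hat @ W^(l) @ Theta^(l)) over the label graph.
    import torch

    class LabelGCN(torch.nn.Module):
        def __init__(self, num_labels, hidden_dim, embed_dim):
            super().__init__()
            self.theta1 = torch.nn.Linear(num_labels, hidden_dim, bias=False)  # Theta^(0)
            self.theta2 = torch.nn.Linear(hidden_dim, embed_dim, bias=False)   # Theta^(1)

        def forward(self, Z, A_hat):
            # Z: one-hot label features, shape (C, C); A_hat: normalized correlation matrix, shape (C, C).
            h = torch.nn.functional.leaky_relu(A_hat @ self.theta1(Z))
            return A_hat @ self.theta2(h)    # multi-label classifier W, shape (C, D)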
Step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
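A short sketch of step S3, assuming X holds a batch of D-dimensional graph embeddings and W is the (C, D) classifier matrix produced by the label graph convolution:

    # Hypothetical sketch of step S3: per-label scores via dot product, then Sigmoid.
    import torch

    def classify(X, W):
        scores = X @ W.t()               # (batch, C) multi-label classification scores
        return torch.sigmoid(scores)     # each score mapped into [0, 1]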
The optimization objective of step S400 is as follows: the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
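The loss above is the sum of per-label binary cross-entropy terms; the sketch below realizes it with PyTorch's BCELoss, with toy label and score tensors used purely for illustration.

    # Hypothetical sketch of step S400: sum of per-label binary classification losses.
    import torch

    criterion = torch.nn.BCELoss(reduction="sum")

    y_true = torch.tensor([[1., 0., 1.]])                          # ground-truth multi-hot labels (C = 3)
    y_pred = torch.tensor([[0.9, 0.2, 0.7]], requires_grad=True)   # Sigmoid scores from step S3
    loss = criterion(y_pred, y_true)
    loss.backward()                                                # drives training of both GCN branches
    print(loss.item())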
In summary, through analysis of the function call graph constructed from the assembly code of the binary file, the invention extracts the semantic information of each block in the function call graph with Word2Vec and then extracts the structural information of the function call graph with the GCN, obtaining a semantic and graph-structure embedding of each binary file that effectively reflects the operations the binary performs at run time. Second, a multi-label classifier is constructed through multi-label classification by establishing the relationships among different labels, which fully accounts for the fact that malware may exhibit multiple types of behaviors, analyzes malware behavior comprehensively, and yields a good classification effect on malware carrying multiple labels.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (7)

1. A multi-label classification method for malicious software based on a graph convolution neural network, characterized by comprising the following steps:
S100: extracting the features of the function call graph: disassembling the original binary file to obtain assembly code, and extracting the semantic and structural features of the function call graph to obtain the graph embedding vector of the sample;
S200: extracting the features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph to obtain a multi-label classifier;
S300: computing the dot product of the graph embedding vector and the multi-label classifier, and mapping the result of the dot product to obtain the classification result;
S400: constructing a multi-label loss function, and computing the loss value of the classification model from the difference between the classification result and the true result of each label.
2. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S100 specifically includes:
S1: extracting the semantic and structural features of the function call graph: disassembling the original binary file to obtain assembly code, and constructing a model for extracting the semantic and structural features of the function call graph, wherein the function call graph input to the model is represented as G1 = (V1, ε1, H^(0)), where V1 denotes the set of statement blocks of the function call graph, ε1 denotes the set of edges connecting the statement blocks of the function call graph, and H^(0) denotes the matrix of statement-block semantic features obtained by word embedding, expressed as H^(0) ∈ R^(n×r), where n is the number of statement blocks of the function call graph and r is the vector dimension of each statement block after word embedding;
the graph embedding vector X of the function call graph is then obtained through graph convolution neural network training.
3. The multi-label malware classification method based on the graph convolution neural network according to claim 2, wherein step S1 specifically includes:
S11: obtaining the function call graph of the binary file: disassembling the original binary file to obtain assembly code, and analyzing the jump relationships in the assembly code to obtain the function call graph of the binary file G1 = (V1, ε1), where V1 is the set of statement-block nodes in the function call graph and ε1 is the set of edges connecting the statement blocks in the function call graph;
S12: counting all the statement blocks of the function call graph and extracting the operation codes in them as words, where each operation code corresponds to a word in a natural language processing task and each statement block corresponds to a sentence; the statement blocks are trained with a word vectorization model to obtain the vector of each statement block, so that the function call graph is further expressed as G1 = (V1, ε1, H^(0)), where H^(0) ∈ R^(n×r);
S13: the obtained function call graph is processed by the graph convolution neural network model of formula (1) to update the node information after each convolution layer:
H^(l+1) = f(H^(l), A)   (1)
wherein the Kipf graph convolution, which the self-attention graph pooling mechanism also uses to compute attention scores, gives formula (2):
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))   (2)
where σ is a nonlinear activation function, Ã = A + I is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã, and W^(l) denotes the learnable weight of each layer;
the nodes after the graph convolution are globally pooled with the self-attention graph pooling mechanism, and node selection is performed as idx = top-rank(s, ⌈kN⌉), where s is the self-attention score of each node computed with the graph convolution of formula (2), the pooling ratio k ∈ (0, 1) is a hyper-parameter that determines the number of nodes to retain, and N is the number of nodes; the globally pooled result is obtained through a mask operation;
S14: a readout operation is applied to the globally pooled result to obtain the graph embedding vector X.
4. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S200 specifically includes:
S2: extracting the behavioral features of the multi-label relationship: constructing a model for extracting label relationships from a label relationship graph, wherein the input of the model is the label relationship graph, represented as G2 = (V2, ε2, Z), where V2 denotes the set of all labels of the samples, ε2 denotes the set of edges representing the label relationships, and Z denotes the matrix of node vectors obtained by one-hot encoding, expressed as Z ∈ R^(C×d), where C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
after multi-label training, the multi-label classifier is obtained, represented as W ∈ R^(C×D), where D is the dimension of the graph embedding vector.
5. The multi-label malware classification method based on the graph convolution neural network according to claim 4, wherein step S2 specifically includes:
S21: counting the labels of the samples to obtain the conditional probability and the joint probability of each label, and obtaining the probability between different labels according to the formula p(A|B) = p(A, B)/p(B);
S22: constructing the correlation coefficient matrix A_ij:
A_ij = p(L_i | L_j) = p(L_i, L_j)/p(L_j)
where A_ij represents the probability that label i occurs when label j is present;
each label is then one-hot encoded to obtain the label relationship graph G2 = (V2, ε2, Z), where Z denotes the matrix of node vectors obtained by one-hot encoding, Z ∈ R^(C×d), and C and d respectively denote the number of label categories and the dimension of each label vector after encoding;
S23: semi-supervised learning is performed on the obtained label relationship graph G2 using the graph convolution neural network, and the relationships between different nodes are mapped into vectors, wherein the target classifier to be learned is W ∈ R^(C×D) and the convolution formula is:
W^(l+1) = h(Â W^(l) Θ^(l))
where W^(l+1) is the multi-label classifier of layer l (with W^(0) = Z), Â is the normalized label correlation matrix, Θ^(l) is the learnable weight matrix of layer l, and h is a nonlinear activation function.
6. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein step S300 specifically includes:
S3: a dot-product operation is performed between the graph embedding vector X and the multi-label classifier W, expressed as W·X, to obtain the multi-label classification scores, and the final classification result Ŷ is then obtained through a nonlinear operation:
Ŷ = Sigmoid(W·X)
where Ŷ represents the multi-class prediction of the sample after training, and the Sigmoid activation function maps the multi-class scores into the interval [0, 1].
7. The multi-label malware classification method based on the graph convolution neural network according to claim 1, wherein the optimization objective of step S400 is:
the label of each sample is Y = {y^1, y^2, …, y^C},
where y^c ∈ {0, 1}; 0 indicates that the sample does not have the corresponding behavioral characteristic and 1 indicates that it does;
the loss function of the model is the sum of multiple binary classification losses, expressed as:
L = -Σ_{c=1}^{C} [ y^c·log(ŷ^c) + (1 - y^c)·log(1 - ŷ^c) ]
and the loss value of the model is computed from the difference between the classification result of each label and its true value.
CN202111042100.7A 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network Active CN113821799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111042100.7A CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111042100.7A CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN113821799A true CN113821799A (en) 2021-12-21
CN113821799B CN113821799B (en) 2023-07-28

Family

ID=78921925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111042100.7A Active CN113821799B (en) 2021-09-07 2021-09-07 Malicious software multi-label classification method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113821799B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN114640502A (en) * 2022-02-17 2022-06-17 南京航空航天大学 Android malicious software detection method and detection system based on traffic fingerprint and graph data characteristics
CN115758370A (en) * 2022-09-09 2023-03-07 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network
US20210271822A1 (en) * 2020-02-28 2021-09-02 Vingroup Joint Stock Company Encoder, system and method for metaphor detection in natural language processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210271822A1 (en) * 2020-02-28 2021-09-02 Vingroup Joint Stock Company Encoder, system and method for metaphor detection in natural language processing
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112966271A (en) * 2021-03-18 2021-06-15 中山大学 Malicious software detection method based on graph convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xinyan et al., "A Weighted Graph Convolutional Neural Network Method for Detecting Rumors on Sina Weibo", Journal of Chinese Computer Systems (小型微型计算机系统), vol. 42, no. 8 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114640502A (en) * 2022-02-17 2022-06-17 南京航空航天大学 Android malicious software detection method and detection system based on traffic fingerprint and graph data characteristics
CN114373484A (en) * 2022-03-22 2022-04-19 南京邮电大学 Voice-driven small sample learning method for Parkinson disease multi-symptom characteristic parameters
CN115758370A (en) * 2022-09-09 2023-03-07 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN115758370B (en) * 2022-09-09 2024-06-25 中国人民解放军军事科学院系统工程研究院 Software source code defect detection method, device and storage medium
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method
CN117610002B (en) * 2024-01-22 2024-04-30 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Also Published As

Publication number Publication date
CN113821799B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN113821799B (en) Malicious software multi-label classification method based on graph convolution neural network
CN111371806B (en) Web attack detection method and device
CN110084296B (en) Graph representation learning framework based on specific semantics and multi-label classification method thereof
Hoogeboom et al. Argmax flows and multinomial diffusion: Learning categorical distributions
CN111274134B (en) Vulnerability identification and prediction method, system, computer equipment and storage medium based on graph neural network
Torralba et al. Contextual models for object detection using boosted random fields
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN113535964B (en) Enterprise classification model intelligent construction method, device, equipment and medium
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN107908757B (en) Website classification method and system
CN115344863A (en) Malicious software rapid detection method based on graph neural network
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114329474A (en) Malicious software detection method integrating machine learning and deep learning
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN113378178A (en) Deep learning-based graph confidence learning software vulnerability detection method
CN113392929B (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN115310019A (en) Webpage classification method and device, electronic equipment and storage medium
CN113343235B (en) Application layer malicious effective load detection method, system, device and medium based on Transformer
CN114817516A (en) Sketch mapping method, device and medium based on reverse matching under zero sample condition
CN114742572A (en) Abnormal flow identification method and device, storage medium and electronic device
CN112800435A (en) SQL injection detection method based on deep learning
CN113971282A (en) AI model-based malicious application program detection method and equipment
CN112770323A (en) Mobile malicious application family classification method based on network traffic space time characteristics

Legal Events

Code - Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant