CN112163219A

CN112163219A - Malicious program identification and classification method based on word embedding and GCN

Info

Publication number: CN112163219A
Application number: CN202010875751.3A
Authority: CN
Inventors: 张小明; 周博文; 张骞允; 程磊
Original assignee: Beihang University; Shenzhen Research Institute of Big Data SRIBD
Current assignee: Beihang University; Shenzhen Research Institute of Big Data SRIBD
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2021-01-01

Abstract

The invention discloses a malicious program identification and classification method based on word embedding and a graph convolutional network (GCN), comprising the following steps: 1: decompiling the sample; 2: constructing a function graph; 3: constructing an instruction corpus; 4: embedding instruction words; 5: extracting function features; 6: convolving a single graph with the GCN; 7: expanding the dimensionality of the convolution result; 8: classifying the application programs. The invention maintains classification accuracy, enhances the robustness of the classification scheme, provides a new approach to application program classification, and improves computational time efficiency. The method therefore has considerable practical significance and good prospects for wider adoption.

Description

Malicious program identification and classification method based on word embedding and GCN

Technical Field

The invention belongs to the field of network security, and relates to a malicious program identification and classification method which can realize the basic requirements of malicious program identification.

Background

With the continuous advancement of computer technology and the growth of the internet, people's lives have become increasingly inseparable from it, and more and more work and business is conducted online. While the internet has made life more convenient, the number of cyber attacks is also growing at an alarming rate. By creating malicious programs and using other means, an attacker can intrude on and steal a user's private data, steal the user's files, extort the user, and so on. With the development of the Internet of Things, many small devices are designed only for specific tasks and are therefore small in size and weak in computing power, which lowers their ability to resist malicious programs. If malicious programs can be detected effectively, most network attacks can be blocked.

Conventionally, there are two methods of analyzing an application program: static analysis and dynamic analysis. Both follow the same workflow: first collect a sample, then analyze it, and finally propose a defense scheme based on the analysis result. However, because the analysis is complicated, a long time passes between collecting a sample and deploying the resulting defense in practice, so defensive measures lag far behind attacks. The development of machine learning now offers new ideas for application program analysis and malicious program detection. Machine learning has strong learning capability and offers clear advantages in extracting sample features and detecting specific targets. Moreover, current sample sets are generally very large, which puts great pressure on traditional feature extraction methods and may cause features to be missed. Machine learning does not share these shortcomings: its strong fitting capability can uncover features in the data that are not easily noticed, and its analysis and classification results improve as the volume of data grows. At the same time, the generalization ability of machine learning methods gives them a degree of predictive power, so that, based on existing samples, they can resist new threats that may emerge in the near future. This makes it possible to remedy the lag of conventional methods in malware analysis.

Disclosure of Invention

The invention aims to identify and classify malicious programs using a new machine learning approach, namely word embedding combined with a graph convolutional network (GCN), so as to improve the accuracy and efficiency of identification and classification.

To achieve this purpose, the following technical scheme is adopted:

A malicious program identification and classification method based on word embedding and GCN comprises the following modules:

the graph building module, which extracts the assembly instructions and function call relationships of the application program to prepare data for the subsequent modules, and builds each single application program into a graph;

the function feature extraction module, which extracts the features of each function using a word embedding model;

the application program feature extraction module, which extracts the features of the application program, namely the convolution result of the graph;

and the application program classification module, which takes the convolution result of the graph as the basis for classification and completes the classification task according to the classification requirements of the data.

The specific steps in the graph building module are as follows:

after a program sample is collected, the function call relationships of the application program are extracted through decompilation, and the actual behavior of each function is expressed as assembly instructions;

the functions of the application program are taken as nodes and the call relationships among the functions as edges to construct an undirected graph; the instructions of each function are concatenated into a character string that serves as the function's identity, so that different nodes can be distinguished and the string can identify the function across the whole data set; each function also has a corresponding number within its single application program, and the mapping between a function's two numbers is retained;

the assembly instructions of each function are treated as words and joined into a sentence, the sentence is used as the attribute of the node, and all sentences are collected into one set that serves as the input data of the function feature extraction module.

The function feature extraction module comprises the following steps:

word embedding is performed using a word embedding model: all instructions are one-hot encoded; training data is generated by sliding a window of suitable length over the sentences formed from the instructions; a neural network with a suitable hidden layer and output layer is constructed according to the model; and the parameters are adjusted through back propagation and gradient descent to obtain the embedding of each instruction word;

the embeddings are concatenated in order to obtain the attribute of each node in the low-dimensional space, and the order information of each node's instruction sequence is encoded into the node attribute using an LSTM network; nodes with shorter instruction sequences can be padded with zero vectors; finally, a fully connected layer is added at the end of the LSTM network to compress the data to one dimension.

The application program feature extraction module comprises the following steps:

the graph is convolved by the GCN to extract the features of its topological structure; according to the global numbers of the graph's nodes, the corresponding rows and columns of the GCN convolution kernel are selected to form a small-scale convolution kernel for the convolution operation;

the convolution result is a vector whose size depends on the number of nodes in the single graph, and this vector is expanded to the size of the total number of nodes; the expansion follows the mapping of node numbers: nodes not included in the graph are set to zero and the values of included nodes are unchanged; in the end the convolution results of all graphs are vectors of the same size.

While ensuring that malicious programs can still be identified normally, the invention improves identification accuracy, reduces time complexity to a certain extent, and improves computational efficiency.

Drawings

FIG. 1 is a general block diagram of the system of the present invention. The composition and the function of each module in the figure, and the data flow are described in the technical scheme;

FIG. 2 is a flowchart of function feature extraction in module two of the present invention;

FIG. 3 is a diagram illustrating the steps of the present invention;

FIG. 4 is a flowchart illustrating exemplary steps in an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

By treating the application program as a graph, the invention exploits the rich topological information of graphs, which the GCN can extract; through word embedding, the relationships between instructions are converted into quantifiable data; through the LSTM (long short-term memory) network, a type of RNN (recurrent neural network), the order information of the instructions is encoded into the features. By offering an approach that differs from both traditional methods and existing machine learning methods, the invention improves the accuracy of identifying and classifying malicious programs and improves computational efficiency, and it has considerable practical significance both for guiding the direction of the industry and in practical use.

The specific scheme is as follows:

Four techniques are used in the scheme of the invention:

1) Decompilation: the application program to be analyzed is reverse-engineered to obtain its code structure and specific content, which facilitates the subsequent steps.

2) Word embedding: word embedding turns words, which a machine cannot process directly, into low-dimensional vectors for use by other machine learning models. The simplest representation of words in natural language is one-hot encoding, but it consumes too much space, so embedding is used to convert it into low-dimensional vectors. The "words" in this disclosure are assembly instructions.

3) LSTM: the long short-term memory (LSTM) network is a special RNN that mitigates the vanishing and exploding gradient problems that arise when training on long sequences. LSTM performs better than an ordinary RNN on longer sequences. Its main role here is to extract features from the instruction sequence and further compress the data dimensions.

4) GCN: a graph convolutional network (GCN) uses the Laplacian operator on a graph together with the classical convolution theorem to construct a graph convolution based on the adjacency matrix, so that features in a graph with non-Euclidean structure can be extracted through optimization.

The malicious program identification and classification method of the invention comprises four modules (graph construction, function feature extraction, application program feature extraction, and application program classification), implemented in 8 steps.

The invention provides a malicious program identification and classification method, which comprises the following implementation steps:

Module one: graph construction. This module prepares data for the subsequent modules. An application program sample cannot be fed directly into a machine learning model for feature extraction. The module extracts the assembly instructions and function call relationships of the application program through decompilation, and then builds each single application program into a graph;

Step 1: after a program sample is collected, the function call relationships of the application program are extracted through decompilation, and the actual behavior of each function is expressed as assembly instructions;

Step 2: the functions of the application program are taken as nodes and the call relationships among the functions as edges to construct an undirected graph. Each node should have a unique number reflecting the uniqueness of its function; specifically, the instructions of the function are concatenated into a character string that serves as the function's identity, which distinguishes different nodes and identifies the function across the whole data set. In addition, for the convenience of GCN processing, functions are also numbered within each single application program. The mapping between a function's two numbers is retained;

Step 3: the assembly instructions of each function are treated as words and joined into a sentence, the sentence is used as the attribute of the node, and all sentences are collected into one set that serves as the input data of the function feature extraction module;

Module two: function feature extraction. After the graph is constructed, each node in the graph has an attribute, namely the sentence formed by joining its assembly instructions as words. The order of the instructions cannot be shuffled, which means the sentence carries information about what the function does, and features can be extracted from it. An effective way to extract information from string data is word embedding, in which the one-hot representation is embedded into a low-dimensional space while the semantic information of the words is preserved. To facilitate the subsequent modules, features are further extracted and the data dimensions compressed through the LSTM;

Step 4: word embedding is performed using a skip-gram model. All instructions are one-hot encoded, and training data is then generated by sliding a window of suitable length over the sentences formed from the instructions. A neural network with a suitable hidden layer and output layer is constructed according to the skip-gram model, and the parameters are adjusted through back propagation and gradient descent to obtain the embedding of each instruction word;

Step 5: the embeddings are concatenated in order to obtain the attribute of each node in the low-dimensional space. Since instructions are order-sensitive data, the order information of each node's instruction sequence must be further encoded into the node attribute, which is done with an LSTM network. Nodes with shorter instruction sequences can be padded with zero vectors. Finally, a fully connected layer is added at the end of the LSTM network to compress the data to one dimension;

Module three: application program feature extraction. After the instruction features are extracted, the features of the application program are extracted from them. These application program features are needed for the subsequent application program classification;

Step 6: convolving the graph with the GCN extracts the features of the graph's topology. A single graph does not contain all nodes, so convolving a single graph with the GCN need not be performed at global scale. According to the global numbers of the graph's nodes, the corresponding rows and columns of the GCN convolution kernel are selected to form a small-scale convolution kernel for the convolution operation;

Step 7: the convolution result is a vector whose size depends on the number of nodes in the single graph, and it is expanded to the size of the total number of nodes. The expansion follows the mapping of node numbers: nodes not included in the graph are set to zero, and the values of nodes included in the graph are unchanged. In the end the convolution results of all graphs are vectors of the same size;

Module four: application program classification. The features extracted from an application program are simple vectors, and the classification task is a multi-class classification problem that can be solved with general classification methods;

Step 8: the convolution result of the graph obtained in step 7 is used as the basis for classification, and the classification task can be completed with methods such as a support vector machine or linear regression, according to the classification requirements of the data;

Throughout the scheme of the invention, the "graph" corresponds to an application program; the "nodes of the graph" correspond to the functions of the application program; and the "edges of the graph" correspond to the call relationships between those functions;

the process described in steps 4 and 5 in effect constructs a mapping that maps all instructions of a single function to a rational number. Owing to the characteristics of the deep learning model, the global characteristics of the instructions are preserved in the mapping, and the instruction features of the single function are also integrated. The mapping result preserves the functional characteristics of the function well;

the skip-gram model in step 4 can be replaced by a CBOW model or another word embedding model without affecting the novelty of the present invention. The invention's aim here is to propose extracting function features by means of word embedding, which is one of its innovations;

the convolution kernel reduction of step 6 and the dimension expansion of the convolution result in step 7 are not mandatory, but the two steps must be adopted together or omitted together. In general, a large sample set contains a large number of functions, and skipping step 6 often leads to a shortage of computing resources and wasted computing time, so the scheme does not recommend omitting these two steps;

the classification method in step 8 must be chosen according to the classes required by the classification task; further issues such as imbalanced samples are not considered in the scheme of the invention;

the depth and parameters of the neural networks in all steps must be set according to the actual data set, so they cannot be described in more detail here;

the invention is titled "a malicious program identification and classification method based on word embedding and GCN", while the steps use the expression "application program classification"; the two are not contradictory. Because the actual samples also contain normal programs, it would be unreasonable to describe every application program as a "malicious program" in the step descriptions.

Through the above steps, while maintaining classification accuracy, the scheme enhances its robustness by introducing several machine learning models; it offers a new approach to application program classification by introducing word vectors and GCN; and it improves time efficiency by extracting only the relevant part of the GCN convolution kernel. The method therefore has considerable practical significance and good prospects for wider adoption.

Specifically, as shown in FIG. 4, the data set is composed of several application programs. Consider one of them, shown in the figure, and assume it consists of 6 functions numbered 1-6, with the specific contents (instructions) of each function written in the figure. Each node in the graph represents a function, and an edge between two nodes indicates that a call relationship exists between the corresponding functions. The eight steps of the four modules described above are explained below on the basis of this figure.

Module one: graph construction

Step 1: after a program sample is collected, the function call relationships of the application program are extracted through decompilation, and the actual behavior of each function is expressed as assembly instructions, as shown in the figure;

Step 2: the undirected graph is constructed as shown in the figure. Specifically, the instructions of each function are concatenated into a string that serves as the function's identity and distinguishes different nodes; for example, the identity of node 1 is the string "movsubstjmp". The identity is the global identifier of the function (node) and is used to build a mapping table for functions across the whole data set. Within each application program (graph), functions (nodes) have local numbers; the nodes in the figure are numbered 1-6. Each function (node) also has a global number that identifies it across the whole data set; global numbers are assigned in the order in which functions first appear in the data set. For example, the global numbers of nodes 1-6 in the figure may be 1, 13, 145, 670, 678 and 1029, shown in parentheses in the figure. The mapping between the two numbers is retained, e.g. [1,1], [2,13], [3,145], [4,670], [5,678], [6,1029];

Step 3: the assembly instructions of each node are treated as words and joined into a sentence that serves as the node attribute, and all sentences are collected into one set that serves as the corpus. The sentences contributed to the corpus by the figure are ["MOV SUB JMP", "STM LDR LDR CMP", "ADD SUM", "JMP NJMP", "ADD B", "MOV"]; the corpus takes the form of this list of sentences;
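As a rough illustration of steps 2 and 3, the following Python sketch builds the sentences, the local-to-global number mapping, and the adjacency matrix for one application program. The instruction lists are the six sentences quoted above, and the edge list is read off the adjacency matrix given in step 6. The names `GlobalIndex` and `build_graph` are hypothetical, and forming the identity as the lowercase concatenation of the instructions is an assumption about the exact string format.

```python
from typing import Dict, List, Tuple


class GlobalIndex:
    """Hypothetical helper: numbers functions across the data set in order of first appearance."""

    def __init__(self) -> None:
        self.ids: Dict[str, int] = {}

    def lookup(self, identity: str) -> int:
        # A function seen before keeps its number; a new one gets the next free number.
        if identity not in self.ids:
            self.ids[identity] = len(self.ids) + 1
        return self.ids[identity]


def build_graph(
    functions: List[List[str]],
    calls: List[Tuple[int, int]],
    index: GlobalIndex,
) -> Tuple[List[str], List[Tuple[int, int]], List[List[int]]]:
    """functions: one instruction list per function, in local order 1..n.
    calls: pairs of local numbers that have a call relationship."""
    sentences = [" ".join(instrs) for instrs in functions]           # node attributes / corpus entries
    identities = ["".join(instrs).lower() for instrs in functions]   # assumed identity format
    mapping = [(local, index.lookup(ident))
               for local, ident in enumerate(identities, start=1)]
    n = len(functions)
    # The adjacency matrix in the worked example has ones on the diagonal, so self-loops are kept.
    adjacency = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in calls:
        adjacency[a - 1][b - 1] = 1
        adjacency[b - 1][a - 1] = 1  # undirected graph
    return sentences, mapping, adjacency


functions = [["MOV", "SUB", "JMP"], ["STM", "LDR", "LDR", "CMP"], ["ADD", "SUM"],
             ["JMP", "NJMP"], ["ADD", "B"], ["MOV"]]
calls = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (5, 6)]
index = GlobalIndex()
sentences, mapping, adjacency = build_graph(functions, calls, index)
```

Because this sketch starts from an empty index, the six functions receive global numbers 1-6; in the patent's example the numbers are 1, 13, 145, 670, 678 and 1029 because functions from other application programs were registered earlier.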

Module two: function feature extraction

Step 4: all instructions are one-hot encoded, with encoding dimensions (N_Words, N_Words), where N_Words is the number of distinct instructions. Taking the figure as an example:

MOV：1 0 0 0 0 …

SUB：0 1 0 0 0 …

JMP：0 0 1 0 0 …

STM：0 0 0 1 0 …

LDR：0 0 0 0 1 …

……

Then, based on the corpus, word embedding is performed with the skip-gram algorithm. The embedding result has dimensions (N_Words, Embed_Dim), where Embed_Dim is a preset embedding vector length. The embedding result is as follows (taking Embed_Dim = 4 as an example; the vector values are made up purely for illustration):

MOV：1.1111 1.1111 1.1111 1.1111

SUB：2.2222 2.2222 2.2222 2.2222

JMP：3.3333 3.3333 3.3333 3.3333

STM：4.4444 4.4444 4.4444 4.4444

LDR：5.5555 5.5555 5.5555 5.5555

……
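The data preparation in step 4 can be sketched in Python as follows. The sketch builds the vocabulary from the corpus, one-hot encodes an instruction, and slides a window over each sentence to produce (center word, context word) training pairs for a skip-gram model. The function names and the window length of 1 are assumptions chosen for illustration, and the network training by back propagation and gradient descent is left out.

```python
from typing import Dict, List, Tuple


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign each distinct instruction an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word: str, vocab: Dict[str, int]) -> List[int]:
    """Return the N_Words-long one-hot vector of an instruction."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def skipgram_pairs(corpus: List[str], window: int) -> List[Tuple[str, str]]:
    """Slide a window over every sentence and emit (center, context) pairs."""
    pairs: List[Tuple[str, str]] = []
    for sentence in corpus:
        words = sentence.split()
        for i, center in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, words[j]))
    return pairs


corpus = ["MOV SUB JMP", "STM LDR LDR CMP", "ADD SUM", "JMP NJMP", "ADD B", "MOV"]
vocab = build_vocab(corpus)
pairs = skipgram_pairs(corpus, window=1)  # window length 1 is an arbitrary choice
```

With this corpus the vocabulary holds 10 instructions, so each one-hot vector has length 10, and a one-word sentence such as "MOV" contributes no training pairs.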

Step 5: the embeddings are concatenated in order to obtain the attribute of each node in the low-dimensional space. Node 1 has the attribute "MOV SUB JMP", so according to the embedding result its low-dimensional attribute is [[1.1111, 1.1111, 1.1111, 1.1111], [2.2222, 2.2222, 2.2222, 2.2222], [3.3333, 3.3333, 3.3333, 3.3333]], and the attribute of node 7 is [[1.1111, 1.1111, 1.1111, 1.1111], [1.1111, 1.1111, 1.1111, 1.1111], [1.1111, 1.1111, 1.1111, 1.1111]]. To encode the order information of each node's instruction sequence into the node attribute, an LSTM network is used to extract features; a fully connected layer then compresses the results of each layer and the features in the final result, finally yielding data of dimension (1,1) for each node as its node feature. The feature dimension of all nodes is then (N_Vertex,1), where N_Vertex is the total number of nodes in the data set;
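A minimal forward-pass sketch of step 5 in plain Python is given here, assuming a single LSTM layer with randomly initialized, untrained weights. It is a simplification: the fully connected layer is applied only to the final hidden state, zero vectors are appended at the end of short sequences, and the class name `TinyLSTM`, the hidden size of 3, and the padded length of 4 are all hypothetical choices. A practical implementation would use a deep learning framework and train the weights.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class TinyLSTM:
    """Illustrative, untrained LSTM cell followed by a dense layer that outputs one number."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def matrix(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        # One (weights, bias) pair per gate: input, forget, output, candidate.
        self.gates = {name: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
                      for name in "ifog"}
        self.dense_w = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.dense_b = 0.0

    def node_feature(self, sequence: List[Vector], max_len: int) -> float:
        dim = len(sequence[0])
        # Pad shorter instruction sequences with zero vectors.
        padded = sequence + [[0.0] * dim for _ in range(max_len - len(sequence))]
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in padded:
            z = x + h  # concatenate the input with the previous hidden state
            i = [sigmoid(v) for v in affine(*self.gates["i"], z)]
            f = [sigmoid(v) for v in affine(*self.gates["f"], z)]
            o = [sigmoid(v) for v in affine(*self.gates["o"], z)]
            g = [math.tanh(v) for v in affine(*self.gates["g"], z)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        # Fully connected layer compresses the final hidden state to a single value.
        return sum(w * v for w, v in zip(self.dense_w, h)) + self.dense_b


model = TinyLSTM(input_dim=4, hidden_dim=3)
node1 = [[1.1111] * 4, [2.2222] * 4, [3.3333] * 4]  # "MOV SUB JMP" from the example above
feature = model.node_feature(node1, max_len=4)
reversed_feature = model.node_feature(node1[::-1], max_len=4)
```

Feeding the same three embeddings in reverse order yields a different value, which is the order sensitivity that step 5 relies on.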

Module three: application program feature extraction

Step 6: the topological features of the graph are extracted by convolving it with the GCN. By the nature of the GCN, to extract the topological features of all application programs in the data set, the convolution kernel of the GCN should be a diagonal matrix of shape (N_Vertex, N_Vertex) whose diagonal entries are the parameters to be learned. The convolution takes the form GAf(x), where G is the convolution kernel, A is the adjacency matrix of the graph, and f(x) is a function of the information of each node in the graph, i.e., the node features extracted in step 5 in this example. The calculation proceeds as follows.

The adjacency matrix of the upper graph is:

       1  2  3  4  5  6
1  [[  1  1  0  0  0  0 ]
2   [  1  1  1  1  1  0 ]
3   [  0  1  1  1  0  0 ]
4   [  0  1  1  1  0  0 ]
5   [  0  1  0  0  1  1 ]
6   [  0  0  0  0  1  1 ]]

Because the convolution kernel has size (N_Vertex, N_Vertex), an adjacency matrix of the same size should be used in the matrix multiplication. For a single application program (the one in the figure), the many nodes that occur only in other application programs do not appear, so in the (N_Vertex, N_Vertex) adjacency matrix the application program shown in the figure has non-zero data only in rows and columns 1, 13, 145, 670, 678 and 1029. By the nature of matrix multiplication, the computations for all-zero rows and all-zero columns can be discarded, since their results are necessarily zero. Therefore, using the mapping table between global and local numbers, the entries in rows and columns 1, 13, 145, 670, 678 and 1029 of the convolution kernel are selected to form a (6,6) matrix, which is multiplied by the adjacency matrix above to complete the computation of GA. Similarly, the node features should include only the nodes of this graph: the (6,1) vector made up of rows 1, 13, 145, 670, 678 and 1029 of the (N_Vertex,1) feature matrix is selected as f(x). The final GAf(x) calculation yields a vector of shape (6,1);
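The reduced computation of GAf(x) can be sketched in Python as follows. The sketch stores the diagonal kernel as a list of its diagonal entries, picks out the entries and node features belonging to the six global numbers, and evaluates G(Af(x)). The value N_VERTEX = 1029, the kernel value 0.5 and the feature values 1.0 to 6.0 are placeholders chosen so the arithmetic is easy to follow; the function name `reduced_convolution` is hypothetical.

```python
from typing import List


def reduced_convolution(
    kernel_diag: List[float],
    features: List[float],
    adjacency: List[List[int]],
    global_ids: List[int],
) -> List[float]:
    """Evaluate G A f(x) using only the rows and columns of this graph's nodes.

    kernel_diag: diagonal of the (N_Vertex, N_Vertex) kernel G
    features:    the (N_Vertex, 1) node features from step 5
    adjacency:   the local (n, n) adjacency matrix A
    global_ids:  global number of each local node (1-based)
    """
    g_sub = [kernel_diag[g - 1] for g in global_ids]  # diagonal of the small (n, n) kernel
    f_sub = [features[g - 1] for g in global_ids]     # (n, 1) slice of the features
    # A f(x): sum the features over each node's neighbourhood.
    af = [sum(a * f for a, f in zip(row, f_sub)) for row in adjacency]
    # G (A f(x)): a diagonal G scales each row by its own parameter.
    return [g * v for g, v in zip(g_sub, af)]


N_VERTEX = 1029                       # assumed: smallest size consistent with the example
global_ids = [1, 13, 145, 670, 678, 1029]
adjacency = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
kernel_diag = [0.5] * N_VERTEX        # placeholder parameters, not learned values
features = [0.0] * N_VERTEX
for value, g in enumerate(global_ids, start=1):
    features[g - 1] = float(value)    # placeholder node features 1.0 .. 6.0

result = reduced_convolution(kernel_diag, features, adjacency, global_ids)
```

Only a 6-by-6 product is evaluated here, regardless of how large N_Vertex is, which is the saving that step 6 describes.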

Step 7: the convolution result of step 6 is a (6,1) matrix that contains the topology information and node attribute information of the application program shown in the figure, i.e., its feature information. To complete the classification of the application program (graph), this feature information must be sent to a classifier. To keep the input uniform across all application programs, the classifier input is a vector of shape (N_Vertex,1). As step 6 shows, even if the convolution were performed directly at global size, without the reduction of step 6, the result at node positions unrelated to this application program would be zero; it therefore suffices to pad the (6,1) vector obtained in step 6 with zeros to (N_Vertex,1). The expansion should keep the node features unchanged: according to the mapping [1,1], [2,13], [3,145], [4,670], [5,678], [6,1029], each row of the (6,1) vector is placed in its corresponding row and the remaining rows are set to zero. The resulting (N_Vertex,1) vector is the information vector extracted from a single application program; in this example, it is the feature information of the application program shown in the figure;
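The expansion in step 7 amounts to a scatter into a zero vector, as this short Python sketch shows. The six input values are placeholders standing in for a convolution result, N_VERTEX = 1029 is an assumed data set size, and the function name `expand_to_global` is hypothetical.

```python
from typing import List, Tuple


def expand_to_global(
    local_result: List[float],
    mapping: List[Tuple[int, int]],
    n_vertex: int,
) -> List[float]:
    """Place each local row at its global row and leave every other row at zero."""
    expanded = [0.0] * n_vertex
    for local, global_number in mapping:
        expanded[global_number - 1] = local_result[local - 1]  # numbers are 1-based
    return expanded


N_VERTEX = 1029  # assumed: smallest size consistent with the example's global numbers
mapping = [(1, 1), (2, 13), (3, 145), (4, 670), (5, 678), (6, 1029)]
local_result = [1.5, 7.5, 4.5, 4.5, 6.5, 5.5]  # placeholder (6,1) convolution result
expanded = expand_to_global(local_result, mapping, N_VERTEX)
```

Every application program is expanded against the same N_VERTEX, so all resulting vectors have the same length and a given row always refers to the same function.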

Module four: application program classification

Step 8: after the above 7 steps, every application program is encoded as an (N_Vertex,1) vector, and the classification task can be completed with a variety of classification methods.
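As one possible instance of step 8, the sketch below fits a linear regression by gradient descent on a mean squared error and thresholds its output at 0.5 to obtain a two-class decision. The four five-dimensional vectors and their labels are invented toy data, not results from the patent, and the learning rate, epoch count, threshold and function names are assumptions. A support vector machine or another classifier could be substituted.

```python
from typing import List, Tuple


def fit_linear(
    samples: List[List[float]],
    labels: List[float],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Least-squares linear regression trained with batch gradient descent."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for k, xk in enumerate(x):
                grad_w[k] += 2.0 * err * xk / n  # gradient of the mean squared error
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Return 1 when the regression output reaches the assumed 0.5 threshold, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0.5 else 0


# Toy stand-ins for expanded feature vectors; 1 and 0 are hypothetical class labels.
samples = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
labels = [1.0, 0.0, 1.0, 0.0]
w, b = fit_linear(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

For a task with more than two classes, one such model per class (one-vs-rest) is a common extension, although the patent does not prescribe it.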

The advantages and the effects are as follows:

The invention provides a method for identifying and classifying malicious programs that views application programs from a new angle and completes the malicious program classification task with higher accuracy and time efficiency by using new machine learning techniques. The innovations of the invention are as follows:

1) Instruction features are described using word embedding. By collecting all instruction sequences into a corpus and embedding the instructions in their specific contexts, the correlations between instructions are represented in the data. Compared with other machine learning algorithms, the method takes the practical meaning of the instructions into account, discards less information, and captures the features of the instructions better.

2) For each node, instruction features are extracted with an LSTM network. Because the behavior of a function is sensitive to the order of its instructions, using an algorithm from the RNN family is reasonable and effective. The gate structure unique to the LSTM lets the network process data at different positions in the sequence more effectively and thus extract sequence features better. This layer of the network extracts the unique features of each node.

3) The application program is understood through a graph structure. The scheme of the invention takes into account the topology of application program execution, which other schemes have not proposed. Although the execution of an application program is linear, the logic behind most application programs is not, and this is of practical significance in showing how the organization of an application program relates to its functionality. By converting the application program into a graph and then extracting information with the GCN, features that are otherwise difficult to obtain can be captured.

4) When the GCN performs the convolution, unnecessary nodes are discarded, which greatly reduces the data scale and the amount of computation and thereby improves time efficiency. Reducing unnecessary computation is necessary for economy in the design, and what the scheme discards is only meaningless operations. The method is therefore cheaper in both space and time than similar methods.

In conclusion, the method can extract more features and complete malicious program identification and classification better and faster, and it has guiding significance and good prospects for wider adoption.

Claims

1. A malicious program identification and classification method based on word embedding and GCN, characterized in that the method comprises the following modules:

2. The word embedding and GCN based malware identification and classification method of claim 1, wherein: the specific steps in the graph building module are as follows:

3. The word embedding and GCN based malware identification and classification method of claim 1, wherein:

4. The word embedding and GCN based malware identification and classification method of claim 3, wherein: the model is a skip-gram model or a CBOW model.

5. The word embedding and GCN based malware identification and classification method of claim 1, wherein: