CN113407938A - Malicious code classification method based on attention mechanism - Google Patents

Malicious code classification method based on attention mechanism

Info

Publication number
CN113407938A
CN113407938A (application CN202011267779.5A)
Authority
CN
China
Prior art keywords
attention mechanism
layer
attention
input
malicious code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011267779.5A
Other languages
Chinese (zh)
Inventor
陈剑延
张建国
王翔龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Herocheer Electronic Technology Co ltd
Original Assignee
Xiamen Herocheer Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Herocheer Electronic Technology Co ltd filed Critical Xiamen Herocheer Electronic Technology Co ltd
Priority to CN202011267779.5A priority Critical patent/CN113407938A/en
Publication of CN113407938A publication Critical patent/CN113407938A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 - Detecting local intrusion or implementing counter-measures
    • G06F 21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A malicious code classification method based on an attention mechanism comprises the following steps. S1: IDA Pro is used as the malicious-code parser and an automated decoder is built with scripts; each decoded executable produces two files, a Byte file and an ASM file, where the Byte file contains the hexadecimal representation of the parsed executable and the ASM file contains its disassembly statements. S2: the disassembly statements are extracted from the ASM file and the corresponding hexadecimal numbers are extracted from the Byte file; a fixed length is set for each disassembly statement, the hexadecimal numbers are extracted in execution order, and a two-dimensional vector is formed according to this mapping to serve as the input of the model. S3: an ACNN recognition model is constructed, combining an attention mechanism with a CNN; the two-dimensional vector is input into the ACNN recognition model to recognize and classify the malicious code.

Description

Malicious code classification method based on attention mechanism
Technical Field
The invention relates to a method for classifying malicious code, and in particular to a malicious code classification method based on an attention mechanism.
Background
When analyzing malware distributed in binary form, human experts focus mainly on the sequence of malicious operations that the code performs. Typically, the disassembled code is studied, since it reflects the execution logic of the program to some extent. Detecting malware by studying its disassembled code, i.e. analyzing it from the perspective of binary and disassembly, is therefore an important part of the malware detection field; many researchers work in this area and have proposed good methods for feature extraction and model construction.
However, existing methods may not use the information in binary and disassembled code efficiently. For example, some researchers convert the binary code directly into a fixed-size image, although there is no evidence that such a fixed size is appropriate. Others extract only the opcodes from the disassembled code to form a sequence, which preserves ordering information to some extent but discards much of the semantic information of the disassembly. Even when assembly statements share the same opcode, different operands can give them very different meanings.
At present, machine learning methods for malicious code detection fall broadly into traditional machine learning and deep learning. Traditional machine learning methods mainly include random forests, decision trees, SVMs and XGBoost. Deep learning methods mainly include fully connected neural networks, convolutional neural networks and time-series (recurrent) neural networks. However, both families of methods generally suffer from low classification accuracy.
Disclosure of Invention
The invention provides a malicious code classification method based on an attention mechanism, with the main aim of solving the low classification accuracy of existing malicious code classification methods.
In order to solve the technical problems, the invention adopts the following technical scheme:
a malicious code classification method based on an attention mechanism comprises the following steps:
s1, malicious software analysis: using IDA Pro as a resolver of malicious codes to construct an automatic decoder by using scripts, wherein each decoded executable code forms two files, namely a Byte file and an ASM file, the Byte file comprises hexadecimal to represent the resolved executable code, and the ASM file comprises a disassembly statement to represent the resolved executable code;
s2, feature extraction: firstly, extracting the disassembling sentences from the ASM file, then extracting hexadecimal numbers corresponding to the disassembling sentences from the Byte file, setting the length of each disassembling sentence, extracting the hexadecimal numbers according to the execution sequence, and forming a two-dimensional vector by contrasting the method, wherein the two-dimensional vector is used as the input of the model;
s3, constructing an ACNN recognition model: the ACNN recognition model comprises an attention mechanism and a CNN mechanism, and the two-dimensional vector is input into the ACNN recognition model to recognize and classify the malicious codes.
Further, in step S2, because the hexadecimal numbers corresponding to different disassembly statements have different lengths, the length of each disassembly statement is set to 20; statements shorter than 20 are padded with 0, and statements longer than 20 are truncated.
Furthermore, each row of the two-dimensional vector represents one assembly statement, and the rows are arranged vertically in the execution order of the assembly statements.
Further, the ACNN recognition model comprises an input layer, an attention mechanism layer, a CNN layer, a fully connected dense layer and an output layer. A two-dimensional vector is fed to the input layer. The resulting vector is then passed to the attention mechanism layer, which outputs different score values for different inputs according to the global context; in other words, the attention layer picks out the key assembly statements, which are represented by hexadecimal codes in the input vector. The score values are then fed to the CNN layer, which captures features of adjacent assembly statements; adjacent statements usually form a meaningful group of operations, which is an important basis for distinguishing malware categories. The fully connected dense layer then learns high-level features from the CNN layer, and finally the output layer produces the classification result.
Furthermore, the attention mechanism layer adopts a multi-head attention mechanism, which computes several groups of attention in parallel and concatenates their results; using multiple attention heads allows more features to be extracted.
Still further, in the multi-head attention mechanism, the head size is set to 8 and the number of heads to 2.
Further, the CNN layer consists of three convolutional layers, each with 100 convolution kernels of size 3 and each followed by a pooling layer of size 3.
Further, the data shape of the two-dimensional vector is (2000 × 20).
Further, the attention mechanism layer is described by the following formula:
Attention(Q,K,V)=F(Q,K)V
where Q denotes the query vector and (K, V) denotes a set of key-value pairs; the query vector obtains its weight values by attending over the key-value pairs. Analyzing the formula further, let Q = (q1, q2, ..., qn) and AV = Attention(Q, K, V); then AV = (av1, av2, ..., avn), where avi is the attention value of qi in the global context. That is, by applying the attention mechanism, each component receives a weight value that represents its degree of importance in the global context.
As can be seen from the above description, compared with the prior art the present invention has the following advantages. The invention makes full use of the correspondence between machine code (the binary code of the malware) and assembly statements to construct a two-dimensional input vector whose rows represent assembly statements and whose vertical order is the execution order of those statements. At the same time, the recognition model makes full use of the attention mechanism and borrows the idea of CNN convolution kernels; comparative experiments show that the recognition model fits more efficiently and that classification accuracy is greatly improved.
Drawings
FIG. 1 is a schematic diagram of the steps of the present invention.
FIG. 2 is a schematic diagram of the ACNN recognition model of the present invention.
FIG. 3 shows the classes and sample counts of the nine malware families.
FIG. 4 is the training accuracy curve of the present invention.
FIG. 5 is the training loss curve of the present invention.
Detailed Description
Referring to fig. 1, a malicious code classification method based on an attention mechanism includes the following steps:
s1, malicious software analysis: using IDA Pro as a resolver of malicious codes to construct an automatic decoder by using scripts, wherein each decoded executable code forms two files, namely a Byte file and an ASM file, the Byte file comprises hexadecimal to represent the resolved executable code, and the ASM file comprises a disassembly statement to represent the resolved executable code;
s2, feature extraction: firstly, extracting disassembling sentences from an ASM file, then extracting hexadecimal numbers corresponding to the disassembling sentences from a Byte file, and setting the length of each disassembling sentence, wherein the length of each disassembling sentence is set to be 20 because the hexadecimal numbers corresponding to each disassembling sentence are different in length, and if the length is less than 20, 0 is supplemented; if it exceeds 20, truncating; by extracting hexadecimal numbers according to the execution sequence and contrasting the method, a two-dimensional vector which transversely represents the assembly statement and longitudinally represents the execution sequence of the assembly statement can be formed and used as the input of the model;
s3, constructing an ACNN recognition model: the ACNN recognition model comprises an attention mechanism and a CNN mechanism, and the two-dimensional vector is input into the ACNN recognition model to recognize and classify the malicious codes.
Referring to fig. 2, the ACNN recognition model includes an input layer, an attention mechanism layer, a CNN layer, a fully connected dense layer and an output layer. A two-dimensional vector with data shape (2000 x 20) is fed to the input layer. The resulting vector is then passed to the attention mechanism layer, which outputs different score values for different inputs according to the global context; in other words, the attention layer picks out the key assembly statements, which are represented by hexadecimal codes in the input vector. The score values are then fed to the CNN layer, which captures features of adjacent assembly statements; adjacent statements usually form a meaningful group of operations, which is an important basis for distinguishing malware categories. The fully connected dense layer then learns high-level features from the CNN layer, and finally the output layer produces the classification result.
The attention mechanism layer adopts a multi-head attention mechanism, which computes several groups of attention in parallel and concatenates their results; using multiple attention heads allows more features to be extracted. In the invention, the head size in the multi-head attention mechanism is set to 8 and the number of heads to 2; the CNN layer consists of three convolutional layers, each with 100 convolution kernels of size 3 and each followed by a pooling layer of size 3.
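As an illustration of this architecture, the following Keras sketch wires the layers together with the stated hyperparameters (2 heads of size 8; three convolutional layers with 100 kernels of size 3, each followed by pooling of size 3, on a (2000, 20) input). The activation functions, the global pooling step, the dense-layer width of 128, the optimizer and the loss are assumptions that the patent does not specify.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_acnn(num_classes=9, num_stmts=2000, stmt_len=20):
    inputs = layers.Input(shape=(num_stmts, stmt_len))
    # attention mechanism layer: multi-head self-attention over the statement rows
    x = layers.MultiHeadAttention(num_heads=2, key_dim=8)(inputs, inputs)
    # CNN layer: capture features of adjacent assembly statements
    for _ in range(3):
        x = layers.Conv1D(filters=100, kernel_size=3, activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=3)(x)
    x = layers.GlobalMaxPooling1D()(x)          # assumed pooling before the dense layer
    # fully connected dense layer learning high-level features
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling build_acnn() and then model.summary() reproduces the layer stack described above; any detail not stated in this description may differ in the actual model.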
The attention mechanism layer is described by the following formula:
Attention(Q,K,V)=F(Q,K)V
where Q denotes the query vector and (K, V) denotes a set of key-value pairs; the query vector obtains its weight values by attending over the key-value pairs. Analyzing the formula further, let Q = (q1, q2, ..., qn) and AV = Attention(Q, K, V); then AV = (av1, av2, ..., avn), where avi is the attention value of qi in the global context. That is, by applying the attention mechanism, each component receives a weight value that represents its degree of importance in the global context.
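As a numerical illustration of Attention(Q, K, V) = F(Q, K)V, the sketch below assumes a common choice for F(Q, K), namely scaled dot-product similarity followed by a softmax; the patent itself does not fix the form of F.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # F(Q, K): similarity of each query q_i to every key, normalized to weights
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)          # one weight row per q_i
    return weights @ V                          # AV = (av_1, ..., av_n)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension 8
K = rng.normal(size=(6, 8))   # 6 key-value pairs
V = rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)   # (4, 8): av_i is the attention value of q_i
```

Each weight row sums to 1, so av_i is a weighted combination of the values in V, with larger weights given to the keys that q_i attends to most strongly.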
To demonstrate the superiority of the feature extraction method and the ACNN recognition model, the following experiments were carried out. We selected the data from the Microsoft 2015 Kaggle competition as our dataset; it consists of malware for the Windows platform. After parsing with IDA Pro, Byte files and ASM files were generated. The dataset contains 10086 samples in total, divided into 9 different malware families: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY and Gatak; the number of samples in each category is shown in fig. 3.
The parameters of the ACNN recognition model, based on the attention mechanism and the CNN, are set as follows: the attention mechanism layer uses a multi-head attention model in which the head size is set to 8 and the number of heads to 2, and the convolutional part consists of three convolutional layers, each with 100 convolution kernels of size 3 and each followed by a pooling layer of size 3.
The accuracy reaches its maximum after 200 training epochs, as shown in fig. 4, and the loss reaches its minimum after 200 epochs, as shown in fig. 5.
As can be seen from fig. 4 and 5, our ACNN recognition model converges rapidly, with the advantages of high accuracy and low loss.
To verify the effectiveness of our feature extraction method, we set up several groups of comparative experiments. We use 10-fold cross-validation to split the dataset into training and test sets, and then compute the average classification accuracy over these test sets.
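For concreteness, a sketch of this 10-fold evaluation protocol is shown below; build_model is a placeholder for any estimator with an sklearn-style fit/score interface (for example a random forest), and the stratified splitting and fixed random seed are assumptions rather than details from the patent.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(X, y, build_model, n_splits=10):
    """Average test accuracy over n_splits folds; a fresh model is trained per fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```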
The data we feed into the model is a two-dimensional vector in which each row is the hexadecimal representation of one assembly statement and the rows are ordered by execution sequence. To verify that this two-dimensional form is more effective, we designed two comparison forms. The first is the commonly used one-dimensional form, generated by flattening the two-dimensional vector into a one-dimensional vector with data shape (6000, 1). The second is also two-dimensional but with a smaller horizontal dimension: whereas our extraction keeps 20 columns, giving data shape (2000, 20), the comparison data has shape (2000, 3). The results of evaluating these inputs with an LSTM model and a random forest model are shown in Table 1.
TABLE 1
In this experiment, the two-dimensional vector is flattened into a one-dimensional vector before being used as the input of the Random Forest model; the test results are listed in Table 1.
As can be seen from Table 1, the conventional Random Forest model reaches a classification accuracy of about 0.71 regardless of the feature extraction method, which shows that the choice of feature extraction has little influence on the traditional Random Forest model. For the LSTM model, the larger the horizontal dimension extracted, the better the classification; the best accuracy of 0.8725 is obtained with the (2000, 20) data shape, which shows that for a deep learning model this feature extraction method strengthens horizontal correlation and improves classification accuracy.
To validate the effectiveness of the ACNN framework, we established four baseline models: a Random Forest model, a bidirectional LSTM (Bi-LSTM) model, a multi-head attention model and a deep neural network (DNN) model; the results of the comparison tests are shown in Table 2.
TABLE 2
As can be seen from Table 2, the classification accuracy of the ACNN (Attention + CNN) recognition model, which combines the attention mechanism with the CNN, reaches 0.9609. This is nearly 5 percentage points higher than the multi-head attention model, nearly 9 percentage points higher than the bidirectional LSTM (Bi-LSTM) model, and nearly 25 percentage points higher than the traditional Random Forest model. At the same time, the ACNN model is well ahead of the other attention-based models.
The above description is only one embodiment of the present invention, but the design concept of the invention is not limited thereto; any insubstantial modification made using this design concept shall be deemed to fall within the protection scope of the present invention.

Claims (9)

1. A malicious code classification method based on an attention mechanism, characterized by comprising the following steps:
s1, malicious software analysis: using IDA Pro as a resolver of malicious codes to construct an automatic decoder by using scripts, wherein each decoded executable code forms two files, namely a Byte file and an ASM file, the Byte file comprises hexadecimal to represent the resolved executable code, and the ASM file comprises a disassembly statement to represent the resolved executable code;
s2, feature extraction: firstly, extracting the disassembling sentences from the ASM file, then extracting hexadecimal numbers corresponding to the disassembling sentences from the Byte file, setting the length of each disassembling sentence, extracting the hexadecimal numbers according to the execution sequence, and forming a two-dimensional vector by contrasting the method, wherein the two-dimensional vector is used as the input of the model;
s3, constructing an ACNN recognition model: the ACNN recognition model comprises an attention mechanism and a CNN mechanism, and the two-dimensional vector is input into the ACNN recognition model to recognize and classify the malicious codes.
2. The method for classifying malicious code based on an attention mechanism according to claim 1, wherein: in step S2, because the hexadecimal numbers corresponding to different disassembly statements have different lengths, the length of each disassembly statement is set to 20; statements shorter than 20 are padded with 0, and statements longer than 20 are truncated.
3. The method for classifying malicious code based on an attention mechanism according to claim 1, wherein: each row of the two-dimensional vector represents one assembly statement, and the rows are arranged vertically in the execution order of the assembly statements.
4. The method for classifying malicious code based on an attention mechanism according to claim 1, wherein: the ACNN recognition model comprises an input layer, an attention mechanism layer, a CNN layer, a fully connected dense layer and an output layer; a two-dimensional vector is fed to the input layer; the resulting vector is then passed to the attention mechanism layer, which outputs different score values for different inputs according to the global context, i.e. the attention layer picks out the key assembly statements, represented by hexadecimal codes in the input vector; the score values are then fed to the CNN layer, which captures features of adjacent assembly statements, adjacent statements usually forming a meaningful group of operations that is an important basis for distinguishing malware categories; the fully connected dense layer then learns high-level features from the CNN layer; and finally the output layer produces the classification result.
5. The method for classifying malicious code based on an attention mechanism according to claim 4, wherein: the attention mechanism layer adopts a multi-head attention mechanism, which computes several groups of attention in parallel and concatenates their results, so that more features can be extracted.
6. The method for classifying malicious code based on an attention mechanism according to claim 5, wherein: in the multi-head attention mechanism, the head size is set to 8 and the number of heads to 2.
7. The method for classifying malicious code based on an attention mechanism according to claim 4, wherein: the CNN layer consists of three convolutional layers, each with 100 convolution kernels of size 3 and each followed by a pooling layer of size 3.
8. The method for classifying malicious code based on an attention mechanism according to claim 3, wherein: the data shape of the two-dimensional vector is (2000 x 20).
9. The method for classifying malicious code based on an attention mechanism as claimed in claim 4, wherein: the attention mechanism layer is described by the following formula:
Attention(Q,K,V)=F(Q,K)V
where Q denotes the query vector and (K, V) denotes a set of key-value pairs; the query vector obtains its weight values by attending over the key-value pairs. Analyzing the formula further, let Q = (q1, q2, ..., qn) and AV = Attention(Q, K, V); then AV = (av1, av2, ..., avn), where avi is the attention value of qi in the global context. That is, by applying the attention mechanism, each component receives a weight value that represents its degree of importance in the global context.
CN202011267779.5A 2020-11-13 2020-11-13 Malicious code classification method based on attention mechanism Pending CN113407938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011267779.5A CN113407938A (en) 2020-11-13 2020-11-13 Malicious code classification method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011267779.5A CN113407938A (en) 2020-11-13 2020-11-13 Malicious code classification method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN113407938A true CN113407938A (en) 2021-09-17

Family

ID=77677412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267779.5A Pending CN113407938A (en) 2020-11-13 2020-11-13 Malicious code classification method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113407938A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019083737A1 (en) * 2017-10-27 2019-05-02 Fireeye, Inc. System and method for analyzing binary code malware classification using artificial neural network techniques
CN111538989A (en) * 2020-04-22 2020-08-14 四川大学 Malicious code homology analysis method based on graph convolution network and topic model
CN111723368A (en) * 2020-05-28 2020-09-29 中国人民解放军战略支援部队信息工程大学 Bi-LSTM and self-attention based malicious code detection method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019083737A1 (en) * 2017-10-27 2019-05-02 Fireeye, Inc. System and method for analyzing binary code malware classification using artificial neural network techniques
CN111538989A (en) * 2020-04-22 2020-08-14 四川大学 Malicious code homology analysis method based on graph convolution network and topic model
CN111723368A (en) * 2020-05-28 2020-09-29 中国人民解放军战略支援部队信息工程大学 Bi-LSTM and self-attention based malicious code detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIN MA et al.: "How to Make Attention Mechanisms More Practical in Malware Classification", IEEE Access *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210917