CN117113163A - Malicious code classification method based on bidirectional time domain convolution network and feature fusion - Google Patents

Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Info

Publication number
CN117113163A
Authority
CN
China
Prior art keywords
malicious code
time domain
fusion
malicious
bidirectional time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310711779.7A
Other languages
Chinese (zh)
Inventor
李思聪
王坚
黄玮
史松昊
李乐民
王科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Engineering University of PLA
Original Assignee
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Engineering University of PLA filed Critical Air Force Engineering University of PLA
Priority to CN202310711779.7A priority Critical patent/CN117113163A/en
Publication of CN117113163A publication Critical patent/CN117113163A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Virology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application belongs to the field of malicious code classification and provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, comprising the following steps: step 11, acquiring an original malicious code file; step 12, preprocessing the original file to obtain a malicious code image, so that the model extracts malicious code features more comprehensively and the classification and identification accuracy is further improved; step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, where the network processes the image into data of uniform size. The application combines two different features by a pooling fusion method; the features complement each other and can be fully learned, so that malicious code features are better retained, the feature-learning capability of the model is effectively improved, and a good classification effect is obtained.

Description

Malicious code classification method based on bidirectional time domain convolution network and feature fusion
Technical Field
The application belongs to the field of malicious code classification, and particularly relates to a malicious code classification method based on a bidirectional time domain convolution network and feature fusion.
Background
Malicious code is code or a web-page script that harms a computer system; its goal is to create vulnerabilities in a target computer, steal data and information, and pose a potential hazard to systems and files. Malicious code analysis techniques are divided into dynamic and static analysis according to whether the file is executed. Dynamic analysis runs executable files in sandboxes, simulators and virtual machines and monitors and analyzes application behavior through system calls. Static analysis extracts static features of the malicious code to identify illegal behavior of samples, and can capture information related to structural characteristics, such as API calls and operation codes (opcodes).
The time domain convolutional network (temporal convolutional network, TCN) is a newer member of the convolutional neural network (convolutional neural networks, CNN) family. It adopts dilated convolution to enlarge the receptive field of the model while reducing computation; causal convolution preserves the temporal order of the data; and residual connections effectively alleviate the vanishing- and exploding-gradient problems. A TCN can therefore process data with large-scale parallelism while preventing future data from influencing the past. A time domain convolutional network mainly consists of causal convolution, dilated convolution and residual connections, and its convolutional layers support parallel computation, which effectively addresses excessive training time. However, a single TCN cannot encode information from back to front, so it cannot learn the association between the current feature item and later feature items.
In order to mine the bidirectional feature information contained in the malicious code sequence and exploit the advantages of the TCN in processing time-series feature information, the application, inspired by the bidirectional recurrent neural network, provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion.
Disclosure of Invention
To solve the above technical problems, the application provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, which aims to solve problems of the prior art such as that a single TCN cannot encode information from back to front and therefore cannot learn the association between the current feature item and later feature items.
A malicious code classification method based on a bidirectional time domain convolution network and feature fusion comprises the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
and step 16, obtaining a classification result of the malicious codes.
Preferably, the method for preprocessing data in step 12 includes:
s121, performing batch disassembly on PE files of the malicious code original files to obtain. Asm files and. Bytes files;
s122, extracting N-Gram operation code subsequence features based on the bytes file, and extracting gray image texture features based on the asm file;
s123, fusing the N-Gram operation code sub-sequence features and the gray image texture features to obtain a malicious code image.
Preferably, the pre-classified malicious code image in step 12 is divided into a training set and a verification set, the training set is used for training a model, a training feature vector is generated, and the verification set is used for observing and evaluating the performance of the model, and a verification feature vector is generated.
Preferably, the proportion of the training set is 70% and the proportion of the verification set is 30%.
Preferably, the training method of the bidirectional time domain convolution network in the step 13 is as follows:
step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
step 132: and carrying out convolution calculation on the sequence from right to left to realize backward feature extraction.
Preferably, the fusion method in step 14 is pooling fusion: deep features are extracted by a fusion pooling layer obtained by connecting max pooling and average pooling in parallel, thereby capturing the dependency relationships in the data.
Preferably, the pooling fusion is formulated as:
h_max = MaxPool(h), h_ave = AvgPool(h), h_fuse = h_max ⊕ h_ave
where h is the output of the bidirectional time domain convolution network, h_max is the output of the max-pooling layer, h_ave is the output of the average-pooling layer, ⊕ denotes concatenation of the two pooled outputs, and h_fuse is the output after fusion pooling.
Compared with the prior art, the application has the following beneficial effects:
1. The gray-scale image texture features are extracted from the .bytes file, and a one-dimensional image rather than a two-dimensional image is used to represent the malicious code features, which avoids the spurious local correlation between pixels that image folding introduces. The gray-scale image features and the operation code features reflect the similarity of malicious code of the same family at the global and local level respectively; fusing the global and local features exploits the feature information of the malicious code from multiple angles. Using the fused features as the input of the bidirectional time domain convolution network model for training and classification increases the accuracy of malicious code detection, and the network can fully utilize data information in both the forward and backward directions.
2. The application combines two different features by a pooling fusion method; the features complement each other and can be fully learned, so that malicious code features are better retained, the feature-learning capability of the model is effectively improved, and a good classification effect is obtained.
Drawings
FIG. 1 is a flow chart of a malicious code classification method according to the present application;
FIG. 2 is a flow chart of a method for preprocessing data according to the present application;
FIG. 3 is a flow chart of a training method for a two-way time domain convolutional network according to the present application;
FIG. 4 is a process diagram of malicious code byte feature extraction to generate a grayscale image in accordance with the present application;
FIG. 5 is a diagram of a malicious code classification model based on BiTCN-DLP according to the present application;
FIG. 6 is a graph of the performance of the training set and the test set as a function of training batch during the model training process of the present application;
FIG. 7 is a graph of loss rate as a function of training batch for the present application;
FIG. 8 is a graph showing the effect of the N-gram value n on the model;
FIG. 9 is a comparison of the sequence features of the operational code, grayscale image, and hybrid features of the present application;
FIG. 10 is a graph showing the results of ablation experiments performed by BiTCN, forward TCN and reverse TCN of the present application;
FIG. 11 is a diagram of features extracted by four different schemes (no pooling, mean pooling, max pooling and fusion pooling) according to the present application;
FIG. 12 is a comparison of the model of the present application with other malicious code classification models of recent years.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the application but are not intended to limit the scope of the application.
Embodiment one: as shown in fig. 1 to 12: the application provides a malicious code classification method based on a bidirectional time domain convolution network and feature fusion, which comprises the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
the malicious codes are files composed of a series of bytes, binary files of the malicious codes are converted into gray images according to the similarity of the value ranges of the bytes and pixels in the gray images, and classification of malicious code families is achieved according to the concept that the malicious codes in the same family have similar textures of the gray images and the characteristics that the malicious codes in different families have different textures due to different structures of the malicious codes.
FIG. 4 shows the process of extracting byte features of malicious code to generate a gray-scale image: the malicious code file is converted into a binary stream; vectors of 8-bit binary numbers are read from the binary data, each vector corresponding to one pixel; and the binary value of each vector is converted into a decimal value in the interval [0, 255], where 0 is black and 255 is white. In this way the malicious code can be converted into a gray-scale image.
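The byte-to-pixel mapping described above can be sketched in a few lines of Python (an illustrative sketch, not the patent's implementation; the function name is an assumption):

```python
def bytes_to_gray(data):
    """Map every 8-bit value of the binary stream to one gray pixel:
    0b00000000 -> 0 (black), 0b11111111 -> 255 (white). The sequence is
    kept one-dimensional rather than folded into a 2-D image."""
    return [b for b in data]  # iterating over bytes yields ints in [0, 255]

pixels = bytes_to_gray(b"\x00\x80\xff")
# -> [0, 128, 255]
```

The one-dimensional form matches the method's stated choice of a 1-D representation, which avoids introducing artificial adjacency between rows of a folded 2-D image.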
Step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
And step 16, obtaining a classification result of the malicious code, the classification being performed through a softmax layer.
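The softmax layer of step 16 turns the final feature vector into a probability distribution over the malicious code families; a minimal sketch follows (the logits are made-up illustrative values, not outputs of the model):

```python
import math

def softmax(logits):
    """Convert a vector of class scores into a probability distribution
    over the malicious code families (the role of the softmax layer)."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_family = probs.index(max(probs))   # index 0 has the largest logit
```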
The method for preprocessing the data in the step 12 comprises the following steps:
s121, performing batch disassembly on PE files of the malicious code original files to obtain. Asm files and. Bytes files;
s122, extracting N-Gram operation code subsequence features based on the bytes file, and extracting gray image texture features based on the asm file;
s123, fusing the N-Gram operation code sub-sequence features and the gray image texture features to obtain a malicious code image.
Dividing the pre-classified malicious code image in the step 12 into a training set and a verification set, wherein the training set is used for training a model to generate training feature vectors, and the verification set is used for observing and evaluating the performance of the model to generate verification feature vectors.
The proportion of the training set is 70% and the proportion of the verification set is 30%.
The training method of the bidirectional time domain convolution network in the step 13 is as follows:
step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
step 132: and carrying out convolution calculation on the sequence from right to left to realize backward feature extraction.
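Steps 131 and 132 can be illustrated with a minimal causal convolution in Python (an explanatory sketch only, not the patented implementation; a real BiTCN stacks dilated causal convolutions with residual connections):

```python
def causal_conv(x, w):
    """1-D causal convolution with left zero-padding, so the output at
    position t depends only on inputs at positions <= t."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

def bitcn_features(x, w):
    """Step 131: left-to-right (forward) causal convolution.
    Step 132: right-to-left pass, done by convolving the reversed
    sequence and flipping the result back to the original order."""
    fwd = causal_conv(x, w)
    bwd = causal_conv(x[::-1], w)[::-1]
    return fwd, bwd

fwd, bwd = bitcn_features([1.0, 2.0, 3.0], [0.5, 0.5])
# fwd -> [0.5, 1.5, 2.5]; bwd -> [1.5, 2.5, 1.5]
```

Reversing the input and then reversing the output is a standard way to obtain the backward feature stream from the same causal operator.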
The fusion method in step 14 is pooling fusion: deep features are extracted by a fusion pooling layer obtained by connecting max pooling and average pooling in parallel, thereby capturing the dependency relationships in the data.
The pooling fusion is formulated as:
h_max = MaxPool(h), h_ave = AvgPool(h), h_fuse = h_max ⊕ h_ave
where h is the output of the bidirectional time domain convolution network, h_max is the output of the max-pooling layer, h_ave is the output of the average-pooling layer, ⊕ denotes concatenation of the two pooled outputs, and h_fuse is the output after fusion pooling.
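A minimal sketch of fusion pooling, with concatenation assumed as the fusion operator (the window size and function name are assumptions made for illustration):

```python
def pool_fuse(h, window=2):
    """Fusion pooling: run max pooling and average pooling in parallel
    over non-overlapping windows, then concatenate the two outputs."""
    chunks = [h[i:i + window] for i in range(0, len(h) - window + 1, window)]
    h_max = [max(c) for c in chunks]            # output of the max-pooling layer
    h_ave = [sum(c) / len(c) for c in chunks]   # output of the average-pooling layer
    return h_max + h_ave                        # h_fuse = concat(h_max, h_ave)

h = [1.0, 3.0, 2.0, 2.0]
h_fuse = pool_fuse(h)
# -> [3.0, 2.0, 2.0, 2.0]
```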
As can be seen from FIG. 5, the application extracts gray-scale image texture features from the .bytes file and uses a one-dimensional image rather than a two-dimensional image to represent the malicious code features, avoiding the spurious local correlation between pixels that image folding introduces. The gray-scale image features and the operation code features reflect the similarity of malicious code of the same family at the global and local level respectively, so fusing the global and local features exploits the feature information of the malicious code from multiple angles. Using the fused features as the input of the bidirectional time domain convolution network model for training and classification increases the accuracy of malicious code detection, and the network can fully utilize data information in both the forward and backward directions.
To verify the performance of the above method, the application uses the training dataset from the public dataset provided by the Malware Classification Challenge (BIG 2015) to evaluate the model. The dataset contains 10868 labeled malicious code samples in total, divided into 9 malicious code families. Each sample has been unpacked and comprises two files: a hexadecimal-represented .bytes file and an .asm file disassembled from the malicious code binary. Each malicious code file has an Id, a 20-character hash value uniquely identifying the file, and a Class.
The dataset comprises roughly 200 GB of data: about 50 GB of .bytes files and about 150 GB of .asm files. It is divided into two parts, a training set used to train the model and a validation set used to observe and evaluate model performance. In the experiment, 70% of the dataset was assigned to the training set and 30% to the validation set, giving 7608 training samples and 3260 test samples.
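The 70/30 split can be reproduced numerically (an illustrative sketch; the shuffling seed is an arbitrary assumption): with 10868 samples, a 70% cut yields exactly the 7608/3260 division reported above.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the sample indices and split them 70/30 into a training
    set (model fitting) and a validation set (performance evaluation)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = round(len(samples) * train_ratio)
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

train, val = split_dataset(list(range(10868)))
# 10868 samples split 70/30 -> 7608 training and 3260 validation samples.
```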
The performance of the model is evaluated using four indexes widely applied in related research: accuracy (Accuracy), precision (Precision), recall (Recall) and F1 score (F1-score), with the following formulas:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)
Here TP indicates that an actual malicious code sample is correctly predicted as malicious code, and FP indicates that an actual normal code sample is incorrectly predicted as malicious code. Similarly, TN indicates that an actual normal code sample is correctly predicted as normal code, and FN indicates that an actual malicious code sample is incorrectly predicted as normal code.
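The four formulas can be checked with a small helper (illustrative only; the confusion-matrix counts below are made-up numbers, not results from the experiments):

```python
def metrics(tp, fp, tn, fn):
    """Compute the four evaluation indexes from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration only.
acc, prec, rec, f1 = metrics(tp=90, fp=10, tn=85, fn=15)
# acc -> 0.875, prec -> 0.9
```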
To fully verify the effectiveness of the BiTCN-DLP based malicious code classification method presented herein, the following experiments were set up:
Experiment 1: BiTCN-DLP performance analysis experiment:
the selection of the hyper-parameters has a critical influence on the training effect of the model, so that the performance of the model is reflected to the greatest extent, and parameter adjustment of the model is needed. In the BiTCN-DLP model, only opcode single feature optimization model parameters are used.
Among the optimization-class parameters, the number of convolution kernels is selected as the hyper-parameter; among the model-class parameters, the number of BiTCN layers and the number of neurons in each layer are selected as variables.
The number of model iterations is set to 200; the dilation factor in the TCN grows exponentially with base 2 and is set to (1, 2, 4, 8); Adam [37] is selected as the optimization algorithm; and the learning rate is set to 0.002.
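The dilation schedule (1, 2, 4, 8) determines how far back the network can see. The receptive field can be computed as follows (a sketch assuming the standard TCN residual block with two dilated convolutions per level and a kernel size of 3, neither of which is stated in the text):

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field of stacked dilated causal convolutions: every
    conv with dilation d widens the field by (kernel_size - 1) * d."""
    return 1 + sum(convs_per_block * (kernel_size - 1) * d for d in dilations)

rf = tcn_receptive_field(kernel_size=3, dilations=(1, 2, 4, 8))
# 1 + 2 * 2 * (1 + 2 + 4 + 8) = 61 input steps visible to the top layer
```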
To avoid over-fitting, a dropout layer with a rate of 0.2 is added. To make the experimental data more accurate and reliable, five-fold cross-validation is adopted.
A grid search over the number of convolution kernels (2, 3, 4, 5, 6) is performed for parameter optimization, and the optimal parameter settings of the model are finally determined as shown in the following table.
According to these parameter values, the two features extracted from the operation codes and the byte codes are fused, and experiments are carried out with the BiTCN-DLP classification model.
FIG. 6 shows the accuracy of the training set and the test set as a function of training batch during model training.
Fig. 7 is a graph of loss rate as a function of training batch. Light grey represents the test set and dark grey represents the training set. It can be seen that the model can converge quickly. Through training and testing, the accuracy of the model reaches 99.54%, and the loss rate is 0.0292.
Experiment 2: n-gram feature selection experiment:
the model data extracts the characteristics of the operation codes in the malicious codes by using an N-gram algorithm in a data processing part, wherein the value of N has direct influence on the model effect. In order to obtain the optimal value of n, the rest conditions are the same, and four different values of n=2, 3,4,5 are compared, and the experimental result is shown in fig. 8:
as can be seen from the graph, when n=3, the accuracy of the model is as high as 99.54% compared with other N-gram values, which is higher than other values. When the value of n is higher than 3, the accuracy gradually decreases. Experimental results show that n=3 is the optimal value of N-gram.
Experiment 3: single-feature and multi-feature fusion contrast analysis experiments:
To further improve the extraction of data information, during data processing the model extracts operation code sequence features with the N-gram method and extracts gray-scale image features of the malicious code from the byte code, then fuses the two. To verify the effectiveness of this approach, a comparison experiment is set up comparing the operation code sequence features, the gray-scale image features and the hybrid features; the experimental results are shown in FIG. 9.
as can be seen from the graph, the four evaluation indexes of the accuracy, the precision, the recall and the F1 value of the mixed feature are respectively improved by 1.35%, 2.04%, 6.32% and 4.85% compared with the single operation code feature, and are respectively improved by 10.13%, 8.77%, 9.60% and 1.12% compared with the single byte code feature.
The experimental results show that the hybrid of opcode features and gray-scale image features performs significantly better than either single feature, verifying the effectiveness of the method. The reason is that the opcode sequence and the gray-scale image each reflect the essence of the malicious code at a different scale; combining the features extracted from both enriches the feature information of the malicious code, produces a complementary effect, and resists the influence of obfuscation and packing of the malicious code, thereby achieving a better result.
Experiment 4: two-way TCN validity verification experiment:
A bidirectional time domain convolutional network (BiTCN) can extract more comprehensive and robust features than a unidirectional network. Comparison experiments show that the proposed BiTCN offers high accuracy and fast convergence, with advantages in parameter count and detection speed. To verify the effectiveness of BiTCN relative to a unidirectional TCN, an ablation experiment was performed with BiTCN, forward TCN and reverse TCN; the results are shown in FIG. 10.
As can be seen from FIG. 10, the bidirectional model outperforms the unidirectional models in accuracy, recall, precision and F1 score for malicious code classification. This indicates that the bidirectionally fused features are more effective: they comprehensively exploit features in both directions, relieve the limitation that unidirectional features impose on classification, and make full use of the bidirectional correlation between distant units, so the feature representation is more balanced. The forward and reverse features complement each other when fused, which verifies the rationality and effectiveness of BiTCN-DLP for malicious code recognition and helps further improve the detection effect.
Experiment 5: pooling fusion validity verification experiment:
A fusion of max pooling and mean pooling is adopted to address the model's insufficient feature-extraction capability. To verify the effectiveness of the proposed fusion pooling method, this section sets up a comparison of how different pooling methods affect malicious code classification and detection: under the same experimental conditions, the model extracts features with four different schemes (no pooling, mean pooling, max pooling and fusion pooling), and the results of the four schemes are shown in FIG. 11.
As can be seen from FIG. 11, fusion pooling achieves higher detection accuracy than mean pooling or max pooling alone. The reason is that mean pooling extracts features of global significance while max pooling extracts features of local significance, so the features obtained by the two modes differ greatly.
The pooling fusion method combines these two different kinds of features; because they complement each other, the features can be fully learned, malicious code features are better retained, the model's feature-learning capability is effectively improved, and a good classification effect is obtained.
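A minimal sketch of the parallel max-pool/mean-pool fusion discussed above, in numpy. The window size and the element-wise addition used as the fusion operator are assumptions for illustration; the patent text does not reproduce the exact operator here.

```python
import numpy as np

def fusion_pool(h, window=2):
    """Apply max pooling and mean pooling in parallel over non-overlapping
    windows, then fuse the two outputs by element-wise addition."""
    n = len(h) // window
    blocks = h[:n * window].reshape(n, window)
    h_max = blocks.max(axis=1)   # locally significant (salient) features
    h_ave = blocks.mean(axis=1)  # globally significant (smoothed) features
    return h_max + h_ave         # fused output h_fuse

h = np.array([1.0, 3.0, 2.0, 2.0])
h_fuse = fusion_pool(h)  # windows [1, 3] and [2, 2]
```

Because the two branches see the same input but summarize it differently, the fused output retains both the strongest activation in each window and its average level, which is the complementarity the experiment above measures.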
Experiment 6: model contrast analysis experiment:
To further verify the performance of the BiTCN-DLP based malicious code classification model, a comparison experiment was set up against other malicious code classification models from recent years; the results are shown in figure 12.
As can be seen from the figure, the BiTCN-DLP based malicious code classification model proposed herein achieves an accuracy of 99.54% and outperforms all the other compared methods in classification accuracy.
While embodiments of the present application have been shown and described above for purposes of illustration and description, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (7)

1. The malicious code classification method based on the bidirectional time domain convolution network and feature fusion, characterized by comprising the following steps:
step 11, acquiring an original file of malicious codes;
step 12, preprocessing the malicious code original file to obtain a malicious code image, so that the feature extraction of the malicious code by the model is more comprehensive, and the classification and identification accuracy of the malicious code is further improved;
step 13, inputting the preprocessed malicious code image into a bidirectional time domain convolution network for training, wherein the bidirectional time domain convolution network processes the malicious code image into data with uniform size;
step 14, fusing the bidirectional features with uniform size through the bidirectional time domain convolution network, so as to acquire the dependency relationship of data between the forward and reverse propagation directions;
step 15, extracting and compressing further features of the feature map after the bidirectional time domain convolution;
and step 16, obtaining a classification result of the malicious codes.
2. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the method for preprocessing the data in the step 12 comprises the following steps:
S121, performing batch disassembly on the PE files of the malicious code original files to obtain .asm files and .bytes files;
S122, extracting N-Gram operation code subsequence features based on the .bytes file, and extracting gray image texture features based on the .asm file;
S123, fusing the N-Gram operation code subsequence features and the gray image texture features to obtain a malicious code image.
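The two feature-extraction steps in this claim can be sketched in a few lines of numpy. Both helpers below are illustrative assumptions, not the patented preprocessing: the n-gram window, the fixed image width, and the byte-per-pixel mapping are common choices in malware-visualisation work and are used here only to show the shape of the data.

```python
import numpy as np

def opcode_ngrams(opcodes, n=3):
    """N-Gram operation code subsequences: a sliding window of n opcodes."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def bytes_to_gray(data, width=8):
    """Map raw bytes to a grayscale image: one byte per pixel in [0, 255],
    rows of a fixed width."""
    n_rows = len(data) // width
    raw = bytes(data[:n_rows * width])
    return np.frombuffer(raw, dtype=np.uint8).reshape(n_rows, width)

ops = ["push", "mov", "call", "mov", "ret"]
grams = opcode_ngrams(ops, n=3)       # three opcode trigrams
img = bytes_to_gray(list(range(16)))  # a 2 x 8 grayscale image
```

In the claimed method the two feature sets produced this way are then fused into a single malicious code image before being fed to the network.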
3. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: dividing the pre-classified malicious code image in the step 12 into a training set and a verification set, wherein the training set is used for training a model to generate training feature vectors, and the verification set is used for observing and evaluating the performance of the model to generate verification feature vectors.
4. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion of claim 3, wherein: the proportion of the training set is 70% and the proportion of the verification set is 30%.
5. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the training method of the bidirectional time domain convolution network in the step 13 is as follows:
Step 131: performing convolution calculation on the sequence from left to right to realize forward feature extraction;
Step 132: performing convolution calculation on the sequence from right to left to realize backward feature extraction.
6. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 1, wherein: the fusion method in the step 14 is pooling fusion: extracting deep features through a fusion pooling layer formed by connecting maximum pooling and average pooling in parallel, and further capturing the dependency relationships in the data.
7. The malicious code classification method based on the bidirectional time domain convolutional network and feature fusion as recited in claim 6, characterized in that: the formula of the pooling fusion is
h_fuse = h_max + h_ave
wherein h is the output of the bidirectional time domain convolution network, h_max is the output of the maximum pooling layer, h_ave is the output of the average pooling layer, and h_fuse is the output result after fusion pooling.
CN202310711779.7A 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion Pending CN117113163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711779.7A CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711779.7A CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Publications (1)

Publication Number Publication Date
CN117113163A true CN117113163A (en) 2023-11-24

Family

ID=88799051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711779.7A Pending CN117113163A (en) 2023-06-15 2023-06-15 Malicious code classification method based on bidirectional time domain convolution network and feature fusion

Country Status (1)

Country Link
CN (1) CN117113163A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574370A (en) * 2023-11-28 2024-02-20 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system
CN117574370B (en) * 2023-11-28 2024-05-31 中华人民共和国新疆出入境边防检查总站(新疆维吾尔自治区公安厅边境管理总队) Malicious code detection system
CN117909956A (en) * 2024-03-20 2024-04-19 山东科技大学 Hardware-assisted embedded system program control flow security authentication method
CN117972702A (en) * 2024-04-01 2024-05-03 山东省计算中心(国家超级计算济南中心) API call heterogeneous parameter enhancement-based malicious software detection method and system

Similar Documents

Publication Publication Date Title
CN117113163A (en) Malicious code classification method based on bidirectional time domain convolution network and feature fusion
Chen et al. Adversarial examples for cnn-based malware detectors
CN111027069B (en) Malicious software family detection method, storage medium and computing device
CN113076994B (en) Open-set domain self-adaptive image classification method and system
CN111753881A (en) Defense method for quantitatively identifying anti-attack based on concept sensitivity
CN113297572B (en) Deep learning sample-level anti-attack defense method and device based on neuron activation mode
Zhu et al. Multi-loss siamese neural network with batch normalization layer for malware detection
CN111915437A (en) RNN-based anti-money laundering model training method, device, equipment and medium
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN112217787B (en) Method and system for generating mock domain name training data based on ED-GAN
CN113806746A (en) Malicious code detection method based on improved CNN network
CN111914254A (en) Weak coupling SGAN-based malicious software family classifier generation method and device and readable storage medium
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN105243327B (en) A kind of secure file processing method
CN109697240A (en) A kind of image search method and device based on feature
CN105468972B (en) A kind of mobile terminal document detection method
CN113033305B (en) Living body detection method, living body detection device, terminal equipment and storage medium
CN112733140A (en) Detection method and system for model tilt attack
Liu et al. Defend Against Adversarial Samples by Using Perceptual Hash.
CN116188439A (en) False face-changing image detection method and device based on identity recognition probability distribution
CN111209567B (en) Method and device for judging perceptibility of improving robustness of detection model
Khan et al. Detection of data scarce malware using one-shot learning with relation network
CN115828248B (en) Malicious code detection method and device based on interpretive deep learning
Chyou et al. Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
Spratling Comprehensive assessment of the performance of deep learning classifiers reveals a surprising lack of robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination