CN116340944A - Malicious code classification method and system based on RGB image and lightweight model

Info

Publication number: CN116340944A
Application number: CN202310608993.XA
Authority: CN
Inventors: 赵大伟; 孙晨宇; 杨淑棉; 徐丽娟; 李鑫; 张雨鑫; 徐庆玲
Original assignee: Qilu University of Technology; Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Qilu University of Technology; Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2023-05-29
Filing date: 2023-05-29
Publication date: 2023-06-27
Anticipated expiration: 2043-05-29
Also published as: CN116340944B

Abstract

The invention belongs to the technical field of malicious code classification and provides a method and a system for classifying malicious code based on RGB images and a lightweight model. The method comprises the following steps: decompiling an original malicious code file to generate an asm file and a bytes file; extracting the operation code sequence from the asm file and the byte sequence from the bytes file; fusing a grayscale image and a Markov image generated from the operation code sequence with a Markov image generated from the byte sequence to obtain a fused RGB image; and inputting the fused RGB image into a trained lightweight model for classification. The operation code sequence and the byte sequence are extracted separately to obtain a grayscale image based on operation code frequency, a Markov image based on the operation code sequence, and a Markov image based on the byte sequence. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model.

Description

Malicious code classification method and system based on RGB image and lightweight model

Technical Field

The invention belongs to the technical field of malicious code classification, and particularly relates to a method and a system for classifying malicious codes based on RGB images and a lightweight model.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Traditional static detection of malicious code is simple and fast and does not consume many resources, but it cannot detect unknown families and is easily defeated by techniques such as polymorphism and metamorphism, which leads to missed detections and false alarms. As the number of variants grows ever faster, traditional static detection methods have become ineffective. In recent years, visualization techniques have been widely applied to malicious code detection. Compared with traditional static detection, a visualization method retains the feature information of a malicious sample in full, makes the differences between samples easier to observe, and mitigates the impact of obfuscation. Existing research, however, usually converts the malicious binary file to grayscale, extracts features from the grayscale image, and trains a neural network on them. A grayscale image is single-channel data, so the malicious code features it contains are few and of a single kind, and in a neural network it is less intuitive and less effective than a three-channel image.

Deep learning uses networks deep enough to hold richer semantic information, saves the large amount of manpower and material resources otherwise spent on feature analysis, and keeps improving as the amount of data grows. It also has drawbacks, such as heavy computation and high cost. Existing deep learning approaches cannot make full use of the rich information contained in malicious samples, and the networks are inefficient, so they cannot meet the real-time and accuracy requirements of code classification.

Disclosure of Invention

To address the problems that existing detection methods cannot make full use of the rich information contained in a malicious sample and that deep learning networks are inefficient, the invention discloses a malicious code classification method and system based on an RGB image and a lightweight model, which can fully extract the feature information in a malicious sample and greatly reduce the number of parameters and the amount of computation while maintaining accuracy.

To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

A first aspect of the invention provides a malicious code classification method based on an RGB image and a lightweight model, comprising the following steps:

decompiling an original malicious code file to generate an asm file and a bytes file;

extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;

extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;

carrying out feature fusion on a gray level image and a Markov image generated based on an operation code sequence and a Markov image generated based on a byte sequence to obtain a fused RGB image;

and inputting the fused RGB images into a trained lightweight model for classification.

A second aspect of the present invention provides a malicious code classification system based on RGB images and lightweight models, comprising:

a decompilation module configured to: decompiling an original malicious code file to generate an asm file and a bytes file;

an asm file extraction module configured to: extracting an operation code sequence in the asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;

a bytes file extraction module configured to: extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;

a fusion module configured to: carrying out feature fusion on a gray level image and a Markov image generated based on an operation code sequence and a Markov image generated based on a byte sequence to obtain a fused RGB image;

a classification module configured to: and inputting the fused RGB images into a trained lightweight model for classification.

The one or more of the above technical solutions have the following beneficial effects:

(1) The invention extracts the operation code sequence of the asm file and the byte sequence of the bytes file to obtain a grayscale image based on operation code frequency, a Markov image based on the operation code sequence, and a Markov image based on the byte sequence. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model. The three generated images all have a fixed size of 256×256, so no image normalization step is needed: the three single-channel images can be filled directly into the R, G and B channels and fused into a 256×256×3 RGB image. This fully extracts the feature information in a malicious sample and lets the different features complement one another.

(2) The invention designs a malicious code detection framework that combines MobileNet V2 with the CBAM attention mechanism. Starting from the original MobileNet V2 model, a CBAM module is added after the point convolution of the seventh inverse residual structure. The module effectively finds the key information in the RGB image and increases its weight, which addresses the problem that existing neural networks are becoming ever larger and markedly more expensive to compute.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

Fig. 1 is a flowchart of RGB image acquisition according to a first embodiment.

Fig. 2 is a structural diagram of a lightweight model of the first embodiment.

Fig. 3 is a schematic diagram of converting an extracted operation code sequence into a two-dimensional transition probability matrix according to the first embodiment.

Fig. 4 is a schematic diagram of converting the extracted byte sequence into a two-dimensional transition probability matrix according to the first embodiment.

Fig. 5 is a graph comparing performance of a lightweight model trained with different images according to the first embodiment.

Fig. 6 (a) and 6 (b) are graphs showing the training set accuracy and loss value variation of the first embodiment, respectively.

Fig. 7 (a) and 7 (b) are graphs showing the test set accuracy and loss value variation of the first embodiment, respectively.

Detailed Description

The general conception of the invention is as follows:

The invention uses the IDA Pro tool to convert a PE file into an asm file and a bytes file. The original operation code sequence in the asm file is extracted and converted into two different grayscale images by two different methods. The byte sequence in the bytes file is also extracted and converted into a grayscale image. Finally, the three different grayscale images are fused into an RGB color image, which is then input into a modified MobileNet V2 lightweight model for classification.

Example 1

As shown in fig. 1, the embodiment discloses a malicious code classification method based on an RGB image and a lightweight model, which includes:

step 1, decompiling an original malicious code file to generate an asm file and a bytes file;

step 2, extracting an operation code sequence in an asm file, and respectively converting the extracted operation code sequence into a gray level image and a Markov image;

step 3, extracting byte sequences in the bytes file, and visualizing the byte sequences into Markov images;

step 4, respectively filling a gray level image and a Markov image generated based on the operation code sequence and a Markov image generated based on the byte sequence into R, G, B channels to obtain a fused RGB image;

step 5, inputting the fused RGB image into a trained lightweight model for classification.

In step 1, the IDA tool is used to decompile the original malicious code file and generate an asm file and a bytes file, after which the feature information in the two files is extracted.

In step 2, a single opcode in an asm file is harmless in isolation, whereas a sequence of several opcodes may be malicious. The invention therefore uses an N-gram algorithm to extract sequences of three consecutive opcodes as features. Experiments show that subsequences of three opcodes extracted with the N-gram algorithm work best, since a single opcode carries little meaning.

On the other hand, the N-gram algorithm lacks long-range dependence: it can only model an item from the preceding N-1 items, so part of the features is lost. In addition, a large N in practice leads to the curse of dimensionality when extracting significant opcode sequences. For these reasons, the frequency of opcode sequences made up of three opcodes is counted to generate the grayscale image;

in the first R channel: the asm file is first generated with the disassembler IDA Pro, and opcode sequences made up of three opcodes are extracted as features using the N-gram algorithm;

all opcodes of each malicious asm file are extracted and every three consecutive opcodes are taken as a subsequence; for example, if the opcodes of a malicious asm sample are 'pop, push, mov, call', the generated subsequences are 'pop, push, mov' and 'push, mov, call';

the frequency of each subsequence is then calculated, the 256 most frequent subsequences in each malicious asm sample are selected (padded with 0 if there are fewer than 256), a 256×256 matrix is filled, and a grayscale image is finally generated;
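The subsequence extraction and frequency counting for the R channel can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `opcode_trigrams` and `top_trigram_frequencies` are hypothetical, and because the text does not spell out how the 256 frequencies are laid out in the 256×256 matrix, the sketch stops at the zero-padded frequency list.

```python
from collections import Counter


def opcode_trigrams(opcodes):
    """Return every run of three consecutive opcodes as a tuple."""
    return [tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]


def top_trigram_frequencies(opcodes, k=256):
    """Frequencies of the k most common trigrams, padded with 0 up to length k."""
    counts = Counter(opcode_trigrams(opcodes))
    freqs = [n for _, n in counts.most_common(k)]
    return freqs + [0] * (k - len(freqs))


# The example sequence from the text: 'pop, push, mov, call'.
sample = ["pop", "push", "mov", "call"]
trigrams = opcode_trigrams(sample)
freqs = top_trigram_frequencies(sample)
```

For the four-opcode example, `trigrams` holds the two subsequences named in the text, and `freqs` is a list of 256 values in which only the first two are non-zero.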

in the second G channel: the method extracts all operation code sequences of the .text section in the malicious asm files, generates a two-dimensional transition probability matrix, and finally visualizes that matrix as a Markov image. Visualizing the operation code sequence as a Markov image preserves the integrity of the extracted features as far as possible and improves the generalization capability of the model. The asm file is obtained by disassembling an executable file and has the same structure as the executable file; the .text section is the code section of the executable file, i.e. the memory area that stores the code executed by the program.

Specifically, the operation codes that occur most often across the extracted .text sections are those whose effect is prominent over the whole data set. Extracting the operation code sequence of a single malicious asm file and taking the difference against them amounts to finding the operation codes whose effect is prominent within that single file, which is a feature selection process.

As shown in fig. 3, in the second G channel: first, all operation code sequences of the .text section of all malicious asm files are extracted, the 255 most frequent operation codes are counted and placed in a one-dimensional array, 'aaa' is appended at the tail of the array, and the array is named a. Next, the operation code sequence of the .text section of a single malicious asm file is extracted and stored in an array b. It is then judged whether the operation code types in array a are the same as those in array b, i.e. the difference of the two arrays is computed with the setdiff1d function: if the operation code types of the two arrays are the same, the difference is empty; otherwise the unique values that are in array a but not in array b are output. Finally, these unique values are set to 'aaa', and the operation codes of the 256 types are output.

Finally, a sliding window of size 2 is set. For each pair of consecutive operation codes, the first is treated as the row and the second as the column, and 1 is added at the corresponding position. The window slides until the last operation code of the single file is reached, which yields a two-dimensional transition probability matrix that is then visualized as a Markov image.
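The sliding-window counting for the G channel can be sketched as below. This is only one reading of the procedure: the function name `opcode_transition_counts` is hypothetical, the vocabulary is a toy stand-in for the 255 most frequent operation codes plus 'aaa', and mapping every operation code outside the vocabulary to 'aaa' is an assumption about how the filler entry is used.

```python
def opcode_transition_counts(opcodes, vocabulary, filler="aaa"):
    """Count transitions between consecutive opcodes with a window of size 2.

    vocabulary lists the opcode types that index rows and columns and must
    contain the filler. Opcodes outside the vocabulary are mapped to the
    filler (an assumption of this sketch).
    """
    index = {op: i for i, op in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    mapped = [op if op in index else filler for op in opcodes]
    # First opcode of each pair selects the row, the second selects the column.
    for first, second in zip(mapped, mapped[1:]):
        matrix[index[first]][index[second]] += 1
    return matrix


# Toy vocabulary of 4 entries instead of 256.
vocab = ["mov", "push", "call", "aaa"]
counts = opcode_transition_counts(["push", "mov", "call", "xor", "mov"], vocab)
```

With the full 256-entry vocabulary the result is a 256×256 count matrix; turning counts into pixel values would presumably follow the same row normalization described for the byte sequence.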

As shown in fig. 4, in step 3, the third B channel: for consistency with the image of the G channel, a 256×256 matrix initialized to zero is also created, and the hexadecimal numbers in the malicious bytes file are converted into decimal numbers from 0 to 255;

after conversion, a sliding window of size 2 is set; for each pair of consecutive bytes, the first byte is treated as the row and the second as the column, and 1 is added at the corresponding position; the window slides until the last byte is reached, generating a two-dimensional transition frequency matrix;

the ratio of the frequency at each position of the matrix to the sum of the frequencies of its row is then calculated; because floating-point values are cumbersome to handle, the ratio is multiplied by 255, converted to an integer, and written into the zero matrix, which generates the two-dimensional transition probability matrix; the value at each position of this matrix is used as a pixel of the Markov image, and the Markov image is thereby visualized.
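The byte-sequence steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `byte_markov_image` is hypothetical, the input is assumed to be a list of two-digit hexadecimal strings already read from the bytes file, and truncation by integer division is an assumed way of converting the scaled ratio to an integer, since the text only says that an integer is formed.

```python
def byte_markov_image(hex_tokens):
    """Build a 256x256 Markov image from a sequence of hexadecimal byte strings."""
    # Convert hexadecimal strings such as "FF" to integers in 0..255.
    values = [int(token, 16) for token in hex_tokens]

    # Sliding window of size 2: first byte is the row, second byte the column.
    counts = [[0] * 256 for _ in range(256)]
    for first, second in zip(values, values[1:]):
        counts[first][second] += 1

    # Row-normalize, scale to 0..255 and store as integers.
    image = [[0] * 256 for _ in range(256)]
    for r, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            continue  # this byte value never starts a pair; keep the zero row
        for c, n in enumerate(row):
            image[r][c] = n * 255 // total
    return image


img = byte_markov_image(["00", "FF", "00", "FF", "00", "01"])
```

In this toy input, byte 0x00 is followed by 0xFF twice and by 0x01 once, so row 0 holds the scaled ratios 2/3 and 1/3, while row 255 has a single full-intensity pixel in column 0.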

In step 4, the invention takes into account that a single-feature image performs differently on different data sets. To perform well on all data sets, the three single-feature images are filled into the three RGB channels and fused into one color image, so that the resulting color image carries three times the effective information of a single image.

The three images generated by the method all have a fixed size of 256×256, so no image normalization step is needed: the three single-channel images can be filled directly into the R, G and B channels and fused into a 256×256×3 RGB image. This fully extracts the feature information in a malicious sample and lets the different features complement one another.
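The channel filling itself is a simple per-pixel stacking, sketched below. The function name `fuse_rgb` is hypothetical, and images are represented as nested lists of integers in 0..255 purely for illustration.

```python
def fuse_rgb(r, g, b):
    """Stack three equally sized single-channel images into one RGB image.

    Each input is a list of rows; the result has one (R, G, B) tuple per pixel.
    """
    height, width = len(r), len(r[0])
    for channel in (r, g, b):
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("all three channels must have the same size")
    return [
        [(r[y][x], g[y][x], b[y][x]) for x in range(width)]
        for y in range(height)
    ]


# 2x2 toy channels standing in for the three 256x256 images.
gray = [[10, 20], [30, 40]]
opcode_markov = [[1, 2], [3, 4]]
byte_markov = [[5, 6], [7, 8]]
rgb = fuse_rgb(gray, opcode_markov, byte_markov)
```

Because all three source images are generated at 256×256, the size check never fails in the described pipeline and no resizing step is required before fusion.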

In step 5, starting from the original MobileNet V2 model, the invention adds a CBAM module after the point convolution of the seventh inverse residual structure. CBAM (Convolutional Block Attention Module) is a lightweight attention module. It effectively finds the key features for classification and detection and increases their weight, which reduces information loss, improves the performance of the model, and makes the neural network model more interpretable.
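For reference, the attention computation of a CBAM module can be written as below. The equations follow the commonly published CBAM formulation and are not reproduced from this patent. Here F is the input feature map, M_c and M_s are the channel and spatial attention maps, σ is the sigmoid function, ⊗ is element-wise multiplication, and the 7×7 kernel of f is the value used in that formulation, which the text here does not confirm.

```latex
\begin{aligned}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F' &= M_c(F) \otimes F,\\
M_s(F') &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F');\,\mathrm{MaxPool}(F')])\big), & F'' &= M_s(F') \otimes F'.
\end{aligned}
```

Channel attention reweights whole feature channels and spatial attention reweights positions, which is how such a module can raise the weight of key information in the fused RGB image.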

The MobileNet V2 model structure comprises an input layer, seven inverse residual structures, two standard convolution layers, a global average pooling layer, a fully connected layer and an output layer. Each inverse residual structure first applies a 1×1 convolution with the ReLU6 activation function; then the convolution referred to here as point convolution, with a 3×3 kernel and the ReLU6 activation function; and finally a 1×1 convolution with a linear activation function. Table 1 shows the structure of the modified MobileNet V2 model of the present invention.

TABLE 1

Here t is the expansion factor, i.e. the factor by which the first 1×1 convolution expands the number of channels; c is the number of output feature channels; n is the number of times the bottleneck is repeated; s is the stride, which controls the feature map size.
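The role of the expansion factor t and of ReLU6 can be illustrated with a short sketch. The function names are hypothetical, and the example values (16 input channels, 24 output channels, t = 6) are assumptions for illustration, since Table 1 is not reproduced here.

```python
def relu6(x):
    """ReLU6 activation: clip the input to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def inverse_residual_channels(c_in, c_out, t):
    """Channel counts (in, out, description) of the three stages of one block."""
    hidden = c_in * t  # the first 1x1 convolution expands the channels by t
    return [
        (c_in, hidden, "1x1 convolution + ReLU6"),
        (hidden, hidden, "3x3 convolution + ReLU6"),
        (hidden, c_out, "1x1 convolution, linear"),
    ]


stages = inverse_residual_channels(c_in=16, c_out=24, t=6)
```

With these example values the block widens from 16 to 96 channels, filters at that width, and then projects down to 24 channels without a non-linearity.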

As shown in table 2, the input image size is 256×256×3. The output size is 128×128×32 after the first standard convolution layer, 128×128×16 after the first inverse residual structure, 64×64×24 after the second, 32×32×32 after the third, 16×16×64 after the fourth, 16×16×96 after the fifth, 8×8×160 after the sixth, and 8×8×320 after the seventh. A convolution layer with a 1×1 kernel then produces an output of 8×8×1280, which global average pooling reduces to 1×1×1280. Finally, the result is output through a convolution layer acting as the fully connected layer, in which the 1280 output neurons are fully connected with the 1000 neurons of the Softmax layer.
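The feature map sizes listed above can be checked with a small trace. The stride and channel values in `STAGES` are assumptions chosen to match the sizes stated in the text and the standard MobileNet V2 configuration, since Table 1 is not reproduced here. The padding is likewise assumed to give an output size of ceil(size / stride).

```python
def out_size(size, stride):
    """Spatial size after a layer with the given stride (ceil(size / stride))."""
    return -(-size // stride)


# (name, output channels, stride); values are assumptions, see the note above.
STAGES = [
    ("standard convolution", 32, 2),
    ("inverse residual 1", 16, 1),
    ("inverse residual 2", 24, 2),
    ("inverse residual 3", 32, 2),
    ("inverse residual 4", 64, 2),
    ("inverse residual 5", 96, 1),
    ("inverse residual 6", 160, 2),
    ("inverse residual 7", 320, 1),
    ("1x1 convolution", 1280, 1),
]


def trace_shapes(input_size=256):
    """Return (name, (height, width, channels)) for every stage in order."""
    size = input_size
    shapes = []
    for name, channels, stride in STAGES:
        size = out_size(size, stride)
        shapes.append((name, (size, size, channels)))
    return shapes


shapes = trace_shapes()
for name, (h, w, c) in shapes:
    print(f"{name}: {h}x{w}x{c}")
```

The trace halves the spatial size at each stride-2 stage, going from 256 down to 8, while the channel count grows to 1280 before pooling.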

Training process: RGB images of size 256×256×3 are input into the modified MobileNet V2 model for training. After two convolution layers, seven inverse residual structures, one average pooling layer and one fully connected layer, a Softmax layer performs the family classification. The batch size, loss function, optimizer, number of training epochs and learning rate are set according to the malware samples used. The modified MobileNet V2 model hyper-parameter settings are shown in figure 2.

TABLE 2

This embodiment is validated on the public Big2015 data set. To determine which family an unknown malicious file belongs to, the features extracted from the unknown file are compared with those of malicious files from known families.

To explore the detection performance of grayscale images and RGB images in a neural network, the grayscale image based on operation code frequency, the Markov image based on the byte sequence, the Markov image based on the operation code sequence, and the fused RGB image are each input into the improved lightweight model for classification training, to find out whether grayscale and color images differ in classification performance. The experimental results are shown in fig. 5.

As fig. 5 shows, the individual grayscale images give different results when used to train the model of the invention, but none matches the fused RGB image overall. When the model is trained with the RGB image, the accuracy reaches up to 99.9% and the F1-score up to 99.5%.

To explore the effectiveness of the lightweight model proposed in this embodiment, the invention is compared with traditional classification models. The curves of training set accuracy and loss against the number of training epochs are shown in fig. 6 (a) and fig. 6 (b), and the test set accuracy and loss are shown in fig. 7 (a) and fig. 7 (b).

As the figures show, compared with the other traditional classification models, the curve of the model of the invention remains stable throughout and has the highest accuracy and the lowest loss on both the training set and the test set. The lightweight model of the invention therefore outperforms the other models (AlexNet, VGG, ResNet50 and MobileNet V1) and converges relatively quickly.

Example 2

This embodiment discloses a malicious code classification system based on an RGB image and a lightweight model, comprising:

The lightweight model is an improved MobileNet V2 model; the improvement to the MobileNet V2 model includes adding an attention module after the point convolution of the seventh inverse residual structure of the MobileNet V2 model.

It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims

1. A malicious code classification method based on an RGB image and a lightweight model, comprising:

fusing a gray level image and a Markov image generated based on the operation code sequence and a Markov image generated based on the byte sequence to obtain a fused RGB image;

2. The method of claim 1, wherein extracting the operation code sequence in the asm file, and converting the extracted operation code sequence into a gray scale map comprises:

extracting all operation codes of each malicious sample in the asm file by using an N-gram algorithm;

selecting every three consecutive operation codes as a subsequence, and calculating the frequency of each subsequence;

selecting the 256 most frequent subsequences in each malicious sample, padding with 0 if there are fewer than 256, filling a 256×256 matrix, and finally generating a grayscale image.

3. The method of claim 1, wherein extracting the operation code sequence in the asm file, converting the extracted operation code sequence into a markov image comprises:

extracting all operation code sequences of the .text section;

counting the 255 most frequent operation codes, and appending "aaa" at the tail;

creating a matrix of size 256×256 initialized to zero;

judging whether the operation code types of a single file are the same as the 256 operation codes; if so, the result is empty, otherwise the operation codes that differ are output;

judging whether each output operation code is of the same type as an operation code of the single file, and setting it to "aaa" if it is not;

setting a sliding window with the size of 2, regarding the first operation code as a row and the second operation code as a column in two continuous operation codes, adding 1 to the corresponding position, sliding until the last operation code of a single file is finished, and then generating a two-dimensional transition probability matrix;

the two-dimensional transition probability matrix is visualized as a Markov image.

4. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 1, wherein the extracting the byte sequence in the bytes file, visualizing the byte sequence as a markov image, comprises:

creating a matrix of size 256×256 initialized to zero;

converting hexadecimal numbers in the bytes file into decimal numbers of 0-255;

setting a sliding window with the size of 2, regarding the first byte as a row and the second byte as a column in two continuous bytes, adding 1 at a corresponding position, sliding until the last byte is finished, and generating a two-dimensional transfer frequency matrix;

calculating the ratio of the frequency number of each position on the two-dimensional transfer frequency matrix to the sum of the frequency numbers of each row;

multiplying the ratio by 255, forming integer, and filling the integer into a zero matrix to generate a two-dimensional transition probability matrix;

5. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 1, wherein the feature fusion of the grayscale image and the Markov image generated based on the operation code sequence and the Markov image generated based on the byte sequence to obtain the fused RGB image comprises: filling the grayscale image and the Markov image generated based on the operation code sequence and the Markov image generated based on the byte sequence into an R channel, a G channel and a B channel respectively, and fusing them into a color RGB image.

6. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 1, wherein the lightweight model is an improved MobileNet V2 model;

the improvement to the MobileNet V2 model includes adding an attention module after the point convolution of the seventh inverse residual structure of the MobileNet V2 model.

7. The method for classifying malicious code based on an RGB image and a lightweight model according to claim 6, wherein the inputting the fused RGB image into the trained lightweight model for classification comprises: inputting the generated RGB image into the improved MobileNet V2 model for training to obtain a trained classification model, carrying out feature fusion on malicious files to be detected to generate the RGB image, and inputting the RGB image into the trained classification model to obtain a malicious family classification result.

8. A malicious code classification system based on RGB images and lightweight models, comprising:

9. A malicious code classification system based on RGB images and a lightweight model according to claim 8, wherein the lightweight model is an improved MobileNet V2 model.

10. The RGB image and lightweight model-based malicious code classification system of claim 9, wherein the improvement to the MobileNet V2 model includes adding an attention module after the point convolution of the seventh inverse residual structure of the MobileNet V2 model.