CN117496246A

CN117496246A - Malicious software classification method based on convolutional neural network

Info

Publication number: CN117496246A
Application number: CN202311489175.9A
Authority: CN
Inventors: 魏林锋; 黎庭威; 何卓丰; 王宇; 宣建通; 陈佳韩; 李健; 李学明; 黄宇勤; 丁振杨; 欧燕; 杨子怡; 孙炜; 唐英展
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2023-11-09
Filing date: 2023-11-09
Publication date: 2024-02-02

Abstract

The invention discloses a malware classification method based on a convolutional neural network. The method comprises the following steps: collecting a malware sample data set; converting each malware sample into a grayscale image; enhancing the local contrast of each grayscale image and the contrast between images while suppressing noise amplification; and inputting the grayscale image into an EfficientNet-B0 model to obtain a more refined feature vector, which is regularized and then passed to a Softmax function to determine the malware family to which the sample belongs. Because the method converts malware samples into grayscale images and uses the EfficientNet-B0 model, which achieves high accuracy with few parameters, it can identify the family of a malware sample efficiently, offers some ability to discover unknown malware attacks, and can be extended to malware identification scenarios on other platforms.

Description

Malicious software classification method based on convolutional neural network

Technical Field

The invention belongs to the fields of computer system security, network security and artificial intelligence security application, and particularly relates to a malicious software classification method based on a convolutional neural network.

Background

Malware is software that is installed and run on a user's device without the user's permission and that infringes on the user's legal rights and interests. Because malware authors create variants by reusing core code, malware is easy to write, and automated platforms for generating malware and its variants are common. As a result, malware is numerous and widespread; it poses a serious threat to enterprises, governments, financial institutions and the like, and can even damage a user's software and hardware systems and cause great economic loss. Since most malware is generated automatically or derived from existing malware, the most important step in detecting and defending against it is to classify each sample efficiently into a known malware family, after which existing detection and protection methods can be applied to block the attack. To address these problems, the invention provides a malware classification technique and method based on a convolutional neural network.

Current malware detection methods are mainly divided into static analysis and dynamic analysis. Static analysis examines an application's code files without executing it. Although static analysis provides the most comprehensive code coverage at higher speed and with less overhead, obfuscation and encryption techniques can degrade its performance and effectiveness. Dynamic analysis monitors an application running in a sandbox environment and collects its behavior information; it copes better with obfuscation and encryption, but it requires longer analysis time and higher memory overhead. Furthermore, running an application in a sandbox does not cover all possible code execution paths and running scenarios, and some malware can detect the sandbox environment, so malicious behavior may not appear during dynamic analysis.

Image-based malware detection does not need to extract features from the original sample, generates images quickly, and performs better against malware that uses obfuscation and encryption techniques. The code of a malware sample consists of many bytes, and image-based malware detection maps each byte to one pixel, so code execution instructions are converted into a sequence of pixel values. Locating similar instruction sequences in different malware samples is then equivalent to identifying regions with similar pixel values in their corresponding images. However, similar instruction sequences in samples of the same malware family may sit at different locations in their files, which reduces the accuracy of the classification model. In addition, the large number of parameters in traditional convolutional neural networks leads to low classification efficiency and poor generalization for CNN-based malware classification.

In summary, methods that classify malware by converting samples into grayscale images and applying a convolutional neural network still need further improvement in accuracy, generalization and efficiency.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing classification schemes and provides a malware classification method based on a convolutional neural network. The method first converts the malware sample into a grayscale image, which avoids the large time cost of feature extraction. It then applies contrast-limited adaptive histogram equalization to the grayscale image; compared with ordinary histogram equalization, this enhances the local contrast of the image, avoids amplifying image noise, and strengthens the contrast between images. Finally, the resulting image is input into a classifier. The classifier removes the last fully connected layer of the EfficientNet-B0 model, keeps all layers before it, and adds a global average pooling layer and a Softmax layer after the retained layers. The global average pooling layer has fewer parameters than a fully connected layer and acts as a regularizer, and the EfficientNet-B0 model achieves high accuracy with a small number of parameters and little computation. The method therefore achieves higher accuracy with less detection time.

The technical scheme adopted by the invention is as follows:

a malicious software classification method based on a convolutional neural network comprises the following steps:

s1) marking the software samples in a malware data set, converting each byte of a software sample, in byte order, into a decimal number in [0,255], converting the decimal numbers into a first grayscale image, and dividing the first grayscale images into a training set and a test set through a cross-validation function, ensuring that the proportion of each malware category in the training set and the test set is consistent with that of the original data set;

s2) image enhancement: the first grayscale image is processed with a contrast-limited adaptive histogram equalization method to obtain a second grayscale image with enhanced local contrast;

s3) feature extraction: inputting the second grayscale image into an EfficientNet-B0 model, extracting features, and outputting more refined feature vectors with stronger expressive power to obtain a third grayscale image;

s4) image classification: the third grayscale image is input to the global average pooling layer, which outputs a one-dimensional vector; the one-dimensional vector is then input to the Softmax layer, which converts it into a probability distribution. Each element of the output vector lies between 0 and 1 and represents the probability that the sample belongs to a particular malware family, the elements sum to 1, and the category with the highest probability is selected as the prediction result.
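To illustrate the pooling in step s4, the following minimal sketch averages each channel of an h×w×d feature map down to a single value, which yields the length-d vector passed on to Softmax. The nested-list representation and the function name `global_average_pool` are assumptions made for the example; a real model would rely on the pooling layer of a deep learning framework.

```python
from typing import List


def global_average_pool(feature_map: List[List[List[float]]]) -> List[float]:
    """Reduce an h x w x d feature map to a length-d vector of channel means."""
    height = len(feature_map)
    width = len(feature_map[0])
    depth = len(feature_map[0][0])
    sums = [0.0] * depth
    for row in feature_map:
        for pixel in row:
            for channel, value in enumerate(pixel):
                sums[channel] += value
    return [total / (height * width) for total in sums]


# A 2 x 2 feature map with d = 2 channels (illustrative values only).
fmap = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
pooled = global_average_pool(fmap)
```

The layer has no trainable weights, which is consistent with the statement that it carries fewer parameters than a fully connected layer.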

In some examples of malware classification methods, the image enhancement specifically includes:

a) Dividing an original image into a plurality of regions;

b) Calculating a cumulative distribution function CDF of pixel values in the image area;

c) Judging whether the frequency value of a pixel value in the image area is higher than a preset frequency threshold; if so, performing a clipping operation using an image thresholding function and randomly assigning the pixels above the preset frequency threshold to values within the range [0,255], to ensure that no pixel value has a frequency above the threshold;

d) Transforming each region using interpolation so that pixel values are correlated with each other, which limits noise amplification and enhances the contrast of the image.

In some examples of malware classification methods, the EfficientNet-B0 model is built from a series of MBConv modules optimized with the squeeze-and-excitation method.

In some examples of malware classification methods, the image thresholding function is that of a Python CV2 library.

In some examples of malware classification methods, a third party open source library CV2 of Python is used to convert decimal numbers into a first grayscale image.

In some examples of malware classification methods, during the clipping operation, the portion of each pixel value's count that exceeds the frequency threshold is distributed evenly across the 256 bins for values 0-255; if some of the excess cannot be distributed evenly, it is inserted into the bins one at a time at equal intervals until all of the excess has been allocated.
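The redistribution described above can be sketched as follows. This is a minimal single-pass illustration, not the patented implementation: the function name `clip_histogram`, the integer `clip_limit` and the choice to start the equal-interval insertion at bin 0 are assumptions for the example.

```python
from typing import List, Sequence


def clip_histogram(hist: Sequence[int], clip_limit: int) -> List[int]:
    """Clip each bin at clip_limit and spread the excess over all bins."""
    bins = len(hist)
    clipped = [min(count, clip_limit) for count in hist]
    excess = sum(hist) - sum(clipped)

    # Every bin receives the same whole-number share of the excess.
    share, remainder = divmod(excess, bins)
    clipped = [count + share for count in clipped]

    # What cannot be shared evenly is inserted at equal intervals.
    if remainder:
        step = bins // remainder  # remainder < bins, so step >= 1
        for k in range(remainder):
            clipped[k * step] += 1
    return clipped


# A region of 1024 pixels dominated by two values (illustrative data).
hist = [0] * 256
hist[0], hist[10] = 600, 424
result = clip_histogram(hist, clip_limit=100)
```

The total pixel count is preserved. In a single pass the redistributed counts can push a clipped bin slightly above the limit again; whether to iterate until that no longer happens is a design choice the text does not specify.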

In some examples of malware classification methods, the region length is determined by the instruction length of its samples, and the width of the region is determined by the average height of all malware samples of the same series.

In some examples of malware classification methods, the malware data set is a data set Malimg having multiple malware types.

In some examples of malware classification methods, malware is tagged with the open-source antivirus software ClamAV.

In some examples of malware classification methods, the cross-validation function is StratifiedKFold.

In some examples of malware classification methods, a pre-set frequency threshold is set with reference to known publications.

The beneficial effects of the invention are as follows:

Compared with prior classification techniques, which suffer from low accuracy, poor generalization and low efficiency, the invention has the following advantages:

(1) High classification efficiency: the method converts the malware sample into an image; the conversion is fast, it saves the time required for static and dynamic feature extraction, and the model has few parameters and little computation.

(2) High accuracy: enhancing the local contrast of the original image and using a model with high classification accuracy improve the accuracy of classification.

(3) High generalization: the fully connected layer of the EfficientNet-B0 model, whose excessive parameters reduce the model's generalization capability, is removed; a global average pooling layer simplifies the feature vector instead, which acts as regularization and enhances generalization.

(4) Visualization: converting malware samples into images makes the differences between malware families easy to observe intuitively.

Drawings

FIG. 1 is a flow chart of a convolutional neural network-based malware classification method of the present invention.

FIG. 2 is a flow chart of an image enhancement process of the convolutional neural network-based malware classification method of the present invention.

FIG. 3 is a flow chart of an image classification process of the convolutional neural network-based malware classification method of the present invention.

Detailed Description


There is no special requirement on the source of the malware data set, provided that it covers a full range of sample types and is easy to obtain. In some examples of malware classification methods, the malware data set is the Malimg data set, which contains multiple malware types.

Various labeling tools may be used to mark malware. In some examples of malware classification methods, malware is marked with the open-source antivirus software ClamAV, considering the accessibility of the program.

The decimal numbers may be converted into the first gray scale image using various well known algorithms. In some examples of malware classification methods, decimal numbers are converted to a first grayscale image using a third party open source library CV2 of Python, taking into account the accessibility of the program.
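As a rough illustration of this conversion, the sketch below maps each byte to one pixel and wraps the byte stream into fixed-width rows. The function name `bytes_to_gray_rows`, the default width of 256 and the zero padding of the last row are assumptions made for the example; the text does not specify them.

```python
from typing import List


def bytes_to_gray_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Map each byte (0-255) to one pixel and wrap the stream into rows."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad the tail with zeros so the last row is complete (an assumption).
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Five illustrative bytes wrapped into rows of four pixels.
sample = bytes([0, 17, 255, 128, 64])
image = bytes_to_gray_rows(sample, width=4)
```

The resulting rows can be turned into an 8-bit array with `numpy.asarray(image, dtype=numpy.uint8)` and written to disk with `cv2.imwrite` if an image file is needed.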

There is no special requirement for the cross-validation function; in some examples of malware classification methods it is StratifiedKFold.
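The point of a stratified split is that every malware family keeps the same share in the training and test sets. The sketch below shows that idea with the standard library only; it is a simple hold-out split rather than the k-fold procedure of scikit-learn's `StratifiedKFold`, and the function name, the 0.2 test fraction and the seed are assumptions for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with per-class proportions kept."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every family for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Family names are used here only as illustrative labels.
labels = ["Allaple.A"] * 10 + ["Yuner.A"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With these labels, both families contribute 20% of their samples to the test set, so the 2:1 class ratio of the original data is kept in both subsets.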


The cumulative distribution function CDF can be calculated as follows:

$$\mathrm{cdf}(i) = \sum_{j=0}^{i} n_j, \quad 0 \le i < L$$

where L is the total number of gray levels, 256, and $n_j$ is the probability that pixel value j occurs in the image area.
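A minimal sketch of this calculation is given below, treating the cumulative distribution as a running sum of the per-value probabilities n_j within one region. The function name `region_cdf` and the flat list of pixel values are assumptions for the example.

```python
from typing import List, Sequence


def region_cdf(pixels: Sequence[int], levels: int = 256) -> List[float]:
    """Return cdf(i), the share of pixels in the region with value <= i."""
    if not pixels:
        raise ValueError("region must contain at least one pixel")
    counts = [0] * levels
    for value in pixels:
        counts[value] += 1

    total = len(pixels)
    cdf: List[float] = []
    running = 0
    for count in counts:
        running += count  # integer running sum avoids float drift
        cdf.append(running / total)
    return cdf


# A tiny four-pixel region (illustrative values only).
region = [0, 0, 1, 3]
cdf = region_cdf(region)
```

The last entry is always 1, since every pixel has a value of at most L-1.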

In some examples of malware classification methods, the EfficientNet-B0 model is built from a series of MBConv modules optimized with the squeeze-and-excitation method. Specifically, when constructing the EfficientNet-B0 model, the mobile inverted bottleneck convolution (MBConv) module from MobileNetV2 is used as the main building block, and a multi-objective neural architecture search is then performed on this basis to determine the baseline network, EfficientNet-B0. The MBConv module in the EfficientNet-B0 model is based on depthwise separable convolution and is optimized with the squeeze-and-excitation method from SENet. The EfficientNet-B0 model can be regarded as an efficient feature extractor: after a series of operations such as convolution, pooling and activation, the image with enhanced local contrast yields feature vectors that are more refined and have stronger expressive power.
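The squeeze-and-excitation idea mentioned above can be sketched in three stages: average each channel (squeeze), pass the averages through a small bottleneck to obtain one gate per channel (excitation), and rescale the channels by their gates. The sketch below follows the ReLU and sigmoid activations of SENet and omits biases; the weight matrices, shapes and function names are hypothetical and are not taken from the text.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, vector: Vector) -> Vector:
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def squeeze_excite(feature_map: List[List[Vector]],
                   w_reduce: Matrix, w_expand: Matrix) -> List[List[Vector]]:
    """Rescale each channel of an h x w x d feature map by a learned gate."""
    height, width = len(feature_map), len(feature_map[0])
    depth = len(feature_map[0][0])

    # Squeeze: one mean per channel.
    squeezed = [
        sum(pixel[c] for row in feature_map for pixel in row) / (height * width)
        for c in range(depth)
    ]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_expand, hidden)]

    # Scale: multiply every channel by its gate.
    return [
        [[value * gate for value, gate in zip(pixel, gates)] for pixel in row]
        for row in feature_map
    ]


# Hypothetical weights: d = 2 channels reduced to 1 hidden unit and back.
fmap = [[[2.0, 4.0], [6.0, 8.0]]]
scaled = squeeze_excite(fmap, w_reduce=[[1.0, 1.0]], w_expand=[[0.0], [0.0]])
```

With the all-zero expansion weights used here, every gate is sigmoid(0) = 0.5, so each channel is simply halved; trained weights would instead emphasize some channels over others.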

The image thresholding function may be a variety of known functions. In some examples of malware classification methods, the image thresholding function is that of the Python CV2 library, considering the accessibility of the algorithm.

In some examples of the malware classification method, during image classification the image is input into the EfficientNet-B0 model, which, after a series of operations such as convolution, pooling and activation, outputs a more refined feature vector with stronger expressive power. The global average pooling layer then provides regularization: it reduces the three-dimensional input of size w×h×d (width, height, depth) to a one-dimensional output of size 1×1×d, which is passed to the Softmax function. The Softmax function receives a vector z containing K real numbers and converts it into a probability distribution over K outcomes, each probability being proportional to the exponential of the corresponding input. The corresponding function is:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

The Softmax function first exponentiates each element of the input vector z, giving $e^{z_i}$ (where $z_i$ is the i-th element of z), and then adds all the exponentials to obtain their sum $\sum_{j=1}^{K} e^{z_j}$. Each element is normalized by dividing its exponential by this sum, giving the output Softmax($z_i$); each element of the output vector represents the probability of one malware family.
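The Softmax computation and the final choice of the most probable family can be sketched as follows. Subtracting the maximum score before exponentiating is a common numerical safeguard that does not change the result; it is an addition made for this example, as are the function name and the sample scores.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Convert K real-valued scores into K probabilities that sum to 1."""
    # Shifting by max(z) leaves the ratios unchanged but avoids overflow.
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [value / total for value in exps]


# Illustrative scores for three malware families.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
# The predicted family is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the exponential is monotonic, the family with the largest score always receives the largest probability.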

It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other, and the present invention will be further described in detail with reference to the drawings and the specific embodiments.

The invention provides a malware classification method based on a convolutional neural network. The method converts the malware sample into a grayscale image, which avoids the large time cost of feature extraction; it then enhances the local contrast of the grayscale image using contrast-limited adaptive histogram equalization; finally, it inputs the resulting image into an EfficientNet-B0 model whose last fully connected layer has been removed, with a global average pooling layer and a Softmax layer added after the model, to determine which malware family the sample belongs to. As shown in FIGS. 1 to 3, the method specifically comprises the following steps:

Step one, obtain the public malware data set Malimg, mark the data set using ClamAV, and convert each byte, in byte order, into a decimal number in [0,255]. A third-party open-source Python library is used to convert these numbers into a two-dimensional grayscale image. The grayscale images are divided into a training set and a test set through the cross-validation function StratifiedKFold, which ensures that the proportion of each malware category in the training set and the test set is consistent with that of the original data set.

Step two, image enhancement.

The first step is to divide the grayscale image into a plurality of regions;

the second step is to calculate the cumulative distribution function (CDF) frequency value of each image region, as follows:

$$\mathrm{cdf}(i) = \sum_{j=0}^{i} n_j, \quad 0 \le i < L$$

where L is the total number of gray levels, 256, and $n_j$ is the probability that pixel value j appears in the image area;

judging whether the frequency value cdf (i) of the pixel is higher than a preset frequency threshold value, if yes, performing clipping operation by using an image threshold processing function of a CV2 library, and randomly endowing a part of the pixels with values in the range of [0,255], so that the frequency value of the pixel can be ensured to be higher than the frequency threshold value of clipping limit;

the fourth step is to transform each region by using interpolation in order to correlate pixel values within each region, limit amplification of noise and enhance contrast of the image.
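One common way to realize the interpolation in the fourth step is to give every region its own lookup table and to blend the tables of the four nearest regions for each pixel. The sketch below assumes bilinear weights and 256-entry lookup tables; the text only says "interpolation", so the bilinear choice, the function name and the parameters `fx` and `fy` are assumptions for the example.

```python
from typing import Sequence


def blend_tile_mappings(value: int,
                        top_left: Sequence[int], top_right: Sequence[int],
                        bottom_left: Sequence[int], bottom_right: Sequence[int],
                        fx: float, fy: float) -> int:
    """Bilinearly blend four region lookup tables for one pixel.

    fx and fy in [0, 1] are the pixel's horizontal and vertical offsets
    between the centres of the neighbouring regions.
    """
    top = (1 - fx) * top_left[value] + fx * top_right[value]
    bottom = (1 - fx) * bottom_left[value] + fx * bottom_right[value]
    return int(round((1 - fy) * top + fy * bottom))


# Two illustrative lookup tables: one leaves values unchanged, one inverts.
identity = list(range(256))
inverted = [255 - v for v in identity]

# A pixel of value 100 sitting exactly on the left regions' centre line.
left_only = blend_tile_mappings(100, identity, inverted, identity, inverted,
                                fx=0.0, fy=0.5)
# The same pixel on the right regions' centre line.
right_only = blend_tile_mappings(100, identity, inverted, identity, inverted,
                                 fx=1.0, fy=0.5)
```

Because the weights change smoothly with position, neighbouring pixels on either side of a region border receive nearly the same mapping, which avoids visible seams between regions.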

Step three, when constructing the EfficientNet-B0 model, the mobile inverted bottleneck convolution (MBConv) module from MobileNetV2 is used as the main building block, and a multi-objective neural architecture search is then performed on this basis to determine the baseline network, EfficientNet-B0. The MBConv module in the EfficientNet-B0 model is based on depthwise separable convolution and is optimized with the squeeze-and-excitation method from SENet. The EfficientNet-B0 model can be regarded as an efficient feature extractor: after a series of operations such as convolution, pooling and activation, the image with enhanced local contrast yields feature vectors that are more refined and have stronger expressive power.
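Part of the parameter saving attributed to this design comes from depthwise separable convolution, which replaces one k×k convolution over all channels with a per-channel k×k convolution followed by a 1×1 convolution. The arithmetic below compares the weight counts; the layer sizes (3×3 kernel, 32 input channels, 64 output channels) are illustrative assumptions and are not taken from the text.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise plus a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
```

For these sizes the standard layer needs 18432 weights and the separable one 2336, roughly an eightfold reduction.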

Step four, the image is input into the EfficientNet-B0 model, which, after a series of operations such as convolution, pooling and activation, outputs a more refined feature vector with stronger expressive power. The global average pooling layer then provides regularization: it reduces the three-dimensional input of size w×h×d (width, height, depth) to a one-dimensional output of size 1×1×d, which is passed to the Softmax function. The Softmax function receives a vector z containing K real numbers and converts it into a probability distribution over K outcomes, each probability being proportional to the exponential of the corresponding input. The corresponding function is:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Comparison of classification efficiency for different algorithms:

On the ImageNet dataset, the EfficientNet-B0 model achieves higher accuracy than ResNet50 and DenseNet169 with the fewest parameters and the fewest FLOPs (floating point operations). EfficientNets were also evaluated on 8 common transfer learning datasets; the results show that they reached state-of-the-art accuracy on 5 of them with far fewer parameters, indicating that EfficientNets have good accuracy, performance and transferability.

The above is a further detailed description of the present invention and should not be taken as limiting its practice. Simple deductions or substitutions made by those skilled in the art without departing from the concept of the present invention fall within its scope.

Claims

1. A malicious software classification method based on a convolutional neural network comprises the following steps:

s2) image enhancement: processing the first gray level image by adopting a self-adaptive histogram equalization method for limiting contrast ratio to obtain a second gray level image for enhancing local contrast ratio;

2. The malware classification method according to claim 1, wherein the image enhancement specifically comprises:

a) Dividing an original image into a plurality of regions;

c) Judging whether the frequency value of a pixel value in the image area is higher than a preset frequency threshold; if so, performing a clipping operation using an image thresholding function and randomly assigning values in the range [0,255] to the pixels above the preset frequency threshold, so as to ensure that no pixel value has a frequency above the threshold;

3. The malware classification method of claim 1, wherein the EfficientNet-B0 model is built from a series of MBConv modules optimized with the squeeze-and-excitation method.

4. The malware classification method of claim 2, wherein the image thresholding function is an image thresholding function of a Python CV2 library; and/or

The decimal numbers are converted into a first gray scale image using a third party open source library CV2 of Python.

5. The malware classification method according to claim 2 or 4, wherein, in the clipping operation, the portion of each pixel value's count that exceeds the frequency threshold is distributed evenly across the 256 bins for values 0-255, and if some of the excess cannot be distributed evenly, it is inserted into the bins one at a time at equal intervals until all of the excess has been allocated.

6. The method of claim 1, wherein the length of the region is determined by the instruction length of the sample, and the width of the region is determined by the average height of all malware samples in the same series.

7. The malware classification method according to claim 1, wherein the malware data set is a data set Malimg having a plurality of malware types.

8. The malware classification method according to claim 1, wherein malware is marked using the open-source antivirus software ClamAV.

9. The malware classification method of claim 1, wherein the cross-validation function is StratifiedKFold.

10. The malware classification method according to claim 1, wherein the preset frequency threshold is set with reference to known publications.