CN111552964A

CN111552964A - A Malware Classification Method Based on Static Analysis

Info

Publication number: CN111552964A
Application number: CN202010264024.3A
Authority: CN
Inventors: 李静梅; 白丹; 彭弘; 薛迪
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2020-08-18

Abstract

The invention belongs to the technical field of computer security and relates in particular to a malware classification method based on static analysis. Malware is converted into binary files, from which grayscale images are generated; a convolutional neural network model with a spatial pyramid pooling layer is trained on the grayscale images to obtain a static classifier; and the static classifier assigns each malware sample to its family. The invention uses the grayscale image as the classification feature and effectively reduces the information loss caused by the image preprocessing stage. By analyzing the contour features of malware, the invention classifies malware and helps professionals reduce the cost of identifying it.

Description

A Malware Classification Method Based on Static Analysis

Technical Field

The invention belongs to the technical field of computer security and relates in particular to a malware classification method based on static analysis.

Background

With the rapid development of the Internet industry, people have become increasingly dependent on software of all kinds, which greatly facilitates the attack and spread of malware. Because automated tools keep emerging, malware is discovered far more slowly than new variants appear on the Internet. For example, Kaspersky Lab detected 15,714,700 malicious objects in 2017. In the first quarter of 2018, McAfee Labs detected 7.9 million malicious files per day, 4.5 million more than in the fourth quarter of 2017. Although variants are derived ever faster, the vast majority of malware evolves from known malware through polymorphism and metamorphism. Discovering homology among samples therefore plays a very important role in tracing attack organizations, reconstructing the operating environment, and preventing attacks.

As malware techniques keep improving and application software reaches an ever wider population of users, malware spreads further as users carry out certain operations, yet the vast majority of it has evolved from known malware. Although a great deal of research has been done, malware remains rampant. Dynamic analysis methods are relatively accurate but inefficient, and they incur excessive classification cost during analysis. Compared with dynamic analysis, static analysis offers high classification accuracy and better efficiency. Studying malware classification methods based on static analysis is therefore an important topic. Research on widely applicable, practical malware classification techniques that improve the security of computer systems has significant theoretical value and practical importance.

Summary of the Invention

The purpose of the present invention is to provide a malware classification method based on static analysis that classifies malware by analyzing its contour features and helps professionals reduce the cost of identifying malware.

The purpose of the present invention is achieved through the following technical solution, which comprises the following steps:

Step 1: Input the software data set to be classified, and divide it into a training set and a test set;

Step 2: Convert the software samples in the training set into binary files; specifically, convert each Windows executable (.exe) file to be analyzed into a binary stream file in .bytes format;

Step 3: Split the binary file into 8-bit bytes and convert each byte into a gray value; the conversion scheme maps byte values onto the range 0 to 255, where 0 represents black and 255 represents white. Then arrange the gray values sequentially into a two-dimensional gray matrix, determining the width and height of the matrix from the file size, so that the file can be visualized as a grayscale image;

Step 4: Train a convolutional neural network model on the generated grayscale images to produce a static classifier. The convolutional neural network model comprises an input layer, convolutional layers, max pooling layers, a spatial pyramid pooling layer, and an output layer. The grayscale images are processed with small-window convolution filters: every convolutional layer uses a 3×3 convolution kernel with a stride of 1, and the input feature map of each convolutional layer is padded by 1 pixel at the edges. Max pooling uses a 2×2 sliding window with a stride of 2. The last pooling layer is a 3-level spatial pyramid pooling layer, which accepts feature maps of any size and produces an output of uniform size. After each pooling layer, the network applies dropout with probability 0.5 to prevent overfitting. The network uses the Leaky ReLU activation function, uniformly distributed weight initialization, and batch normalization;

Step 5: Input the test set of the software data set to be classified into the static classifier, and determine the family to which each malware sample belongs from the classifier's output, which completes the classification of the malware.
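Step 1 above leaves the split ratio open. A minimal sketch, assuming a shuffled 80/20 split with a fixed seed; the ratio, the seed, the function name `split_dataset`, and the sample names are all hypothetical and do not come from the patent:

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle labelled samples and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A private Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (file name, family label) pairs, for illustration only.
samples = [(f"sample_{i}.exe", f"family_{i % 3}") for i in range(10)]
train_set, test_set = split_dataset(samples)
```

In practice a split stratified by family may be preferable, so that small families appear in both sets; the patent does not say which approach is used.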

The beneficial effects of the present invention are as follows:

The present invention uses the grayscale image as the feature and classifies it with a convolutional neural network that has a spatial pyramid pooling layer, which effectively reduces the information loss caused by the image preprocessing stage. The invention has a lower time cost in malware detection and offers fast detection and high detection efficiency.

Description of the Drawings

FIG. 1 is a flowchart of the present invention.

FIG. 2 is a flowchart of generating a malware grayscale image in the present invention.

FIG. 3 is a structural diagram of the convolutional neural network in the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings.

The present invention provides a malware classification system based on static analysis and belongs to the field of computer security. Malware is converted into binary files, from which grayscale images are generated; a convolutional neural network model with a spatial pyramid pooling layer is trained on the grayscale images to obtain a static classifier; and the static classifier assigns each malware sample to its family. The invention uses the grayscale image as the feature for classifying malware and effectively reduces the information loss caused by the image preprocessing stage. The purpose of the invention is to classify malware by analyzing its contour features and to help professionals reduce the cost of identifying malware.

(1) Using static analysis, the classification system converts malware samples into binary files for processing.

(2) The classification system splits the binary file into blocks of 8 bytes each, converts the resulting stream into gray values from 0 to 255 stored in a one-dimensional gray array, converts that one-dimensional vector into a two-dimensional gray matrix, and then generates a grayscale image.

(3) The classification system trains a convolutional neural network model on the generated grayscale images to produce a static classifier. The convolutional neural network model comprises an input layer, convolutional layers, max pooling layers, a spatial pyramid pooling layer, and an output layer.

(4) The classification system inputs the malware samples in the test set into the static classifier and determines the family of each sample from the classifier's output.

Converting malware into binary files in the classification system further includes:

The malware sample data is preprocessed: each Windows executable (.exe) file to be analyzed is converted into a binary stream file in .bytes format.
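The patent does not define the .bytes layout. The sketch below assumes a hex-dump text layout, with an 8-digit address followed by up to 16 two-digit hexadecimal byte values per line; that layout and the function names `exe_to_bytes_text` and `bytes_text_to_raw` are assumptions made for illustration:

```python
def exe_to_bytes_text(data: bytes, bytes_per_line: int = 16) -> str:
    """Render raw executable bytes as hex-dump text (assumed .bytes layout)."""
    lines = []
    for offset in range(0, len(data), bytes_per_line):
        chunk = data[offset:offset + bytes_per_line]
        hex_values = " ".join(f"{b:02X}" for b in chunk)
        lines.append(f"{offset:08X} {hex_values}")
    return "\n".join(lines)


def bytes_text_to_raw(text: str) -> bytes:
    """Parse the hex-dump text back into the original byte stream."""
    raw = bytearray()
    for line in text.splitlines():
        fields = line.split()
        # The first field is the address column, not file content.
        raw.extend(int(value, 16) for value in fields[1:])
    return bytes(raw)


# First bytes of a typical PE header, used only as sample input.
header = b"MZ\x90\x00\x03"
dump = exe_to_bytes_text(header)
```

Because the dump is a lossless rendering of the file contents, the later image-generation step can work from either the raw bytes or the parsed text.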

Generating grayscale images in the classification system further includes:

The classification system treats the malware as a binary file, splits it into 8-bit bytes, and converts each byte into a gray value; the conversion scheme maps byte values from 0 (black) to 255 (white). The gray values are then arranged sequentially into a two-dimensional gray matrix, whose width and height are determined from the size of the malicious code file, so that the file can be visualized as a grayscale image.
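The byte-to-matrix conversion can be sketched as follows. The patent only says that the width and height depend on the file size; the threshold table `WIDTH_TABLE`, the default width, and the zero padding of the last row are illustrative assumptions, not values taken from the patent:

```python
# Hypothetical mapping: files smaller than limit_kb use the given width.
WIDTH_TABLE = [(10, 32), (30, 64), (60, 128), (100, 256),
               (200, 384), (500, 512), (1000, 768)]
DEFAULT_WIDTH = 1024  # assumed width for the largest files


def choose_width(n_bytes: int) -> int:
    """Pick an image width from the file size in kilobytes."""
    size_kb = n_bytes / 1024
    for limit_kb, width in WIDTH_TABLE:
        if size_kb < limit_kb:
            return width
    return DEFAULT_WIDTH


def bytes_to_gray_matrix(data: bytes, width=None):
    """Arrange bytes row by row into a 2D matrix of gray values (0..255)."""
    if not data:
        raise ValueError("cannot build an image from an empty file")
    width = width or choose_width(len(data))
    height = -(-len(data) // width)  # ceiling division
    # Assumption: pad the final row with 0 (black) so the matrix is rectangular.
    padded = data + bytes(width * height - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


matrix = bytes_to_gray_matrix(bytes([0, 255, 16, 32, 64]), width=2)
# matrix is [[0, 255], [16, 32], [64, 0]]
```

Each byte already lies in the range 0 to 255, so no rescaling is needed; the value is used directly as the pixel intensity.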

The convolutional neural network used in the classification system further includes:

The classification system processes the grayscale images with a convolutional neural network that uses small-window convolution filters. Every convolutional layer uses a 3×3 convolution kernel with a stride of 1, and the input feature map of each convolutional layer is padded by 1 pixel at the edges. Max pooling uses a 2×2 sliding window with a stride of 2. The last pooling layer is a 3-level spatial pyramid pooling layer, which accepts feature maps of any size and produces an output of uniform size.
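To see why a layer with uniform output is needed, it helps to trace feature-map sizes through the stated hyperparameters. The sketch below uses the standard output-size formula for convolution and pooling; the number of convolution-plus-pooling blocks (`n_blocks`) is not given in the patent and is a hypothetical parameter:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output length of max pooling along one dimension (no padding)."""
    return (size - kernel) // stride + 1


def trace_shape(height: int, width: int, n_blocks: int):
    """Feature-map size after n_blocks of (3x3 conv, 2x2 max pool)."""
    for _ in range(n_blocks):
        height, width = conv_out(height), conv_out(width)
        height, width = pool_out(height), pool_out(width)
    return height, width


small = trace_shape(64, 32, n_blocks=3)    # (8, 4)
large = trace_shape(100, 64, n_blocks=2)   # (25, 16)
```

The 3×3 kernel with stride 1 and 1-pixel padding leaves the size unchanged, and each 2×2 pooling step halves it (rounding down). Images of different sizes therefore reach the end of the network with different feature-map sizes, which is the situation the spatial pyramid pooling layer handles.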

Optimization of the convolutional neural network further includes:

After each pooling layer, the convolutional neural network applies dropout with probability 0.5 to prevent overfitting. It also uses the Leaky ReLU activation function, uniformly distributed weight initialization, and batch normalization.
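These four techniques can be illustrated on plain Python lists. The patent fixes only the dropout probability of 0.5; the negative slope of 0.01, the Glorot-style bound for the uniform initialization, the inverted-dropout scaling, and the omission of learnable scale and shift in batch normalization are assumptions of this sketch:

```python
import random


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Pass positive inputs through; scale negative inputs by a small slope."""
    return x if x > 0 else negative_slope * x


def dropout(values, p: float = 0.5, rng=random, training: bool = True):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def uniform_init(fan_in: int, fan_out: int, rng=random):
    """Draw a fan_out x fan_in weight matrix from a symmetric uniform range."""
    limit = (6.0 / (fan_in + fan_out)) ** 0.5  # assumed Glorot-style bound
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def batch_norm(batch, eps: float = 1e-5):
    """Normalize a batch of scalars to zero mean and (nearly) unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / (var + eps) ** 0.5 for v in batch]


activated = [leaky_relu(v) for v in [-2.0, 0.0, 3.0]]
dropped = dropout([1.0] * 8, rng=random.Random(0))
weights = uniform_init(fan_in=4, fan_out=2, rng=random.Random(0))
normalized = batch_norm([1.0, 2.0, 3.0])
```

At inference time `dropout` is called with `training=False` and returns its input unchanged, which is why the kept activations are rescaled during training.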

Classification by the static classifier in the classification system further includes:

The class that the convolutional neural network model assigns to each malware sample is the classification result of the static classifier.
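Turning the network output into a family label can be sketched as below. The patent does not describe the output layer in detail; the softmax normalization, the function names, and the family labels are assumptions for illustration:

```python
import math


def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_family(scores, families):
    """Return the most probable family and its probability."""
    if len(scores) != len(families):
        raise ValueError("one score is required per family")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return families[best], probs[best]


# Hypothetical family labels and output scores for one sample.
families = ["family_a", "family_b", "family_c"]
family, confidence = predict_family([0.2, 2.5, -1.0], families)
```

Here the sample is assigned to `family_b`, the class with the highest score.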

The present invention provides a malware classification system based on static analysis that uses grayscale images as the feature for classifying malware. The purpose of the invention is to classify malware by analyzing its contour features and to help professionals reduce the cost of identifying malware.

Compared with the prior art, the advantages of the present invention are:

1. The present invention proposes a malware classification system based on static analysis that uses grayscale images as the feature and classifies them with a convolutional neural network that has a spatial pyramid pooling layer, which effectively reduces the information loss caused by the image preprocessing stage.

2. The present invention proposes a malware classification system based on static analysis that has a lower time cost in malware detection.

3. The present invention proposes a malware classification system based on static analysis that offers fast detection and high detection efficiency.

FIG. 1 is a flowchart of the malware classification system based on static analysis provided by the present invention. The present invention includes the following four aspects.

(2) The classification system splits the binary file into blocks of 8 bytes each, converts the resulting stream into gray values from 0 to 255 stored in a one-dimensional gray array, converts that one-dimensional vector into a two-dimensional gray matrix, and then generates a grayscale image.

With reference to the grayscale image generation flowchart in FIG. 2, the classification system treats the malware as a binary file, splits it into 8-bit bytes, and converts each byte into a gray value; the conversion scheme maps byte values from 0 (black) to 255 (white). The gray values are then arranged sequentially into a two-dimensional gray matrix, whose width and height are determined from the size of the malicious code file, so that the file can be visualized as a grayscale image.
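The final visualization step can be sketched by serializing a gray matrix to an image file. The patent does not name an image format; binary PGM (Netpbm "P5") is chosen here only because it stores 8-bit gray values directly, with 0 as black and the maximum value as white, and the function name `gray_matrix_to_pgm` is hypothetical:

```python
def gray_matrix_to_pgm(matrix) -> bytes:
    """Serialize a rectangular matrix of 0..255 gray values as binary PGM."""
    height = len(matrix)
    width = len(matrix[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("matrix must contain at least one pixel")
    if any(len(row) != width for row in matrix):
        raise ValueError("all rows must have the same width")
    # Header: magic number, width and height, then the maximum gray value.
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    # bytes() rejects values outside 0..255, which doubles as range checking.
    pixels = bytes(value for row in matrix for value in row)
    return header + pixels


pgm = gray_matrix_to_pgm([[0, 255], [128, 64]])
```

The returned bytes can be written to a file opened in binary mode and viewed with any tool that reads PGM images.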

With reference to FIG. 3, the classification system processes the grayscale images with a convolutional neural network that uses small-window convolution filters. Every convolutional layer uses a 3×3 convolution kernel with a stride of 1, and the input feature map of each convolutional layer is padded by 1 pixel at the edges. Max pooling uses a 2×2 sliding window with a stride of 2. The last pooling layer is a 3-level spatial pyramid pooling layer, which accepts feature maps of any size and produces an output of uniform size.
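How spatial pyramid pooling turns a feature map of any size into a fixed-length vector can be sketched for a single channel. The patent specifies 3 pyramid levels but not their grid sizes; the 1×1, 2×2, and 4×4 grids used here (21 bins in total) are an assumption, as is the function name `spp_channel`:

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def spp_channel(feature_map, levels=(1, 2, 4)):
    """Max-pool one channel over an n x n grid for each pyramid level."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i; bins may overlap so none is ever empty.
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small_map = [[1, 2], [3, 4]]
tall_map = [[r * 3 + c for c in range(3)] for r in range(5)]
small_vec = spp_channel(small_map)   # 21 values from a 2 x 2 map
tall_vec = spp_channel(tall_map)     # 21 values from a 5 x 3 map
```

Both inputs yield 1 + 4 + 16 = 21 values per channel, so the layers after the pyramid always receive a vector of the same length regardless of the image size.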

The above describes only preferred embodiments of the present invention and is not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A malware classification method based on static analysis, characterized by comprising the following steps:

step 1: inputting a software data set to be classified, and dividing the software data set to be classified into a training set and a test set;

step 2: converting the software samples in the training set into binary files, wherein the conversion method specifically comprises: converting each Windows executable (.exe) file to be analyzed into a binary stream file in .bytes format;

step 3: splitting the binary file into 8-bit bytes and converting each 8-bit byte into a gray value, the conversion scheme mapping byte values from 0 to 255, where 0 represents black and 255 represents white; then arranging the gray values sequentially into a two-dimensional gray matrix, and determining the width and height of the two-dimensional gray matrix according to the file size, so as to visualize the matrix as a grayscale image;

step 4: training a convolutional neural network model with the generated grayscale images to generate a static classifier; the convolutional neural network model comprises an input layer, convolutional layers, max pooling layers, a spatial pyramid pooling layer, and an output layer; the grayscale images are processed with small-window convolution filters; every convolutional layer uses a 3×3 convolution kernel with a stride of 1; the input feature map of each convolutional layer is padded by 1 pixel at the edges; max pooling uses a 2×2 sliding window with a stride of 2; the last pooling layer adopts 3-level spatial pyramid pooling, which accepts features of any size and produces a uniform output; after each pooling layer, the convolutional neural network uses dropout with a probability of 0.5 to prevent overfitting; the network uses the Leaky ReLU activation function, uniformly distributed weight initialization, and batch normalization;

step 5: inputting the test set of the software data set to be classified into the static classifier, determining the family to which each malware sample belongs according to the classification result of the classifier, and thereby completing the classification of the malware.