CN104376260A

CN104376260A - Malicious code visualized analyzing method based on Shannon information entropy

Info

Publication number: CN104376260A
Application number: CN201410668073.8A
Authority: CN
Inventors: 任卓君; 孔德凤; 刘同洋; 乔国娟; 冯琪; 陈�光
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2014-11-20
Filing date: 2014-11-20
Publication date: 2015-02-25
Anticipated expiration: 2034-11-20
Also published as: CN104376260B

Abstract

The present invention provides a method for the visual analysis of malicious code based on Shannon information entropy, comprising: Step 1: convert the binary bytes of a malicious file into yellow shade values of the pixels in a "pixel map", and use the green channel value 0x50 to mark the points whose pixel values lie in 0x20-0x7E; Step 2: based on the pixel values of the "pixel map", calculate the local entropy of the pixel values in each 256-byte block according to the Shannon information entropy formula $Entropy = -\sum_{i=0}^{255} p_i \times \log_2 p_i$, where p_i is the probability that byte (pixel) value i occurs, i ranges over 0x00-0xFF, and Entropy is the local entropy; then calculate $f(Entropy) = 2^{Entropy} - 1$ and generate an "entropy map" from the results; Step 3: normalize the results of f(Entropy) to generate an "entropy normalized map". The invention effectively distinguishes samples of different families and, when malicious code of the same family is analyzed, makes potential differences comparatively easy to find, which provides a basis for understanding how the variants of that family evolve.

Description

A Visual Analysis Method of Malicious Code Based on Shannon Information Entropy

Technical Field

The invention relates to a method for the visual analysis of malicious code based on Shannon information entropy.

Background Art

Malware (malicious software) is software used to damage computer operating systems, steal sensitive information, or gain illegal access to private systems, and usually takes the form of code, scripts, active content, or other software. Because traditional malware analysis is often complex and time-consuming, even experienced security analysts find it hard to discover latent attack patterns. Introducing information visualization into malicious code analysis to reduce cognitive load and improve interactivity, that is, malware security visualization, has become a frontier topic in network security research in recent years.

In 2008, Gregory Conti et al. of the United States Military Academy at West Point first proposed the idea of gray-scale images in the visual analysis system they designed (Figure 1), using a text-independent view to identify files quickly and to dissect unknown file formats. As shown in Figure 1, regions d and g of the user interface show the analyzed file as ASCII strings and as hexadecimal-format lines, respectively. The pixel values in region c (Byteview) correspond to the binary values of the bytes in region g, so the internal characteristics of the file appear as a gray-scale image. Region b (Byte Presence) marks, in the row corresponding to each row of region c, the columns of the extended ASCII values (0-255) that are present in that row; this mapping helps the user grasp regularities in the file and spot anomalies. Region f (Dot Plot) uses the byte-sequence matrix of the file itself to compare the similarity between files, giving the user a basis for classification. The remaining regions a, e, and h integrate various auxiliary functions for user interaction.
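The Byte Presence mapping described above can be sketched as follows. This is a minimal illustration only, not code from Conti's system: the function name `byte_presence` and the `row_width` parameter are hypothetical, and the row width of the Byteview is not stated in the source.

```python
def byte_presence(data: bytes, row_width: int = 256) -> list[list[bool]]:
    """Return one 256-entry presence row per Byteview row.

    presence[r][v] is True when byte value v occurs somewhere in row r.
    row_width is an assumed layout parameter, not taken from the source.
    """
    rows = []
    for start in range(0, len(data), row_width):
        present = [False] * 256
        for value in data[start:start + row_width]:
            present[value] = True  # mark the column for this byte value
        rows.append(present)
    return rows
```

Each output row has 256 columns regardless of `row_width`, because the columns index byte values rather than file offsets.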

However, when file similarity is analyzed for classification, the method of Gregory Conti, Erik Dean, Matthew Sinda and Benjamin Sangster, "Visual Reverse Engineering of Binary and Data Files" [C], VizSec 2008 Symposium on Visualization for Cyber Security (VizSEC 2008), requires an amount of computation proportional to the file size, so the degree of automation of the analysis is limited by the performance of the computer hardware. In addition, when presenting the internal characteristics of a file, it displays the region c pixel values that correspond to the bytes separately from region b, which reflects the presence of ASCII values; this separation hinders a comprehensive understanding of the characteristics of the analyzed file.

Summary of the Invention

The object of the present invention is to provide a method for studying and analyzing malicious code more comprehensively.

To achieve this object, the present invention provides a method for the visual analysis of malicious code based on Shannon information entropy, characterized by comprising:

Step 1: Convert the binary bytes of the malicious file into yellow shade values of the pixels in a "pixel map", and use the green channel value 0x50 (the displayed result may differ slightly between hardware devices) to mark the points whose pixel values lie in 0x20-0x7E (the printable ASCII characters);

Step 2: Based on the pixel values of the "pixel map", calculate the local entropy of the pixel values in each 256-byte block of the "pixel map" according to the following Shannon information entropy formula:

$Entropy = -\sum_{i=0}^{255} p_i \times \log_2 p_i$

where p_i is the probability that byte (pixel) value i occurs, i ranges over 0x00-0xFF, and Entropy is the local entropy;

Calculate the value f(Entropy) of the local entropy Entropy, using the formula:

$f(Entropy) = 2^{Entropy} - 1$;

Generate an "entropy map" from the results of f(Entropy);

Step 3: Normalize the results of f(Entropy) to generate an "entropy normalized map".

Preferably, the method for the visual analysis of malicious code based on Shannon information entropy further comprises:

Step 4: Normalize the "pixel map" from Step 1 to generate a "pixel normalized map".
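As a quick check of the mapping in Step 2, the two extreme cases of a 256-byte block can be worked out by hand. The sketch below uses the usual convention that terms with p_i = 0 contribute nothing to the sum, and writes k for the single byte value that fills a constant block (k is notation introduced here for illustration).

```latex
\begin{align*}
\text{constant block: } & p_k = 1,\ p_{i \neq k} = 0
  \;\Rightarrow\; Entropy = -1 \times \log_2 1 = 0,
  \quad f(Entropy) = 2^{0} - 1 = 0 \\
\text{uniform block: } & p_i = \tfrac{1}{256}
  \;\Rightarrow\; Entropy = -\sum_{i=0}^{255} \tfrac{1}{256} \times \log_2 \tfrac{1}{256} = 8,
  \quad f(Entropy) = 2^{8} - 1 = 255
\end{align*}
```

The local entropy therefore spans [0, 8], and f(Entropy) spans [0, 255], which matches the range of an 8-bit brightness value.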

Drawing on the idea of the gray-scale image, combining it with the definition of Shannon information entropy, and using the K-Nearest Neighbor (KNN) classification algorithm, the present invention provides a new visualization method for studying the classification of malicious code lineages. The specific technical solution is as follows: first, the binary file to be examined is converted into a "pixel map" in yellow and green; on this basis, the local entropy values of the "pixel map" are calculated to generate a blue-green "entropy map"; the "entropy map" is then normalized into a distribution of green dots of varying brightness, namely the "entropy normalized map". By using local entropy calculation and a sliding-window normalization mechanism, the method not only greatly reduces the amount of computation needed for similarity analysis of large files, but also improves the visualization of malicious code family classification.

The present invention is implemented in the Python language and runs under both Windows and Linux.

Compared with existing security visualization techniques, the beneficial effects of the present invention are:

1. Visually, it effectively distinguishes the various malicious code families;

2. When malicious code of the same family is analyzed, potential differences are comparatively easy to find, which provides a basis for understanding how the variants of that family evolve;

3. Combining the three visualizations, the "entropy map", the "entropy normalized map", and the "pixel normalized map", allows the global and local characteristics of malicious code to be studied and analyzed more comprehensively;

4. By establishing image-based communication between people and data, the invention provides more comprehensive information perception per unit of time, which not only improves the efficiency of network security analysts but also lowers the difficulty of analysis and the demands on the analyst's skill and experience;

5. The invention is simple to implement and suitable for automation; because it uses a dimension-reducing mapping and a fast similarity comparison algorithm, image generation is cheap and similarity comparison is efficient.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the visualization system designed by Gregory Conti;

Figure 2 is an example of the display produced by converting a binary file into a "pixel map";

Figure 3 is an example of a "pixel map" converted into an "entropy map" by local entropy calculation;

Figure 4a is an example of an "entropy map" converted into an "entropy normalized map" by normalization;

Figure 4b shows the mapping between local entropy values and pixel values;

Figure 5 is an example of a "pixel map" converted into a "pixel normalized map" by normalization;

Figure 6 is the "entropy map" of the Email-Worm.joleee.av sample;

Figure 7 is the "entropy map" of the Email-Worm.joleee.aw sample;

Figure 8 is the "entropy map" of the Email-Worm.joleee.ba sample.

Detailed Description of the Embodiments

To make the present invention easier to understand, a preferred embodiment is described in detail below with reference to the accompanying drawings. (The 473 malicious samples from 59 families used to test the invention were all downloaded from the official VX Heavens website, and all samples follow the Kaspersky naming convention.)

Example 1

A method for the visual analysis of malicious code based on Shannon information entropy, specifically:

Step 1: Convert the binary bytes of the malicious file (Trojan.Regrun.rk) into yellow shade values of the pixels in a "pixel map", and use the green channel value 0x50 (the displayed result may differ slightly between hardware devices) to mark the points whose pixel values lie in 0x20-0x7E (the printable ASCII characters). The result is shown in Figure 2, where the black areas are the background color, that is, bytes with the value 0.
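Step 1 can be sketched as below. The source does not give exact RGB values, so this is one possible reading and not the patented implementation: it assumes a "yellow shade" means equal red and green with no blue, that printable bytes are drawn with only the green channel set to 0x50, and that the image is 256 pixels wide. The names `byte_to_rgb`, `pixel_map`, and `width` are hypothetical.

```python
def byte_to_rgb(value: int) -> tuple[int, int, int]:
    """Map one byte to an (R, G, B) pixel under the assumed colour scheme."""
    if 0x20 <= value <= 0x7E:
        # Printable ASCII: marked with green channel 0x50 (assumed: no red, no blue).
        return (0, 0x50, 0)
    # Any other byte: yellow shade, i.e. equal red and green, no blue (assumption).
    return (value, value, 0)


def pixel_map(data: bytes, width: int = 256) -> list[list[tuple[int, int, int]]]:
    """Lay the pixels out row by row, padding the last row with black."""
    pixels = [byte_to_rgb(b) for b in data]
    pad = (-len(pixels)) % width  # pixels needed to complete the last row
    pixels.extend([(0, 0, 0)] * pad)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

Under these assumptions a zero byte maps to (0, 0, 0), which agrees with the statement that black background corresponds to bytes of value 0.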

Step 2: Based on the pixel values of the "pixel map", calculate the local entropy of the pixel values in each 256-byte block of the "pixel map" according to the following Shannon information entropy formula:

$Entropy = -\sum_{i=0}^{255} p_i \times \log_2 p_i$

The local entropy value Entropy normally lies in [0, 8]. If it were output directly as a brightness, the image would be too dark to be useful. To form an image that maps onto the pixel brightness range [0, 255], the local entropy is therefore passed through the function $f(Entropy) = 2^{Entropy} - 1$ before output, which makes high-entropy regions stand out more clearly. The "entropy map" generated from the results of f(Entropy) is shown in Figure 3; the invention displays the "entropy map" of the analyzed file in a blue-green color scheme. Here too, the black areas are the background color, that is, an entropy value of 0.
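The local entropy calculation and the brightness mapping of Step 2 can be sketched as follows. This is a minimal sketch, assuming non-overlapping 256-byte blocks, probabilities taken over the actual length of a short final block, and rounding of f(Entropy) to the nearest integer; none of these details are stated in the source, and the function names are hypothetical.

```python
import math
from collections import Counter

BLOCK_SIZE = 256  # block length given in the source


def block_entropy(block: bytes) -> float:
    """Shannon entropy in bits of the byte values in one non-empty block."""
    total = len(block)
    entropy = 0.0
    for count in Counter(block).values():
        p = count / total
        entropy -= p * math.log2(p)  # absent values (p = 0) are never visited
    return entropy


def entropy_to_brightness(entropy: float) -> int:
    """Apply f(Entropy) = 2^Entropy - 1 and clamp to the 8-bit range."""
    return min(255, max(0, round(2 ** entropy - 1)))


def entropy_values(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """One brightness value per consecutive block (assumed non-overlapping)."""
    return [
        entropy_to_brightness(block_entropy(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    ]
```

A block of identical bytes yields brightness 0 and a block containing every byte value once yields 255, so the output spans the full brightness range.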

Step 3: Similarity analysis needs a uniform standard of comparison, but the malicious code files being analyzed differ in size, so the results of f(Entropy) must be normalized to generate an "entropy normalized map". The normalization algorithm proposed by the invention uses a sliding window that is 2 bytes wide and moves 1 byte at a time. The first and second bytes in each window give the position coordinates (x, y) of a point in the "entropy normalized map", and the brightness of that point is proportional to the number of times the combination (x, y) occurs. The "entropy normalized map" is displayed as a 256*256 square image rendered in pure green, to distinguish it from the color scheme of the "entropy map", as shown in Figure 4a. The mapping between local entropy values and pixel values is shown in Figure 4b.
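The sliding-window normalization can be sketched as below. The source states only that brightness is proportional to the pair count, so scaling by the largest count, the integer rounding, and the `[y][x]` indexing are assumptions; `normalized_map` is a hypothetical name, and rendering in green is left to the display code.

```python
def normalized_map(values, size: int = 256) -> list[list[int]]:
    """Build a size x size brightness grid from consecutive value pairs.

    values: a list or bytes object of integers in 0..size-1, for example the
    f(Entropy) results. Each window of two consecutive values gives (x, y).
    """
    counts = [[0] * size for _ in range(size)]
    for x, y in zip(values, values[1:]):  # window of 2, step of 1
        counts[y][x] += 1                 # row = y, column = x (assumed)
    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts  # fewer than two values: empty (black) map
    # Scale so the most frequent pair gets brightness 255 (assumed scaling).
    return [[255 * c // peak for c in row] for row in counts]
```

Because the output is always 256 by 256 whatever the input length, maps from files of different sizes can be compared cell by cell.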

Step 4: For the same reason, the invention also provides a normalized display of the "pixel map", using the same normalization algorithm as for the "entropy map", as shown in Figure 5. The "pixel normalized map" is rendered in pure yellow, to distinguish it from the color scheme of the "pixel map".

Using the KNN classification algorithm with no classification known in advance, the invention correctly distinguishes the 59 classes in the downloaded samples. Checked against prior knowledge of the samples, only 28 of 437 samples were assigned to the wrong class, an average classification accuracy of 93.59%.
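A nearest-neighbor classification of this kind can be sketched as follows. The source names KNN but does not give the feature vectors, the distance measure, or the value of k, so the flattened-map features, Euclidean distance, and k = 3 used here are assumptions, and all function names are hypothetical. The `accuracy` helper simply reproduces the reported arithmetic, 1 - 28/437 ≈ 93.59%.

```python
import math
from collections import Counter


def euclidean(a, b) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train, query, k: int = 3):
    """Majority vote among the k training items closest to the query.

    train: list of (feature_vector, family_label) pairs, where a feature
    vector could be a flattened normalized map (an assumption).
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(errors: int, total: int) -> float:
    """Fraction of samples classified correctly."""
    return 1 - errors / total
```

For example, `round(accuracy(28, 437) * 100, 2)` evaluates to 93.59, matching the figure reported above.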

In the present invention, the time to generate an "entropy normalized map" is proportional to the size of the sample file, with an average generation time of 0.91 ms; the time to compare the similarity of two "entropy normalized maps" is proportional to the number of local entropy blocks, with an average comparison time of 0.56 ms. Each timing is the mean of 100 measurements. These results indicate that classification and comparison with the invention are time-efficient and that the Python program structure is soundly designed.
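Averaging a timing over 100 runs can be sketched as below. The source does not say how its measurements were taken, so this is only one way to obtain a comparable mean; `mean_time_ms` is a hypothetical helper and `time.perf_counter` is a choice made here, not something named in the source.

```python
import time


def mean_time_ms(func, *args, repeats: int = 100) -> float:
    """Mean wall-clock time of func(*args) in milliseconds over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats
```

Calling `mean_time_ms(some_function, some_input)` with a map-generation or comparison function returns a figure in the same units as those reported above, though the absolute values depend on the hardware.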

Example 2

The method of Example 1 was used to analyze the malicious samples Email-Worm.joleee.av, Email-Worm.joleee.aw, and Email-Worm.joleee.ba; the resulting "entropy maps" are shown in Figures 6-8. When malicious code of the same family is analyzed, the invention makes potential differences comparatively easy to find, which provides a basis for understanding how the variants of that family evolve.

Claims

1. A method for the visual analysis of malicious code based on Shannon information entropy, characterized by comprising:

Step 1: Convert the binary bytes of the malicious file into the yellow shades of the pixels in the "pixel map", and use the green channel 0x50 to mark the points with pixel values 0x20-0x7E;

Step 2: Calculate the local entropy of the pixel values in each 256-byte block in the "pixel map" based on the pixel values of the "pixel map", and the local entropy is calculated according to the following Shannon information entropy formula:

$Entropy = -\sum_{i=0}^{255} p_i \times \log_2 p_i$

where p_i is the probability that byte value i occurs, i ranges over 0x00-0xFF, and Entropy is the local entropy;

Calculate the f(Entropy) value of the local entropy value Entropy, and its calculation formula is:

$f(Entropy) = 2^{Entropy} - 1$;

Generate an "entropy map" from the calculation result of f(Entropy);

Step 3: Normalize the calculation result of f(Entropy) to generate an "entropy normalized map".

2. The method for the visual analysis of malicious code based on Shannon information entropy as claimed in claim 1, characterized in that the method further comprises:

Step 4: Normalize the "pixel map" from Step 1 to generate a "pixel normalized map".