CN116644422A - Malicious code detection method based on malicious block labeling and image processing - Google Patents
Malicious code detection method based on malicious block labeling and image processing
- Publication number
- CN116644422A (application CN202310606050.3A)
- Authority
- CN
- China
- Prior art keywords
- malicious
- block
- basic
- code
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a malicious code detection method based on malicious block labeling and image processing, belonging to the field of malicious code detection and comprising the following steps: (S1) dividing the binary file of the malicious code to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and marking the positions of the malicious blocks in the binary file, a malicious block being a basic block related to a malicious function; (S2) converting the binary file into a grayscale image and increasing the local contrast of the partial images corresponding to the malicious blocks to obtain a target grayscale image; (S3) inputting the target grayscale image into a trained malicious code classification model to predict the probability that the malicious code belongs to each family class, and determining the family class with the highest probability as the family class to which the malicious code belongs. The invention enhances the influence of content related to malicious functions on the classification result, thereby improving the accuracy of malicious code classification.
Description
Technical Field
The invention belongs to the field of malicious code detection, and particularly relates to a malicious code detection method based on malicious block labeling and image processing.
Background
The network security industry strives constantly to prevent and remediate attacks by malicious code. An attacker can use malicious code to infect a victim's device in order to destroy the confidentiality and integrity of user and enterprise data. Accurately detecting malicious code and taking corresponding countermeasures is therefore of great significance for guaranteeing network security.
Traditionally, malware detection and classification have been performed with signature-based or heuristic methods. Signature-based methods deploy signatures for different malware families and variants as prototypes, allowing a newly discovered malware file to be classified into the corresponding family so that countermeasures suited to that family's characteristics can be taken. Later, Nataraj et al. introduced a static malware analysis technique called malware visualization, which represents the contents of a malware binary file as an image: the raw bytes of the binary are read as 8-bit unsigned integers and stored in a vector, the vector is reshaped into a matrix, and the matrix is visualized as a grayscale image.
The malware-visualization analysis method effectively addresses the malware classification problem. However, the malicious functions of a piece of malware are nested among other, non-malicious functions; that is, a considerable part of the malware's content is unrelated to its malicious function. When the grayscale image converted from the entire malware binary file is classified directly, this unrelated content influences the overall classification result, and the final classification accuracy cannot be guaranteed.
Disclosure of Invention
Aiming at the above defects and improvement needs of the prior art, the invention provides a malicious code detection method based on malicious block labeling and image processing, which aims to enhance the influence of content related to malicious functions in malicious code on the classification result, thereby improving the accuracy of malicious code classification and facilitating the correct understanding and analysis of unknown malicious code.
To achieve the above object, according to one aspect of the present invention, there is provided a malicious code detection method based on malicious block labeling and image processing, comprising the following steps:
the method comprises the steps of (S1) dividing a binary file of malicious codes to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and marking the position of the malicious block in the binary file; malicious blocks are basic blocks related to malicious functions;
(S2) converting the binary file into a grayscale image, and increasing the local contrast of the partial images corresponding to the malicious blocks in the grayscale image to obtain a target grayscale image;
(S3) inputting the target grayscale image into a trained malicious code classification model to predict the probability that the malicious code belongs to each family class, and determining the family class with the highest probability as the family class to which the malicious code belongs;
the malicious code classification model is a neural network model used for predicting the probability that the malicious code corresponding to an input grayscale image belongs to each family class.
Further, in step (S1), for any basic block, it is detected whether it is a malicious block, in a manner including:
extracting the code features of the basic block and converting them into a feature vector; the code features include structural features, arithmetic instruction features, transfer instruction features, and API call features;
inputting the feature vector into a trained malicious block detection model, which performs feature extraction and reconstruction on the feature vector to obtain a reconstructed feature;
if the difference between the reconstructed feature output by the malicious block detection model and the input feature vector is greater than a preset threshold, the basic block is judged to be a malicious block; otherwise, it is judged not to be a malicious block;
the malicious block detection model is a neural network model and is used for extracting and reconstructing characteristics of an input basic block, and the training mode comprises the following steps:
collecting a binary file irrelevant to malicious functions, dividing the binary file into basic blocks, and extracting code features of the basic blocks as benign samples to obtain a benign sample set;
initializing a malicious block detection model, training the malicious block detection model by using a benign sample set with the aim of minimizing reconstruction loss, and obtaining a trained malicious block detection model after training is finished.
Further, the structural features include the number of child nodes and intermediate values of the basic block; the arithmetic instruction features include the numbers of basic mathematical instructions, shift instructions, and logical operations contained in the basic block; the transfer instruction features include the numbers of stack operations, register operations, and port operations within the basic block; and the API call features include the numbers of calls within the basic block to APIs related to DLLs, processes, services, and system information.
Further, the malicious block detection model is an autoencoder model.
Further, in step (S2), the local contrast of the partial images corresponding to the malicious blocks in the grayscale image is improved with the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm.
Further, the malicious code classification model is a Vision Transformer model.
Further, in step (S2), converting the binary file into a grayscale image includes:
reading the binary file in 8-bit units in code order, converting each unit into an unsigned integer, and storing the values in an unsigned integer vector;
converting the unsigned integer vector into a matrix, taking each element of the matrix as a pixel and the element's value as that pixel's gray value, to obtain the grayscale image.
According to still another aspect of the present invention, there is provided a computer-readable storage medium comprising a stored computer program; when the computer program is executed by a processor, the device on which the computer-readable storage medium resides is controlled to execute the malicious code detection method based on malicious block labeling and image processing described above.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention divides the binary file of the malicious code to be detected into basic blocks, detects which of them are malicious blocks, i.e. basic blocks related to malicious functions, and marks their positions within the binary file, thereby locating the malicious blocks. During subsequent visualization-based classification, the partial images corresponding to the malicious blocks in the grayscale image converted from the binary file are processed, based on the position-labeling results, to increase their local contrast. This processing effectively raises the weight of malicious-block content in the classification result, weakens the influence of content unrelated to malicious functions, and thus improves the accuracy of malicious code detection.
(2) In the preferred scheme of the invention, a neural network model is used as the malicious block detection model to extract and reconstruct the features of an input basic block, which realizes anomaly detection. Because the model is trained only on benign samples unrelated to malicious functions, its input-output difference is small for benign basic blocks and large for malicious blocks, so malicious blocks in binary code can be accurately identified and located.
(3) In a preferred embodiment of the present invention, the code features extracted from a basic block for malicious block detection comprise structural features, arithmetic instruction features, transfer instruction features, and API call features. The structural features include the number of child nodes and intermediate values of the basic block; the arithmetic instruction features include the numbers of basic mathematical instructions, shift instructions, and logical operations contained in the basic block; the transfer instruction features include the numbers of stack operations, register operations, and port operations within the basic block; and the API call features include the numbers of calls within the basic block to APIs related to DLLs, processes, services, and system information. Together, these features fully and accurately reflect the function implemented by the basic block.
(4) In the preferred scheme of the invention, the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm is used to increase the local contrast of the partial images corresponding to the malicious blocks in the grayscale image, which suppresses noise amplification while improving local contrast.
(5) In the preferred scheme of the invention, a Vision Transformer model implements the malicious code classification model. The model divides the input image into sub-blocks, forms them into a sequence of linear embeddings, and feeds that sequence to a Transformer in the same way token sequences are fed in NLP. This model classifies the grayscale images converted from malicious code binary files with good accuracy.
Drawings
FIG. 1 is a flowchart of a malicious code detection method based on malicious block labeling and image processing provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of image generation, image processing, model training, and model verification according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a malicious block detection model according to an embodiment of the present invention;
FIG. 4 is a diagram of applying Contrast Limited Adaptive Histogram Equalization to grayscale images according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a bilinear interpolation method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In the present invention, the terms "first," "second," and the like in the description and in the drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
In order to solve the technical problems that the classification results of existing malicious code detection methods are disturbed by content unrelated to malicious functions and that classification accuracy is therefore low, the invention provides a malicious code detection method based on malicious block labeling and image processing. The overall idea is: locate and mark the malicious blocks in the binary file of the malicious code to be detected, then increase the local contrast of the partial images corresponding to those blocks in the grayscale image converted from the binary file, so that content related to malicious functions carries more weight in the classification result, the influence of unrelated content is weakened, and classification accuracy is improved.
The following are examples.
Example 1:
a malicious code detection method based on malicious block annotation and image processing is shown in fig. 1 and 2, and comprises the following steps:
the method comprises the steps of (S1) dividing a binary file of malicious codes to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and marking the position of the malicious block in the binary file; malicious blocks are basic blocks related to malicious functions.
A basic block is a sequence of instructions executed sequentially, with a single entry and a single exit. By dividing the binary file into basic blocks and judging whether each basic block is related to a malicious function, this embodiment effectively locates and labels the malicious blocks.
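As an illustrative aside (not the exact procedure of the invention), leader-based basic-block partitioning over a toy instruction list can be sketched as follows; the `(opcode, jump_target)` tuple encoding and the `jmp`/`jcc` opcodes are assumptions of the sketch:

```python
def split_basic_blocks(instrs):
    """Leader-based basic-block splitting: a new block starts at the
    program entry, at every jump target, and immediately after every
    jump or branch instruction."""
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jmp", "jcc"):        # toy opcodes standing in for real branches
            leaders.add(target)
            leaders.add(i + 1)
    leaders = sorted(l for l in leaders if l < len(instrs))
    bounds = leaders + [len(instrs)]
    return [instrs[a:b] for a, b in zip(bounds, bounds[1:])]
```

Each returned slice is one single-entry, single-exit instruction sequence.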
Optionally, in step (S1) of the present embodiment, for any basic block, whether it is a malicious block is detected by a method including:
extracting the code features of the basic block and converting them into a feature vector; the code features include structural features, arithmetic instruction features, transfer instruction features, and API call features;
inputting the feature vector into a trained malicious block detection model, which performs feature extraction and reconstruction on the feature vector to obtain a reconstructed feature;
if the difference between the reconstructed feature output by the malicious block detection model and the input feature vector is greater than a preset threshold, the basic block is judged to be a malicious block; otherwise, it is judged not to be a malicious block;
the malicious block detection model is a neural network model and is used for extracting and reconstructing characteristics of an input basic block, and the training mode comprises the following steps:
collecting a binary file irrelevant to malicious functions, dividing the binary file into basic blocks, and extracting code features of the basic blocks as benign samples to obtain a benign sample set;
initializing a malicious block detection model, training the malicious block detection model by using a benign sample set with the aim of minimizing reconstruction loss, and obtaining a trained malicious block detection model after training is finished.
Because the malicious block detection model extracts and reconstructs the features of an input basic block, it can serve as an anomaly detector. It is trained only on benign samples unrelated to malicious functions, so its input-output difference is small for benign basic blocks and large for malicious blocks, and malicious blocks in binary code can be accurately identified and located on this basis. Optionally, this embodiment implements the malicious block detection model with a U-Net model, which is an autoencoder model; an autoencoder is an artificial neural network used in semi-supervised and unsupervised learning that learns features of its input by taking the input itself as the learning target.
the structure of the U-Net model is shown in FIG. 3, the model contains an encoder g and a decoder f, and when we input an x, we can get an output x' after going through the entire neural network, namely:
f(g(x))=x′
the automatic encoder uses the reconstruction loss x '-x as the loss, and continuously learns to gradually reduce the difference between x and x', so that after learning by using a large number of benign samples, the difference between x and x 'is smaller for benign basic blocks, and the difference between x and x' is larger for malicious basic blocks, so that possible malicious basic blocks can be effectively detected according to the difference.
It is easy to understand that, to guarantee the training effect of the model, this embodiment collects a large number of binary files unrelated to malicious functions to build the benign samples for training the malicious block detection model. After training, malicious samples built from basic blocks of malicious code that implement malicious functions are used to test the trained model, ensuring that its detection accuracy meets the requirement, as shown in FIG. 2. In other embodiments of the invention, the malicious block detection model may also be implemented with other models capable of feature extraction and reconstruction.
In order to accurately identify whether the function implemented by a basic block is malicious, the code features extracted from a basic block in this embodiment comprise four categories. The structural features include the number of child nodes and intermediate values of the basic block; the arithmetic instruction features include the numbers of basic mathematical instructions, shift instructions, and logical operations contained in the basic block; the transfer instruction features include the numbers of stack operations, register operations, and port operations within the basic block; and the API call features include the numbers of calls within the basic block to APIs related to DLLs, processes, services, and system information. These four categories fully and accurately reflect the function implemented by a basic block, and the embodiment takes them as the input of the malicious block detection model to identify whether that function is malicious, thereby accurately completing the detection of malicious blocks. In practical applications, the code features of the basic blocks can be extracted directly with the BinaryNinja tool.
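To make the feature counting concrete, a toy counter over a symbolic instruction list might look as follows; the opcode groupings and tuple encoding are assumptions of this sketch, and real extraction would rely on a disassembler such as BinaryNinja rather than this mapping:

```python
from collections import Counter

# Illustrative opcode groupings (assumptions of the sketch, not the
# patent's exact feature definitions).
ARITHMETIC = {"add", "sub", "mul", "div", "shl", "shr", "and", "or", "xor"}
TRANSFER = {"push", "pop", "mov", "in", "out"}

def block_feature_vector(instructions):
    """Count arithmetic instructions, transfer instructions, and API
    calls in one basic block; instructions are (opcode, *operands)."""
    ops = Counter(ins[0] for ins in instructions)
    return [
        sum(n for op, n in ops.items() if op in ARITHMETIC),
        sum(n for op, n in ops.items() if op in TRANSFER),
        ops.get("call", 0),
    ]
```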
Through step (S1), the present embodiment can accurately complete positioning and labeling of malicious blocks in the binary file, and on this basis, the present embodiment further includes the steps of:
and (S2) converting the binary file into a gray level image, and improving the local contrast of a part of images corresponding to the malicious blocks in the gray level image to obtain a target gray level image.
In this embodiment, the binary file is converted into a grayscale image as follows:
reading the binary file in 8-bit units in code order, converting each unit into an unsigned integer, and storing the values in an unsigned integer vector;
converting the unsigned integer vector into a matrix, taking each element of the matrix as a pixel and the element's value as that pixel's gray value, to obtain the grayscale image.
It is easy to understand that during the conversion each pixel corresponds to an 8-bit unsigned integer in the range 0-255, which is the pixel's gray value, with 0 corresponding to black and 255 to white.
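The byte-to-grayscale conversion described above can be sketched in a few lines; the 16-column default width is an arbitrary illustrative choice:

```python
import numpy as np

def binary_to_grayscale(data: bytes, width: int = 16) -> np.ndarray:
    """Read each byte of the binary as an 8-bit unsigned integer
    (0 = black, 255 = white) and reshape the vector into a matrix
    whose elements are the gray values of the image's pixels."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    rows = len(pixels) // width          # drop any trailing partial row
    return pixels[: rows * width].reshape(rows, width)
```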
According to the labeling result of step (S1), the partial images corresponding to the malicious blocks can be located in the converted grayscale image, and their local contrast can be increased by image processing. As a preferred implementation, this embodiment uses the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm, which suppresses noise amplification while improving local contrast. The CLAHE-based procedure, shown in FIG. 4, comprises the following steps:
(S21) according to the position of the malicious block in the image, extracting the local rectangular image and dividing it into non-overlapping sub-blocks of equal size;
(S22) calculating a sub-block histogram from the image;
(S23) calculating clipLimit from the sub-block histograms obtained in step (S22);
(S24) clipping the pixels that exceed clipLimit in each sub-block's gray histogram and redistributing the clipped amount uniformly over all gray levels;
(S25) reconstructing gray values of the pixel points by using a bilinear interpolation method, and finally realizing histogram equalization.
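Steps (S22)-(S24) for a single sub-block can be sketched as a histogram clip-and-redistribute followed by equalization. This simplified single-tile version is an illustration only (it omits the interpolation of step (S25), and the function name is an assumption); it assumes the tile contains at least two distinct gray levels:

```python
import numpy as np

def clipped_equalization(tile: np.ndarray, clip_limit: int, levels: int = 256):
    """Build the tile's histogram, clip bins above clip_limit, spread
    the clipped excess uniformly over all gray levels, then equalize
    the tile with the clipped cumulative distribution."""
    hist, _ = np.histogram(tile, bins=levels, range=(0, levels))
    excess = np.maximum(hist - clip_limit, 0).sum()
    hist = np.minimum(hist, clip_limit) + excess // levels
    cdf = hist.cumsum()
    cdf_min, cdf_max = cdf[cdf > 0].min(), cdf[-1]
    lut = np.round((cdf - cdf_min) / (cdf_max - cdf_min) * (levels - 1))
    return lut.astype(np.uint8)[tile]      # apply the mapping per pixel
```

In a production pipeline the equivalent operation is available as OpenCV's `cv2.createCLAHE`, which also performs the per-tile interpolation.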
As shown in FIG. 5, for each pixel the abscissa represents the current pixel value and the ordinate the transformed pixel value. After the image is partitioned as described above, each sub-block uses a different gray transformation function during equalization. The procedure is as follows:
1) First, according to its position in the image, the whole image region is divided into three types of regions A, B, and C, which are the corner regions, the edge regions, and the central regions, respectively.
2) Each pixel in the image is examined to determine which type of region it belongs to; pixels in different regions are processed differently.
3) If the pixel belongs to a class A region, no interpolation is performed and the gray transformation function is applied directly:

f(x) = round( (cdf(x) - cdf_min) / (cdf_max - cdf_min) * (L - 1) )

where cdf(x) denotes the cumulative distribution value of pixel value x in the sub-block, cdf_min and cdf_max denote the minimum and maximum values of the sub-block's cumulative distribution, and L denotes the total number of gray levels, typically 256.
4) If the pixel belongs to a class B region, denote the transformation functions of the two adjacent class A regions as f_1 and f_2, and take two points M and N in those two class A regions such that M, N, and the pixel lie on the same horizontal line. Denoting the positions of M and N as x_1 and x_2, a linear interpolation transform is applied to the pixel at position x with gray value v:

f(x) = ((x_2 - x) / (x_2 - x_1)) * f_1(v) + ((x - x_1) / (x_2 - x_1)) * f_2(v)
5) If the pixel point belongs to a class-C region, referring to point P in fig. 4, a bilinear interpolation transform is applied to point P: with f_1, f_2, f_3 and f_4 denoting the transformation functions of the four adjacent class-A regions, and u and t denoting the normalized vertical and horizontal distances of P from the upper-left region center,

f(v) = (1 − u)(1 − t) · f_1(v) + (1 − u)t · f_2(v) + u(1 − t) · f_3(v) + u·t · f_4(v)
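The interpolation of steps 4) and 5) can be sketched as follows. The per-region transforms f1..f4 and the normalized distances are illustrative stand-ins for the quantities defined above; the toy transforms below (identity and doubling) are not from the patent.

```python
# Sketch of steps 4)-5): blend the outputs of neighboring class-A region
# transforms, with normalized distances as weights.

def linear_blend(f1, f2, t, v):
    """Class-B pixel: t in [0,1] is the normalized distance from f1's center."""
    return (1 - t) * f1(v) + t * f2(v)

def bilinear_blend(f1, f2, f3, f4, u, t, v):
    """Class-C pixel: u, t are normalized vertical/horizontal distances."""
    top = (1 - t) * f1(v) + t * f2(v)
    bottom = (1 - t) * f3(v) + t * f4(v)
    return (1 - u) * top + u * bottom

# Toy transforms: identity and doubling
ident, double = (lambda v: v), (lambda v: 2 * v)
print(linear_blend(ident, double, 0.5, 10))                      # 15.0
print(bilinear_blend(ident, double, ident, double, 0.5, 0.5, 10))  # 15.0
```

Because the weights sum to 1, the blended value always lies between the outputs of the surrounding transforms, which is what removes the visible block boundaries between sub-regions.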
After CLAHE processing, the images are unified in size, for example by equal-interval scaling sampling.
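One plausible reading of "equal-interval scaling sampling" is nearest-neighbor resampling at equal strides, sketched below on a toy list-of-lists image; the patent does not fix the exact resampling rule, so this is an assumption.

```python
# Sketch of the size-unification step: nearest-neighbor resampling at
# equal intervals, mapping each output pixel back to a source pixel.

def resize_nearest(img, out_h, out_w):
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
print(resize_nearest(img, 1, 2))  # [[1, 3]]
```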
Through the above step (S2), the binary file of the malicious code is converted into a gray-scale image in which the local contrast of the image portion corresponding to the malicious blocks is effectively improved. On this basis, the embodiment further includes:
(S3) inputting the target gray-scale image into a trained malicious code classification model to predict the probability that the malicious code belongs to each family class, and determining the family class with the highest probability as the family class to which the malicious code belongs; the malicious code classification model is a neural network model used to predict the probability that the malicious code corresponding to an input gray-scale image belongs to each family class;
As a preferred implementation, in this embodiment the malicious code classification model is a Vision Transformer model.
The Transformer is an end-to-end NLP model proposed by the Google team in 2017; it abandons the traditional sequential RNN structure in favor of a self-attention mechanism, which enables parallel training and allows the model to capture global information. The Vision Transformer can be regarded as the image-domain counterpart of the Transformer: the standard Transformer model is migrated to the image field with minimal modification. The Vision Transformer divides an input image into a number of sub-blocks (patches) and arranges them into a linearly embedded sequence, which is then fed to the Transformer as input, analogous to the word sequences of the NLP field. In the application scenario of this embodiment, this model achieves a good classification effect on the gray-scale images converted from the binary files of malicious codes.
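The patch-sequence front end described above can be sketched as follows. This shows only the split-and-flatten step; the learned linear projection, position embeddings, and the Transformer encoder itself are omitted.

```python
# Sketch of the Vision Transformer front end: split a gray-scale image into
# fixed-size patches and flatten each patch into a vector, yielding the
# patch sequence that is fed to the Transformer.

def to_patch_sequence(img, patch):
    h, w = len(img), len(img[0])
    seq = []
    for r in range(0, h, patch):          # patch rows, top to bottom
        for c in range(0, w, patch):      # patch columns, left to right
            flat = [img[r + i][c + j]
                    for i in range(patch) for j in range(patch)]
            seq.append(flat)
    return seq

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
print(to_patch_sequence(img, 2))  # four 2x2 patches in row-major order
```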
It should be noted that other image classification models may be used where classification accuracy meets requirements.
In summary, this embodiment locates the malicious blocks in the malicious code binary file and then improves the local contrast of the corresponding portions of the image, which effectively improves the classification accuracy.
Example 2:
a computer-readable storage medium, comprising: a stored computer program; when the computer program is executed by the processor, the device where the computer readable storage medium is located is controlled to execute the malicious code detection method based on malicious block labeling and image processing provided in the above embodiment 1.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (8)
1. A malicious code detection method based on malicious block labeling and image processing is characterized by comprising the following steps:
the method comprises the steps of (S1) dividing a binary file of malicious codes to be detected into a plurality of basic blocks, detecting whether each basic block is a malicious block, and marking the position of the malicious block in the binary file; the malicious blocks are basic blocks related to malicious functions;
(S2) converting the binary file into a gray level image, and improving the local contrast of a part of images corresponding to malicious blocks in the gray level image to obtain a target gray level image;
(S3) inputting the target gray level graph into a trained malicious code classification model to predict the probability that the malicious code belongs to each family class, and determining the family class with the highest probability as the family class to which the malicious code belongs;
the malicious code classification model is a neural network model and is used for predicting the probability that malicious codes corresponding to an input gray level graph belong to each family class.
2. The malicious code detection method based on malicious block labeling and image processing according to claim 1, wherein in the step (S1), for any one basic block, whether it is a malicious block is detected by:
extracting code features of the basic block and converting the code features into a feature vector; the code features include structural features, arithmetic instruction features, transfer instruction features, and API call features;
inputting the feature vector into a trained malicious block detection model, which performs feature extraction and reconstruction on the feature vector to obtain a reconstructed feature;
if the difference between the reconstructed feature output by the malicious block detection model and the feature vector is greater than a preset threshold, judging that the basic block is a malicious block; otherwise, judging that the basic block is not a malicious block;
the malicious block detection model is a neural network model and is used for extracting and reconstructing characteristics of an input basic block, and the training mode comprises the following steps:
collecting a binary file irrelevant to malicious functions, dividing the binary file into basic blocks, and extracting code features of the basic blocks as benign samples to obtain a benign sample set;
initializing a malicious block detection model, training the malicious block detection model by using the benign sample set with the aim of minimizing reconstruction loss, and obtaining a trained malicious block detection model after training is finished.
3. The malicious code detection method based on malicious block annotation and image processing according to claim 2, wherein the structural features include the number of child nodes and the betweenness value of the basic block; the arithmetic instruction features include the numbers of basic mathematical, shift, and logical operation instructions contained in the basic block; the transfer instruction features include the numbers of stack operations, register operations, and port operations within the basic block; and the API call features include the numbers of calls within the basic block to APIs related to dll, process, service, and system information.
4. The malicious code detection method based on malicious block annotation and image processing according to claim 3, wherein said malicious block detection model is an autoencoder model.
5. The malicious code detection method based on malicious block labeling and image processing according to any one of claims 1 to 4, wherein in the step (S2), the local contrast of the portion of the gray-scale map corresponding to the malicious blocks is improved by a contrast-limited adaptive histogram equalization (CLAHE) algorithm.
6. The malicious code detection method based on malicious block annotation and image processing according to any one of claims 1-4, wherein the malicious code classification model is a Vision Transformer model.
7. The malicious code detection method based on malicious block annotation and image processing according to any one of claims 1 to 4, wherein in the step (S2), converting the binary file into a grayscale image comprises:
converting the code sequence into unsigned integers in units of 8 bits each, and storing these values in an unsigned integer vector;
and converting the unsigned integer vector into a matrix, taking each element in the matrix as a pixel, and taking the numerical value of the element as the gray value of the corresponding pixel to obtain the gray image.
8. A computer-readable storage medium, comprising: a stored computer program; when the computer program is executed by a processor, the device where the computer readable storage medium is located is controlled to execute the malicious code detection method based on malicious block marking and image processing according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310606050.3A CN116644422A (en) | 2023-05-23 | 2023-05-23 | Malicious code detection method based on malicious block labeling and image processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116644422A true CN116644422A (en) | 2023-08-25 |
Family
ID=87614821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310606050.3A Pending CN116644422A (en) | 2023-05-23 | 2023-05-23 | Malicious code detection method based on malicious block labeling and image processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116644422A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117235728A (en) * | 2023-11-16 | 2023-12-15 | 中国电子科技集团公司第十五研究所 | Malicious code gene detection method and device based on fine granularity labeling model |
CN117235728B (en) * | 2023-11-16 | 2024-02-06 | 中国电子科技集团公司第十五研究所 | Malicious code gene detection method and device based on fine granularity labeling model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||