CN117560177A

CN117560177A - Malicious code visual classification method, system, equipment and medium

Info

Publication number: CN117560177A
Application number: CN202311443545.5A
Authority: CN
Inventors: 张瑜; 陈溢爽; 石元泉; 陈桂宏; 彭景慧; 陈艺芳; 王春安
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2023-11-02
Filing date: 2023-11-02
Publication date: 2024-02-13

Abstract

The application relates to the technical field of network information security, and in particular to a malicious code visual classification method, system, equipment and medium. The technical scheme is as follows: extracting API sequences of a plurality of malicious code samples to construct an API data set; calculating a word vector two-dimensional matrix from the API data set, and inputting the word vector two-dimensional matrix into a first channel; converting the API sequences of the API data set into ASCII codes, mapping the converted ASCII codes to obtain a structure two-dimensional matrix, and inputting the structure two-dimensional matrix into a second channel; performing information entropy calculation on the API sequences of the API data set to obtain the API call information entropy, and inputting the two-dimensional matrix of the API call information entropy into a third channel; and visualizing the information of the first channel, the second channel and the third channel. Malicious code detection thereby becomes more intuitive and accurate, and detection efficiency is improved.

Description

Malicious code visual classification method, system, equipment and medium

Technical Field

The application relates to the technical field of network information security, in particular to a malicious code visual classification method, a system, equipment and a medium.

Background

With the rapid development of the digital economy and its deep integration with the real economy and society, the number of network attacks that use malicious code as a tool is growing rapidly, posing a serious threat to the security situation.

At present, malicious code is detected mainly by static analysis and dynamic analysis. Both achieve certain results in practice, but these traditional methods often consume a great deal of time and effort and occupy a large amount of storage space. As the amount of malicious code grows rapidly, the traditional methods fall short in coping with new malicious code. Static analysis mainly uses the static structure and code of a program to judge whether the program is malicious. Although static methods can accurately capture the static features of malicious code, the feature type is single, and malicious code can escape detection through techniques such as obfuscation or packing, which reduces the detection effect. Dynamic analysis mainly detects and monitors the malicious behavior of code at run time, but its timeliness in detecting malicious code is poor. These problems remain to be solved.

Disclosure of Invention

In order to make malicious code detection more intuitive and accurate and to improve malicious code detection efficiency, the application provides a malicious code visual classification method, system, equipment and medium, and adopts the following technical scheme:

In a first aspect, the present application provides a malicious code visual classification method, including:

extracting a plurality of API sequences of malicious codes to construct an API data set;

according to the API data set, calculating to obtain a word vector two-dimensional matrix, and inputting the word vector two-dimensional matrix into the first channel;

converting a plurality of API sequences of the API data set into ASCII codes, mapping the converted ASCII codes to form a structure two-dimensional matrix, and inputting the structure two-dimensional matrix into a second channel;

performing information entropy calculation on a plurality of API sequences of the API data set to obtain an API call information entropy, and inputting a two-dimensional matrix of the API call information entropy into a third channel;

and carrying out visualization processing on the information of the first channel, the second channel and the third channel.

Preferably, the step of calculating a word vector two-dimensional matrix according to the API data set and inputting the word vector two-dimensional matrix into the first channel includes:

acquiring the two preceding words and the two following words of each API call in an API sequence to form a five-element tuple comprising the two preceding words, the two following words and the target API;

inputting the five-element tuple into a CBOW model, and obtaining embedding weights from the embedding layer of the CBOW model;

calculating a word vector two-dimensional matrix from the embedding weights;

inputting the word vector two-dimensional matrix into the first channel.

Preferably, the specific steps of converting the plurality of API sequences of the API data set into ASCII codes, mapping the converted ASCII codes to form a structural two-dimensional matrix, and inputting the structural two-dimensional matrix into the second channel are as follows:

the API sequence is regarded as a long character string, and each character is converted into an integer vector in an ASCII code form according to the relation between the characters of the long character string and the ASCII code;

mapping the integer vector to pixel values in a set range to form a two-dimensional pixel matrix, and inputting the two-dimensional pixel matrix into the second channel.

Preferably, the calculating the information entropy of the plurality of API sequences of the API data set to obtain an API call information entropy, and inputting the two-dimensional matrix of the API call information entropy into the third channel further includes:

taking each pixel of the two-dimensional matrix as a unit, wherein a pixel and a plurality of pixels around it form a local region; calculating the API call information entropy from the information entropy of each local region, and filling the API call information entropy into an overall information entropy matrix to obtain the two-dimensional matrix of the API call information entropy;

and inputting the two-dimensional matrix of the API call information entropy into a third channel.

Preferably, the method further comprises:

inputting the visual image into a trained classifier for family classification.

Preferably, the specific steps of inputting the visual image into a trained classifier for family classification are as follows:

visualization processing is carried out to obtain a visualized image;

forming a visual image dataset from a number of visual images;

taking 80% of the RGB images as training images and inputting them into a convolutional neural network (CNN) for training to obtain a trained classifier; and taking the remaining 20% of the RGB images as verification images and inputting them into the trained CNN classifier to obtain a family classification result.

Preferably, the specific steps of extracting the API sequences of the plurality of malicious codes are as follows:

dynamically analyzing the original malicious code data in a sandbox environment to obtain dynamic analysis reports, and extracting the required dynamic feature APIs from each report.

In a second aspect, the present application provides a malicious code visualization classification system, comprising:

a data set construction module, configured to extract API sequences of a plurality of malicious code samples and construct an API data set;

a first output module, configured to calculate a word vector two-dimensional matrix from the API data set and input the word vector two-dimensional matrix into a first channel;

a second output module, configured to convert the API sequences of the API data set into ASCII codes, map the converted ASCII codes to form a structure two-dimensional matrix, and input the structure two-dimensional matrix into a second channel;

a third output module, configured to perform information entropy calculation on the API sequences of the API data set to obtain the API call information entropy, and input the two-dimensional matrix of the API call information entropy into a third channel;

a visualization processing module, configured to visualize the information of the first channel, the second channel and the third channel.

In a third aspect, the present application provides a malicious code visual classification device comprising a memory storing a computer program and a processor arranged to run the computer program to perform a malicious code visual classification method as described previously.

In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the malicious code visualization classification method as described above when run.

To sum up, compared with the prior art, the beneficial effects of the technical scheme provided by the application at least include the following. An API data set is constructed from the API sequences of malicious code, and a word vector two-dimensional matrix, a structure two-dimensional matrix and a two-dimensional matrix of the API call information entropy are obtained from the API data set. The word vector two-dimensional matrix is input into the first channel, the structure two-dimensional matrix into the second channel, and the two-dimensional matrix of the API call information entropy into the third channel, and visualization processing is then performed. The visualization method converts the feature processing problem into an image recognition problem, which saves the workload and complexity of the tedious feature engineering or reverse engineering in traditional detection and simplifies detection and classification. The resulting visual image makes malicious code detection more intuitive and accurate and improves malicious code detection efficiency.

Drawings

Fig. 1 is a flow chart of a malicious code visual classification method according to an embodiment of the present application.

Fig. 2 is a schematic diagram of a CBOW model architecture according to an embodiment of the present application.

Fig. 3 is a schematic diagram of a convolutional neural network architecture according to an embodiment of the present application.

Fig. 4 is a schematic structural diagram of a malicious code visual classification system according to an embodiment of the present application.

Reference numerals illustrate:

1. a data set construction module; 2. a first output module; 3. a second output module; 4. a third output module; 5. a visualization processing module.

Detailed Description

The application is described in further detail below with reference to Figs. 1-4. The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting.

Presenting malicious code in the form of an image makes malicious code detection intuitive and concise, overcomes the complexity and time cost of the traditional detection flow, and effectively improves detection and classification efficiency. However, existing research applies visualization mostly to static detection, striving to further improve detection efficiency on top of static analysis, which is not time-consuming, while neglecting the real-time and process-oriented nature of dynamic analysis. As malicious code keeps changing and new attack modes emerge, more attention needs to be paid to dynamic run-time features in order to broaden applicability. Single-channel grayscale images are widely used because they are simple to produce, but the content they can display is limited; the application therefore provides a three-channel image visualization classification scheme.

Referring to fig. 1, a malicious code visual classification method according to the present application specifically includes:

Step S1: extracting API sequences of a plurality of malicious code samples to construct an API data set.

Step S2: calculating a word vector two-dimensional matrix from the API data set, and inputting the word vector two-dimensional matrix into the first channel.

Step S3: converting the API sequences of the API data set into ASCII codes, mapping the converted ASCII codes to form a structure two-dimensional matrix, and inputting the structure two-dimensional matrix into the second channel.

Step S4: performing information entropy calculation on the API sequences of the API data set to obtain the API call information entropy, and inputting the two-dimensional matrix of the API call information entropy into the third channel.

Step S5: visualizing the information of the first channel, the second channel and the third channel.

Specifically, in the embodiment of the application, an API data set is constructed from the API sequences of malicious code, and a word vector two-dimensional matrix, a structure two-dimensional matrix and a two-dimensional matrix of the API call information entropy are obtained from the API data set. The word vector two-dimensional matrix is input into the first channel, the structure two-dimensional matrix into the second channel, and the two-dimensional matrix of the API call information entropy into the third channel, and visualization processing is then performed. The visualized image obtained by the visualization processing is specifically an RGB image. The visualization method converts the feature processing problem into an image recognition problem, which saves the workload and complexity of the tedious feature engineering or reverse engineering in traditional detection, simplifies detection and classification, makes malicious code detection more intuitive and accurate, and improves malicious code detection efficiency. Compared with a single-channel grayscale image, the RGB image chosen by the application carries more information. The RGB image is generated on the three-channel color principle: besides texture features and gray levels, it also shows more obvious and intuitive color features, so the picture covers more effective information and is more distinguishable, which enhances classification accuracy to a certain extent and improves classification efficiency.

As one implementation mode, the specific step of extracting the API sequences of a plurality of malicious codes is to dynamically analyze the original malicious code data in a sandbox environment to obtain dynamic analysis reports, and then extract the required dynamic characteristic APIs from each report.

Specifically, the API sequences of all samples in the original data set are extracted to construct the API data set. In the embodiment of the application, the original data contains malicious code samples from 13 families. After a sandbox dynamically analyzes each sample and produces a JSON report, the dynamic feature API sequence is extracted from the report and an API feature data set is constructed, in which each sample is one API sequence and each sequence consists of a number of API calls. Constructing the API data set allows the raw data of the malicious code to be captured more comprehensively and provides the initial data for accurate classification.
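The extraction step can be sketched as follows. This is only an illustration: the application does not disclose the layout of the JSON report, so the nesting `behavior` / `processes` / `calls` / `api` and the helper names `extract_api_sequence` and `build_api_dataset` are assumptions made for the example.

```python
import json

def extract_api_sequence(report_text):
    """Return the ordered list of API names found in one sandbox report.

    Assumed (hypothetical) layout: report["behavior"]["processes"] is a list
    of processes, each with a "calls" list whose items carry an "api" name.
    """
    report = json.loads(report_text)
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:
                sequence.append(name)
    return sequence

def build_api_dataset(reports):
    """Map sample id -> API sequence; `reports` maps sample id -> JSON text."""
    return {sample_id: extract_api_sequence(text) for sample_id, text in reports.items()}

# Toy report standing in for one sandbox analysis result.
sample_report = json.dumps({
    "behavior": {"processes": [
        {"calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
        {"calls": [{"api": "RegOpenKeyExW"}]},
    ]}
})
api_dataset = build_api_dataset({"sample_0": sample_report})
```

Each entry of `api_dataset` is one sample, i.e. one API sequence, which matches the structure of the API feature data set described above.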

As one embodiment, the specific steps of calculating a word vector two-dimensional matrix from the API data set and inputting it into the first channel are: with a context window of 2, the 4 context words of each API call in the API sequence are taken as model input, and the embedding weights obtained by training the model form the word vector two-dimensional matrix.

Specifically, referring to Fig. 2, the word vector two-dimensional matrix is obtained with a CBOW (continuous bag-of-words) model. The process is essentially word embedding: the embedding weights of the trained model serve as the word vector two-dimensional matrix. For each API sequence in a sample, a context window of two words is taken on each side: the target word is denoted w(t), its two preceding context words are w(t-2) and w(t-1), and its two following context words are w(t+1) and w(t+2). The context words are the input of the CBOW model and the target word is its output.

A CBOW model is defined to represent API words as dense low-dimensional vectors. The model contains an embedding layer that converts word indices into embedding vectors. Forward propagation is implemented by the forward method, which averages the embedding vectors of all context words.

Data is read and preprocessed, and a vocabulary and a training set are constructed. Each API word is read and preprocessed by removing the underscores and hyphens that appear in some words; the occurrences of each word are counted, and the words are sorted by frequency to build the vocabulary. With a context window size of 2, each target word and its two context words on either side form a five-element tuple; the tuples of all words are added to the training set. At the same time, a word-to-index mapping dictionary and an index-to-word mapping dictionary are constructed.
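The vocabulary and training-set construction can be sketched as follows. The function names are illustrative, and two details are assumptions not stated in the application: ties in frequency are broken alphabetically, and positions closer than two words to either end of a sequence are skipped rather than padded.

```python
from collections import Counter

def preprocess(api_name):
    """Remove underscores and hyphens from an API word."""
    return api_name.replace("_", "").replace("-", "")

def build_vocab(sequences):
    """Return (word_to_index, index_to_word), most frequent word first."""
    counts = Counter(word for seq in sequences for word in seq)
    # Sort by descending frequency; ties broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    word_to_index = {w: i for i, w in enumerate(ordered)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

def build_training_set(sequences, window=2):
    """Build (w(t-2), w(t-1), w(t), w(t+1), w(t+2)) tuples for interior positions."""
    tuples = []
    for seq in sequences:
        for t in range(window, len(seq) - window):
            tuples.append(tuple(seq[t - window : t + window + 1]))
    return tuples

raw = [["Nt_CreateFile", "NtWriteFile", "NtClose", "NtWriteFile", "Reg-OpenKey", "NtClose"]]
sequences = [[preprocess(w) for w in seq] for seq in raw]
word_to_index, index_to_word = build_vocab(sequences)
training_set = build_training_set(sequences)
```

In each tuple the middle element is the target word and the other four are its context words, so a six-call sequence yields two training tuples.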

The CBOW model is trained to obtain the embedding weights. The loss function is the negative log-likelihood loss and the optimizer is stochastic gradient descent (SGD). Training iterates over the training data: the target word and the context words are converted into indices using the mapping dictionary, then the forward computation, loss computation and back propagation of the model are performed, and the model parameters are updated by the optimizer. During training, the function accumulates the total loss, reports the index of each training round and its total loss, and finally returns the trained embedding weights. After training and tuning, the best parameters are selected: the batch size is set to 23995, the number of training rounds to 20, the learning rate to 0.0001, and the gradient descent is set to 0.4. After training, the model embedding weights are obtained by using model.
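To make the training loop concrete, the following is a minimal pure-Python sketch of a CBOW model with an embedding layer, a linear output layer, a negative log-likelihood loss and plain SGD updates. It is not the implementation of the application: the class name `TinyCBOW`, the toy vocabulary size, the embedding dimension and the learning rate of 0.1 are assumptions chosen so the example runs quickly on a single training tuple.

```python
import math
import random

class TinyCBOW:
    """Embedding layer plus linear output layer, following the CBOW description."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.embed = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
        self.out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

    def forward(self, context):
        """Average the context embeddings, then return log-probabilities over the vocabulary."""
        dim = len(self.embed[0])
        hidden = [sum(self.embed[i][d] for i in context) / len(context) for d in range(dim)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.out]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        return hidden, [z - log_norm for z in logits]

    def sgd_step(self, context, target, lr):
        """One SGD update on the negative log-likelihood; returns the loss before the update."""
        hidden, log_probs = self.forward(context)
        loss = -log_probs[target]
        # Gradient of the loss with respect to each logit: softmax - one_hot(target).
        grad_logits = [math.exp(lp) - (1.0 if k == target else 0.0)
                       for k, lp in enumerate(log_probs)]
        grad_hidden = [sum(g * row[d] for g, row in zip(grad_logits, self.out))
                       for d in range(len(hidden))]
        for k, g in enumerate(grad_logits):
            self.out[k] = [w - lr * g * h for w, h in zip(self.out[k], hidden)]
        for i in context:
            self.embed[i] = [e - lr * gh / len(context)
                             for e, gh in zip(self.embed[i], grad_hidden)]
        return loss

model = TinyCBOW(vocab_size=4, dim=3)
context, target = [2, 1, 1, 3], 0   # indices of w(t-2), w(t-1), w(t+1), w(t+2) and of w(t)
losses = [model.sgd_step(context, target, lr=0.1) for _ in range(50)]
embedding_weights = model.embed     # one row per API word
```

After training, the rows of `embedding_weights` play the role of the embedding weights that the text describes as the basis of the word vector two-dimensional matrix.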

Normalization. The embedding weights are padded or truncated to the target dimension so that all word vectors have the same dimension. The word vectors are normalized to the range [0,1], mapped to the pixel range [0,255], and converted into the required 64×64 word vector two-dimensional matrix, which serves as the first channel of the RGB image.
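A minimal sketch of this normalization step is given below. The application does not say how the weights are flattened or what value is used for padding, so the row-major flattening, the zero padding and the helper name `to_pixel_matrix` are assumptions.

```python
def to_pixel_matrix(values, side=64):
    """Pad or truncate a flat list to side*side values, then scale to integers in [0, 255]."""
    target = side * side
    flat = list(values[:target]) + [0.0] * max(0, target - len(values))
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0          # avoid division by zero for constant input
    # Min-max normalization to [0, 1], then mapping to the pixel range [0, 255].
    pixels = [round((v - low) / span * 255) for v in flat]
    return [pixels[r * side : (r + 1) * side] for r in range(side)]

embedding_weights = [[-0.2, 0.1, 0.4], [0.3, -0.5, 0.0]]    # toy weights
flat_weights = [v for row in embedding_weights for v in row]  # row-major flattening (assumption)
word_vector_channel = to_pixel_matrix(flat_weights)
```

The smallest weight maps to pixel value 0 and the largest to 255, and the result always has 64 rows of 64 values regardless of the vocabulary size.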

As one embodiment, the API sequences of the API data set are converted into ASCII codes, the converted ASCII codes are mapped to form a structure two-dimensional matrix, and the structure two-dimensional matrix is input into the second channel. The API sequence is regarded as a long character string, each character is converted into ASCII code form to give an integer vector, and the integer vector is mapped to pixel values in a set range to form a two-dimensional pixel matrix.

Specifically, the structure two-dimensional matrix is obtained with an improved B2Img algorithm. The original B2Img algorithm converts the binary of malicious code into a vector of 8-bit unsigned integers and then maps that vector to a grayscale image. Following this idea, the embodiment of the application splits each API into individual characters and converts each character into its ASCII code, with API calls named unknown uniformly treated as 0, thereby converting the APIs into integers. The ASCII codes are then mapped, through a mapping relation, to a two-dimensional matrix with pixel values in the range [0,255], and a nearest neighbor interpolation algorithm normalizes the matrix to a 64×64 two-dimensional matrix, which serves as the second channel of the RGB image.
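The ASCII mapping and nearest neighbor resizing can be sketched as follows. The application does not specify the mapping relation or the intermediate matrix shape, so this example assumes that the ASCII codes are used directly as pixel values and are laid out row by row in the smallest zero-padded square; all function names are illustrative.

```python
import math

def api_sequence_to_codes(api_sequence):
    """Turn each character of each API name into its ASCII code; 'unknown' calls become 0."""
    codes = []
    for name in api_sequence:
        if name == "unknown":
            codes.append(0)
        else:
            # ASCII codes are at most 127; the clamp only guards non-ASCII input.
            codes.extend(min(ord(ch), 255) for ch in name)
    return codes

def codes_to_square(codes):
    """Lay the codes out row by row in the smallest square that holds them, zero-padded."""
    side = math.isqrt(max(len(codes) - 1, 0)) + 1
    padded = codes + [0] * (side * side - len(codes))
    return [padded[r * side : (r + 1) * side] for r in range(side)]

def resize_nearest(matrix, size=64):
    """Nearest neighbor resize of a square matrix to size x size."""
    src = len(matrix)
    return [[matrix[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]

codes = api_sequence_to_codes(["NtClose", "unknown", "NtOpenFile"])
structure_channel = resize_nearest(codes_to_square(codes))
```

Nearest neighbor interpolation only copies existing values, so the resized matrix contains exactly the ASCII codes of the original layout and no blended intermediate values.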

As one embodiment, performing information entropy calculation on the API sequences of the API data set to obtain the API call information entropy and inputting its two-dimensional matrix into the third channel further includes: taking each pixel of the two-dimensional matrix as a unit, where a pixel and a plurality of pixels around it form a local region; and calculating the API call information entropy from the information entropy of each local region and filling it into an overall information entropy matrix to obtain the two-dimensional matrix of the API call information entropy.

Specifically, the information entropy two-dimensional matrix is obtained by calculating the Shannon entropy, i.e. the information entropy. Information entropy is a quantitative index of the information content of a system and reflects the uncertainty of a feature. The information entropy formula is:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$

where $p(x_i)$ is the probability of occurrence of event $x_i$, $-\log p(x_i)$ is the information content of event $x_i$, and $H(X)$ is the average information content of the random variable $X$.

The local information entropy of the API calls in each API sequence sample is calculated according to the information entropy formula.
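As a small worked example of the Shannon entropy calculation, the sketch below computes the entropy of the empirical distribution of API calls in one toy sequence. The base-2 logarithm and the function name `shannon_entropy` are assumptions; the application does not state the logarithm base.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """H(X) = -sum p(x_i) * log2 p(x_i) over the distinct values in `items`."""
    total = len(items)
    if total == 0:
        return 0.0
    counts = Counter(items)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

api_sequence = ["NtCreateFile", "NtWriteFile", "NtWriteFile", "NtClose"]
entropy = shannon_entropy(api_sequence)   # probabilities 1/4, 1/2, 1/4 give 1.5 bits
```

A sequence that repeats a single API call has entropy 0, while a sequence whose calls are all different reaches the maximum for its length.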

An entropy two-dimensional matrix is initialized according to the generated structure two-dimensional matrix. Taking each pixel of the structure two-dimensional matrix as a unit, the local entropy of the nine pixel points around each pixel point is calculated and filled into the entropy two-dimensional matrix. The matrix filled with all entropy values is the two-dimensional matrix of the API call information entropy, which serves as the third channel of the RGB image.
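The local entropy matrix can be sketched as follows. The example interprets the nine pixel points as the 3×3 window centered on each pixel, truncates the window at the image border, and scales the entropy values linearly to [0,255]; these three choices and the function names are assumptions, since the application does not spell them out.

```python
import math
from collections import Counter

def local_entropy_matrix(matrix, radius=1):
    """For each pixel, compute the Shannon entropy of its (2*radius+1)^2 neighbourhood."""
    rows, cols = len(matrix), len(matrix[0])
    entropy = [[0.0] * cols for _ in range(rows)]   # initialized entropy matrix
    for r in range(rows):
        for c in range(cols):
            # Window is truncated at the borders (assumption).
            window = [
                matrix[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            counts = Counter(window)
            entropy[r][c] = -sum(
                (n / len(window)) * math.log2(n / len(window)) for n in counts.values()
            )
    return entropy

def scale_to_pixels(matrix):
    """Scale entropy values linearly to integers in [0, 255]."""
    high = max(max(row) for row in matrix) or 1.0
    return [[round(v / high * 255) for v in row] for row in matrix]

structure = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 7, 9], [0, 0, 9, 7]]
entropy_channel = scale_to_pixels(local_entropy_matrix(structure))
```

Uniform regions of the structure matrix produce entropy 0 (dark pixels), while regions where many different values meet produce high entropy (bright pixels).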

As one embodiment, the specific steps of synthesizing the two-dimensional matrices of the first channel, the second channel and the third channel to obtain the RGB visual image are as follows: the two-dimensional matrices of the three channels are normalized to 64×64 and input into the corresponding color channels, so that each pixel of the synthesized two-dimensional matrix contains 3 elements, giving the RGB visual image.

Specifically, after the information input steps of the first channel, the second channel and the third channel are completed, the obtained two-dimensional matrices are visualized: the three matrices are combined into one two-dimensional matrix whose pixel values each contain three elements, which is then rendered as an RGB image. According to the RGB three-channel principle, the word vector two-dimensional matrix, the structure two-dimensional matrix and the information entropy two-dimensional matrix are passed into the red, green and blue channels respectively and visualized as RGB images, so that an RGB image data set is constructed as the input of the convolutional neural network.
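The channel synthesis can be sketched as follows. The three input matrices here are placeholder data, and the helper names `compose_rgb` and `to_ppm` are illustrative; writing the result as a plain-text PPM file is a choice made for this example only, so that it needs no imaging library.

```python
def compose_rgb(red, green, blue):
    """Stack three equally sized matrices into one matrix of (R, G, B) pixel triples."""
    if not (len(red) == len(green) == len(blue)):
        raise ValueError("channels must have the same number of rows")
    image = []
    for r_row, g_row, b_row in zip(red, green, blue):
        if not (len(r_row) == len(g_row) == len(b_row)):
            raise ValueError("channels must have the same number of columns")
        image.append(list(zip(r_row, g_row, b_row)))
    return image

def to_ppm(image):
    """Serialize the image as plain-text PPM (P3) so it can be opened in an image viewer."""
    header = f"P3\n{len(image[0])} {len(image)}\n255\n"
    body = "\n".join(" ".join(f"{r} {g} {b}" for r, g, b in row) for row in image)
    return header + body + "\n"

size = 64
word_vector = [[(r + c) % 256 for c in range(size)] for r in range(size)]   # placeholder data
structure = [[(r * c) % 256 for c in range(size)] for r in range(size)]     # placeholder data
entropy = [[(r * 4) % 256 for c in range(size)] for r in range(size)]       # placeholder data
rgb_image = compose_rgb(word_vector, structure, entropy)
```

Each pixel of `rgb_image` is a triple whose red, green and blue components come from the word vector, structure and information entropy matrices respectively.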

As one embodiment, the visual image is input into a trained classifier for family classification. The specific steps are as follows: a visual image data set is formed from a number of visual images; 80% of the RGB images are taken as training images and input into a convolutional neural network (CNN) for training to obtain a trained classifier; and the remaining 20% of the RGB images are taken as verification images and input into the trained CNN classifier to obtain a family classification result.

Specifically, referring to Fig. 3, the convolutional neural network (CNN) used in the embodiment of the application contains three groups of convolutional and pooling layers, followed by two fully connected layers, with a softmax layer as output. The three convolutional layers have 20, 50 and 100 convolution kernels respectively, each of size 3×3, and extract the features of the input image. The pooling layers are all max pooling layers of size 2×2 with a stride of 2, used for feature selection and downsampling. After the feature processing of the three groups of convolutional and pooling layers, the fully connected layers classify the samples into families, and the softmax layer outputs the classification result.
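The effect of the three convolution and pooling groups on a 64×64 input can be traced with simple arithmetic. The sketch below assumes a convolution stride of 1 and no padding, which the application does not state; with a padding of 1 the sizes differ, as the `padding` parameter shows.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial size after a convolution (floor division, as in common frameworks)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1

def trace_shapes(input_size=64, in_channels=3, filters=(20, 50, 100), padding=0):
    """Return [(channels, height, width), ...] for the input and after each conv+pool group."""
    shapes = [(in_channels, input_size, input_size)]
    size = input_size
    for n_filters in filters:
        size = pool_out(conv_out(size, padding=padding))
        shapes.append((n_filters, size, size))
    return shapes

shapes = trace_shapes()
channels, height, width = shapes[-1]
flattened_features = channels * height * width   # input width of the first fully connected layer
```

Under these assumptions the feature maps shrink from 64×64 to 31×31, 14×14 and 6×6, so the first fully connected layer receives 100 × 6 × 6 = 3600 features.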

The RGB image data set is divided into a training set and a test set in a ratio of 8:2; the training set is fed into the convolutional neural network for training to obtain the CNN classifier, and the test set is fed into the trained classifier to obtain the classification result.
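The 8:2 division can be sketched as follows. The application does not say whether the split is random or stratified by family, so the seeded random shuffle, the file names and the helper name `split_dataset` are assumptions made for the example.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (training, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy data set: (image name, family label) pairs spread over 13 families.
dataset = [(f"image_{i}.png", i % 13) for i in range(100)]
train_set, test_set = split_dataset(dataset)
```

Fixing the seed keeps the split reproducible, which makes classification results comparable across training runs.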

Referring to fig. 4, a malicious code visual classification system is provided for an embodiment of the present application, the system including:

data set construction module 1: configured to extract API sequences of a plurality of malicious code samples and construct an API data set;

the first output module 2: configured to calculate a word vector two-dimensional matrix from the API data set and input the word vector two-dimensional matrix into a first channel;

the second output module 3: configured to convert the API sequences of the API data set into ASCII codes, map the converted ASCII codes to form a structure two-dimensional matrix, and input the structure two-dimensional matrix into a second channel;

the third output module 4: configured to perform information entropy calculation on the API sequences of the API data set to obtain the API call information entropy, and input the two-dimensional matrix of the API call information entropy into a third channel;

visualization processing module 5: configured to visualize the information of the first channel, the second channel and the third channel;

a classification module: configured to input the visual image into a trained classifier for family classification.

Specifically, the data set construction module 1 extracts the API sequences of the plurality of malicious code samples by dynamically analyzing the original malicious code data in a sandbox environment to obtain dynamic analysis reports, and then extracting the required dynamic feature APIs from each report.

The first output module 2 acquires the two preceding words and the two following words of each API call in the API sequence to form a five-element tuple comprising the two preceding words, the two following words and the target API. The five-element tuple is input into a CBOW model, and the embedding weights are obtained from the embedding layer of the CBOW model. A word vector two-dimensional matrix is calculated from the embedding weights and input into the first channel.

The second output module 3 regards the API sequence as a long string, and converts each character into an integer vector in the form of ASCII code according to the relationship between the characters of the long string and the ASCII code. Mapping the integer vector to pixel values in a set range to form a two-dimensional pixel matrix, and inputting the two-dimensional pixel matrix into the second channel.

The third output module 4 takes each pixel of the two-dimensional matrix as a unit, where a pixel and a plurality of pixels around it form a local region. The API call information entropy is calculated from the information entropy of each local region and filled into an overall information entropy matrix to obtain the two-dimensional matrix of the API call information entropy, which is input into the third channel.

The visualization processing module 5 performs visualization processing to obtain a visualized image, and the visualized image is input into the classification module for family classification. Specifically, a visual image data set is formed from a number of visual images; 80% of the RGB images are taken as training images and input into a convolutional neural network (CNN) for training to obtain a trained classifier; and the remaining 20% of the RGB images are taken as verification images and input into the trained CNN classifier to obtain a family classification result.

Embodiments of the present application provide a malicious code visual classification device comprising a memory storing a computer program and a processor arranged to run the computer program to perform the malicious code visual classification method as described above.

Embodiments of the present application provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform a malicious code visualization classification method as described above when run.

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus and the product described above may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again.

In the several embodiments provided herein, it should be understood that the disclosed methods, systems, apparatus, and program products may be embodied in other ways.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method for visually classifying malicious code, comprising:

2. The method for visually classifying malicious codes according to claim 1, wherein the specific steps of calculating a word vector two-dimensional matrix according to the API data set and inputting the word vector two-dimensional matrix into the first channel are as follows:

inputting the word vector two-dimensional matrix into the first channel.

3. The method for visually classifying malicious codes according to claim 1, wherein the specific steps of converting a plurality of API sequences of the API data set into ASCII codes, mapping the converted ASCII codes to form a structural two-dimensional matrix, and inputting the structural two-dimensional matrix to the second channel are as follows:

4. The method for visually classifying malicious codes according to claim 3, wherein the step of performing information entropy calculation on the plurality of API sequences of the API data set to obtain an API call information entropy, and the step of inputting a two-dimensional matrix of the API call information entropy into the third channel further comprises:

5. The malicious code visualization classification method of claim 1, further comprising:

visualization processing is carried out to obtain a visualized image;

inputting the visual image into a trained classifier for family classification.

6. The malicious code visual classification method according to claim 5, wherein the specific steps of inputting the visual image into a trained classifier for family classification are:

forming a visual image dataset from a number of visual images;

7. The method for visually classifying malicious codes according to claim 5, wherein the specific steps of extracting the API sequences of the plurality of malicious codes are as follows:

8. A malicious code visualization classification system, comprising:

9. A malicious code visual classification device comprising a memory storing a computer program and a processor arranged to run the computer program to perform the malicious code visual classification method of any one of claims 1-7.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the malicious code visual classification method according to any of claims 1-7 at run-time.