CN115374439A

CN115374439A - Malicious code detection method and device and computer equipment

Info

Publication number: CN115374439A
Application number: CN202211027991.3A
Authority: CN
Inventors: 刘迪; 崔逸群; 杨东; 毕玉冰; 燕前; 白发琪; 刘超飞; 朱博迪; 肖力炀; 刘骁; 刘鹏飞; 王文庆; 邓楠轶
Original assignee: Xian Thermal Power Research Institute Co Ltd; Huaneng Power International Inc
Current assignee: Xian Thermal Power Research Institute Co Ltd; Huaneng Power International Inc
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2022-11-22

Abstract

The invention discloses a malicious code detection method, a malicious code detection device and computer equipment. The method comprises: obtaining file information of an executable file, wherein the file information comprises multiple types of characteristic information of the executable file, so that the executable file can be accurately described by the file information; converting the characteristic information in the executable file completely into a two-dimensional matrix through a preset algorithm; converting the two-dimensional matrix into a multichannel RGB image for visual display; predicting each characteristic of the RGB image through a preset model, which solves the problem of incomplete characteristic extraction; and finally judging accurately, according to the prediction result, whether the executable file includes malicious code.

Description

Malicious code detection method and device and computer equipment

Technical Field

The invention relates to the technical field of internet security, in particular to a malicious code detection method and device and computer equipment.

Background

With the rapid development of the internet, the mobile internet, the internet of things and industrial control networks, information technology has gradually permeated all walks of life, bringing convenience to daily life but also network security problems. Detecting malicious external code is an important part of protecting network integrity. In the prior art, most malicious code detection technologies use machine learning methods such as the random forest algorithm and improved SVMs (support vector machines) to perform binary classification for intrusion detection, so that intrusion behavior is recognized and an early warning is issued. However, the prior-art detection methods easily fall into local optima, and their feature extraction for malicious code is not comprehensive.

Disclosure of Invention

Therefore, to overcome the defects in the prior art, embodiments of the present invention provide a method and an apparatus for detecting malicious codes, and a computer device.

According to a first aspect, an embodiment of the present invention discloses a malicious code detection method, including:

acquiring file information of an executable file;

converting the file information into a two-dimensional matrix through a preset algorithm;

converting the two-dimensional matrix into an RGB image;

predicting the RGB image by using a preset model to obtain a prediction result;

determining whether the executable file includes malicious code according to the prediction result.

Optionally, the prediction result includes a category of each first feature in the RGB image, and determining whether the executable file includes a malicious code according to the prediction result specifically includes:

determining the probability of the category by using a preset algorithm;

and when the probability meets a preset condition, the executable file comprises malicious codes.

Optionally, converting the file information into a two-dimensional matrix through a preset algorithm specifically includes:

converting the file information into decimal data to obtain an R channel data matrix;

disassembling the file information to obtain code information of the executable file;

obtaining a G channel data matrix according to the code information;

obtaining a B channel data matrix according to the data structure of the file information;

and obtaining a two-dimensional matrix according to the R channel data matrix, the G channel data matrix and the B channel data matrix.

Optionally, obtaining the G-channel data matrix according to the code information specifically includes:

extracting first feature information from the code information;

and determining G-channel data according to the first characteristic information.

Optionally, the preset models comprise a first preset model and a second preset model,

predicting the RGB image by using a preset model to obtain a prediction result, which specifically comprises the following steps:

performing feature extraction on the RGB image by using a first preset model to obtain at least one piece of second feature information;

and respectively identifying each second characteristic information by using a second preset model to obtain a category corresponding to each second characteristic information.

Optionally, the categories include a first category and a second category, and when it is determined that the executable file includes malicious code according to the prediction result, the method further includes:

and determining the position of the malicious code in the executable file according to second characteristic information corresponding to a second category, wherein the first category is used for indicating that the second characteristic information is normal characteristic information, and the second category is used for indicating that the second characteristic information is abnormal characteristic information.

According to a second aspect, an embodiment of the present invention further discloses a malicious code detection apparatus, including:

the acquisition module is used for acquiring file information of the executable file;

the first conversion module is used for converting the file information into a two-dimensional matrix through a preset algorithm;

the second conversion module is used for converting the two-dimensional matrix into an RGB image;

the prediction module is used for predicting the RGB image by using a preset model to obtain a prediction result;

and the determining module is used for determining whether the executable file comprises malicious codes according to the prediction result.

Optionally, the prediction result includes a category of each first feature in the RGB image, and the determining module specifically includes:

the probability determining module is used for determining the probability of the category by utilizing a preset algorithm;

and the judging module is used for judging that the executable file comprises the malicious codes when the probability meets the preset condition.

According to a third aspect, an embodiment of the present invention further discloses a computer device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the malicious code detection method according to the first aspect or any of the alternative embodiments of the first aspect.

According to a fourth aspect, the embodiments of the present invention also disclose a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the malicious code detection method according to the first aspect or any optional embodiment of the first aspect.

The technical scheme of the invention has the following advantages:

The invention provides a malicious code detection method, a malicious code detection device and computer equipment. The method comprises: obtaining file information of an executable file, wherein the file information comprises multiple types of characteristic information of the executable file, so that the executable file can be accurately described by the file information; converting the characteristic information in the executable file completely into a two-dimensional matrix through a preset algorithm; converting the two-dimensional matrix into a multichannel RGB image for visual display; predicting each characteristic of the RGB image through a preset model, which solves the problem of incomplete characteristic extraction; and finally judging accurately, according to the prediction result, whether the executable file includes malicious code.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a schematic diagram of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 2 is a flowchart of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 3 is a flowchart of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 4 is a schematic diagram of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 5 is a schematic diagram of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 6 is a schematic diagram of a specific example of a malicious code detection method in an embodiment of the present invention;

Fig. 7 is a schematic block diagram of a specific example of a malicious code detection apparatus in an embodiment of the present invention;

Fig. 8 is a diagram of a specific example of a computer device in an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly, e.g., as meaning a fixed connection, a removable connection, or an integral connection; the connection can be mechanical or electrical; two elements may be directly connected or indirectly connected through an intermediate medium, may communicate with each other internally, and may be connected wirelessly or by wire. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

When an executable file is mapped from disk into memory, malicious code can be written into the gaps of the memory mapping, as indicated by the arrows in fig. 1, and the malicious code changes the address that the executable file points to, so that the file runs the malicious code first when it is executed. Identifying malicious code is therefore essential for defending against such attacks. This embodiment takes an industrial control network as an example. From top to bottom, an industrial control network mainly comprises a management layer, a monitoring layer and a device layer. The management layer usually adopts the TCP/IP protocol and connects to external networks through Ethernet, and this connection is frequently attacked by external malicious code.

To solve the technical problem mentioned in the background art, an embodiment of the present application provides a malicious code detection method, specifically referring to fig. 2, the method includes the following steps:

step 201, file information of the executable file is obtained.

The file information is, for example, the information contained in the executable file. Because an executable file is stored as binary information, the file information is the binary information of the executable file, which includes all information of the executable file.

Step 202, converting the file information into a two-dimensional matrix through a preset algorithm.

Illustratively, the preset algorithm may be the B2M algorithm, which converts binary information into a corresponding two-dimensional matrix, that is, a grayscale map; it can be implemented in Python. Every file stored on a computer is essentially binary information of the form 01010…. The B2M algorithm reads every 8 bits as an unsigned integer (so each value can be converted between binary and decimal), groups these integers into vectors of a specified line width (here a line width of 256 is used), and then represents the whole binary file as an array of such vectors. The result is a two-dimensional matrix in which each element lies in the range [0,255] (0 represents black and 255 represents white).
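
The byte-to-matrix conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the embodiment: the function name `bytes_to_matrix` is hypothetical, and zero-padding the last row is an assumption, since the text does not say how a trailing partial row is handled.

```python
from typing import List

LINE_WIDTH = 256  # line width named in the description


def bytes_to_matrix(data: bytes, width: int = LINE_WIDTH) -> List[List[int]]:
    """Read each byte as an unsigned integer and fold the bytes into rows.

    Zero-padding the final row is an assumption made for this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])   # each byte is already in 0..255
        row.extend([0] * (width - len(row)))    # pad the last row (assumption)
        rows.append(row)
    return rows
```

For a real file, the input would be the raw bytes read in binary mode; each row of the result is one line of the grayscale map.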

In an optional embodiment, as shown in fig. 3, converting the file information into a two-dimensional matrix through a preset algorithm specifically includes:

step 2021, convert the file information into decimal data to obtain an R channel data matrix.

Illustratively, the binary information is of the form 01010…, whereas file contents are usually displayed as hexadecimal numbers, so the conversion amounts to converting hexadecimal into decimal numbers. Converting hexadecimal into decimal through the B2M algorithm finally yields a matrix whose elements lie in [0,255], namely the R channel data matrix.

Step 2022, disassembling the file information to obtain code information of the executable file.

Illustratively, disassembly is a basic computing technique that converts machine language (binary information) into assembly language, that is, from the lowest-level representation into a more human-readable one. (Programs written in high-level languages such as C are convenient for programmers to write, but the computer cannot execute them directly.) Because the machine language at the lowest level is only 01010… and cannot reveal the characteristics of malicious code, the file is disassembled into assembly language (the code information), which makes the subsequent steps easier to perform.

Step 2023, obtaining a G channel data matrix according to the code information.

Illustratively, after the disassembly yields the code information, representative information (such as assembly instruction addresses, i.e., the first characteristic information) is extracted from it, and the assembly instruction addresses and the like are mapped to values in [0,255] and written into a two-dimensional matrix, thereby obtaining the G channel data.
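
As a rough sketch of this step, assume a disassembler has already produced a list of instruction addresses (a hypothetical input; the disassembly itself is not shown). The text does not specify how an address is mapped into [0,255]; taking the low byte of each address is an illustrative assumption, as are the function name and the zero-padding.

```python
from typing import Iterable, List


def addresses_to_g_channel(addresses: Iterable[int],
                           width: int, height: int) -> List[List[int]]:
    """Fill a height x width matrix with one value in [0, 255] per address.

    Using the low byte of each address is an illustrative choice only;
    the description leaves the actual mapping unspecified.
    """
    size = width * height
    values = [addr & 0xFF for addr in addresses][:size]
    values.extend([0] * (size - len(values)))   # zero-pad (assumption)
    return [values[r * width:(r + 1) * width] for r in range(height)]
```

The matrix dimensions would be chosen to match the R channel so that the three channels can later be combined.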

Step 2024, obtaining a B-channel data matrix according to the data structure of the file information.

Illustratively, a data structure is obtained from the file information; it may comprise a series of items such as the file header, section table, resource section, code section and entry point. These data structures are converted into data in a two-dimensional matrix to obtain the B channel data matrix.
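
To make the idea concrete, the sketch below reads two structural fields from a PE file (the number of sections and the entry-point address) and spreads their bytes over a matrix. The field offsets follow the public PE/COFF layout; the choice of fields, the little-endian byte encoding, the padding, and the function names are all assumptions of this sketch rather than the encoding used in the embodiment.

```python
import struct
from typing import Dict, List


def parse_pe_summary(data: bytes) -> Dict[str, int]:
    """Read the section count and entry-point RVA from PE headers."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)    # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    coff = pe_offset + 4                                    # COFF file header
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    optional = coff + 20                                    # COFF header is 20 bytes
    (entry_point,) = struct.unpack_from("<I", data, optional + 16)
    return {"num_sections": num_sections, "entry_point": entry_point}


def summary_to_b_channel(summary: Dict[str, int],
                         width: int, height: int) -> List[List[int]]:
    """Write each field as 4 little-endian bytes into a zero-padded matrix."""
    size = width * height
    raw = b"".join(struct.pack("<I", value) for value in summary.values())
    values = list(raw[:size])
    values.extend([0] * (size - len(values)))               # zero-pad (assumption)
    return [values[r * width:(r + 1) * width] for r in range(height)]
```

A fuller version would also encode the section table and resource information mentioned in the text; only two fields are used here to keep the example short.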

Step 2025, obtaining a two-dimensional matrix according to the R channel data matrix, the G channel data matrix, and the B channel data matrix.

Illustratively, the two-dimensional matrix corresponding to the RGB image is obtained by superimposing the R-channel data matrix, the G-channel data matrix and the B-channel data matrix as its three channels.
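
One way to read this step is that the three equally sized channel matrices are combined pixel by pixel. The sketch below assumes that interpretation; `merge_channels` is a hypothetical helper name.

```python
from typing import List, Tuple

Matrix = List[List[int]]
Pixel = Tuple[int, int, int]


def merge_channels(r: Matrix, g: Matrix, b: Matrix) -> List[List[Pixel]]:
    """Combine three equally sized matrices into a matrix of (R, G, B) pixels."""
    if not (len(r) == len(g) == len(b)):
        raise ValueError("channel matrices must have the same number of rows")
    image = []
    for row_r, row_g, row_b in zip(r, g, b):
        if not (len(row_r) == len(row_g) == len(row_b)):
            raise ValueError("channel rows must have the same length")
        image.append(list(zip(row_r, row_g, row_b)))
    return image
```

The size checks matter in practice because the three channels come from different sources and must be brought to a common shape first.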

Step 203, converting the two-dimensional matrix into an RGB image.

Illustratively, the conversion into an RGB image can likewise be implemented in Python. A single-channel grayscale image does not describe the characteristics of malicious code comprehensively enough. On this basis, the concept of an RGB image is introduced: the multi-channel nature of a color image is used to describe the characteristics of a malicious code file from multiple dimensions, which improves the detection efficiency.
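
The text does not name an image format or library. As a dependency-free sketch, the pixel matrix can be serialized as a binary PPM (P6) image, which most image viewers can open; the choice of PPM and the function name `to_ppm` are assumptions of this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_ppm(image: List[List[Pixel]]) -> bytes:
    """Serialize a matrix of (R, G, B) pixels as a binary PPM (P6) image."""
    height = len(image)
    width = len(image[0]) if height else 0
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    body = bytes(value for row in image for pixel in row for value in pixel)
    return header + body
```

Writing the returned bytes to a file opened in binary mode yields a viewable image of the executable.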

After the RGB image is obtained, it can be augmented to improve the accuracy of the prediction result, for example by flipping, rotating, translating or sharpening the image.
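
A few of these operations can be sketched on a plain pixel matrix as follows. Sharpening is omitted for brevity, and the function names, the 90-degree rotation angle, and the black fill value are illustrative assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def flip_horizontal(image: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image: Image) -> Image:
    """Reverse the order of the rows."""
    return [list(row) for row in image[::-1]]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate by 90 degrees clockwise (rows become columns)."""
    return [list(column) for column in zip(*image[::-1])]


def translate_right(image: Image, shift: int, fill: Pixel = (0, 0, 0)) -> Image:
    """Shift pixels right by `shift` columns, filling the gap with `fill`."""
    result = []
    for row in image:
        k = max(0, min(shift, len(row)))
        result.append([fill] * k + list(row[:len(row) - k]))
    return result


def augment(image: Image) -> List[Image]:
    """Return the original image plus a few transformed copies."""
    return [image, flip_horizontal(image), flip_vertical(image),
            rotate_90_clockwise(image), translate_right(image, 1)]
```

Each call to `augment` turns one training image into five, which enlarges the training set without collecting new samples.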

Step 204, predicting the RGB image by using a preset model to obtain a prediction result.

Illustratively, the preset model may be any of various types of machine learning models; for example, a convolutional neural network, a BP neural network, an SVM and the like may be used to predict the RGB image, so as to determine whether the executable file corresponding to the RGB image includes malicious code.

Preferably, in an alternative embodiment, the preset models include a first preset model and a second preset model,

and performing feature extraction on the RGB image by using a first preset model to obtain at least one piece of second feature information.

For example, in this embodiment, when the RGB image is recognized by the preset model, the malicious code is not uniformly distributed, so the size of the RGB image generated from it is difficult to fix; the size of the input image therefore has to be considered when selecting the prediction model.

Based on the above problem, in this embodiment, as shown in fig. 4, SPP-Net is used for prediction, which solves the problem of inconsistent input image sizes in the first stage. In the second stage, the SPP-Net network includes a ResNet network (the first preset model, i.e., the backbone network of the SPP-Net) and a pyramid pooling layer (the second preset model). The ResNet network extracts the features of the RGB image to obtain feature maps, the pyramid pooling layer classifies the feature maps, and each feature candidate area is then classified through the fully connected layer to obtain the category corresponding to each feature.
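
The reason pyramid pooling copes with inputs of different sizes is that it always produces an output of fixed length. The sketch below shows this for a single-channel feature map using max pooling; the pyramid levels (1, 2, 4) and the function name are assumptions, and in a real network the same pooling would be applied to every channel of the convolutional feature maps.

```python
from typing import List, Sequence


def spatial_pyramid_pool(feature_map: List[List[float]],
                         levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool a 2-D feature map over n x n grids for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * height) // n                  # floor of the bin start
            r1 = -(-((i + 1) * height) // n)        # ceiling of the bin end
            for j in range(n):
                c0 = (j * width) // n
                c1 = -(-((j + 1) * width) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled
```

With levels (1, 2, 4), a 5 x 7 map and a 13 x 9 map both yield 21 values, so the fully connected layer that follows can have a fixed input size.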

As the number of layers increases, ResNet mitigates problems such as vanishing and exploding gradients, so the residual network can be made deeper and extract more features, which improves the classification accuracy for malicious code in RGB images. SENet is introduced into each residual module of ResNet, and the salient malicious code characteristics in the RGB image are obtained through global average pooling and maximum pooling. The SENet part is shown in fig. 5: the obtained feature map is reduced by global average pooling and by maximum pooling to vectors of size 1 × 1 × C (C is the number of channels after convolutional feature extraction), and a weight vector is then obtained through two fully connected layers. By adding SENet on top of ResNet, the method guides the model to focus on the most important information in the channel domain during feature extraction, so that at least one piece of second feature information of the malicious code is obtained more accurately.
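
The channel-weighting idea can be sketched without a deep learning framework. The text does not say how the average-pooled and max-pooled vectors are combined or which activations are used; this sketch assumes both vectors pass through the same two fully connected layers (ReLU in between), the results are summed, and a sigmoid produces the weights. The weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters.

```python
import math
from typing import List

Channel = List[List[float]]


def _fully_connected(vector: List[float], weights: List[List[float]]) -> List[float]:
    """Multiply a weight matrix (one row per output) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def channel_weights(channels: List[Channel],
                    w1: List[List[float]], w2: List[List[float]]) -> List[float]:
    """Compute one weight in (0, 1) per channel from pooled statistics."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]
    mx = [max(map(max, ch)) for ch in channels]

    def mlp(vector: List[float]) -> List[float]:
        hidden = [max(0.0, h) for h in _fully_connected(vector, w1)]  # ReLU
        return _fully_connected(hidden, w2)

    # Summing the two branches is an assumption of this sketch.
    logits = [a + m for a, m in zip(mlp(avg), mlp(mx))]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]               # sigmoid


def reweight(channels: List[Channel], weights: List[float]) -> List[Channel]:
    """Scale every value of each channel by that channel's weight."""
    return [[[value * s for value in row] for row in ch]
            for ch, s in zip(channels, weights)]
```

Channels that receive a weight close to 1 pass through almost unchanged, while channels with a small weight are suppressed, which is how the model is steered toward the more informative channels.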

Step 205, determining whether the executable file includes malicious code according to the prediction result.

Illustratively, the prediction result includes a category of each feature in the input RGB image. In an optional embodiment, the prediction result includes a category of each first feature in the RGB image, and determining whether the executable file includes malicious code according to the prediction result specifically includes:

the probability of the class is determined using a preset algorithm.

Illustratively, as shown in fig. 5, after each feature map is obtained, the fully connected layer flattens the features of all channels into a column vector, and the softmax function gives the probability that the feature belongs to a certain class; a high probability indicates that the feature is activated and corresponds to malicious code found by the model.
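
The probability step can be illustrated as follows. The softmax itself is standard; the class index for "malicious" and the 0.5 threshold standing in for the "preset condition" are assumptions of this sketch, since the text does not fix them.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                              # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_malicious(logits: List[float],
                 malicious_index: int = 1, threshold: float = 0.5) -> bool:
    """Apply a hypothetical preset condition: probability >= threshold."""
    return softmax(logits)[malicious_index] >= threshold
```

In practice the threshold would be tuned on validation data to balance false alarms against missed detections.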

In an optional embodiment, the classes include a first class and a second class, and when it is determined from the prediction that the executable file includes malicious code, the method further includes:

Illustratively, the first category indicates that malicious code is not included and the second category indicates that malicious code is included. When the prediction result is that malicious code is included, the position of the malicious code may be determined through a regression function, or the corresponding position may be determined according to the code information of the feature corresponding to the malicious code, so that the malicious code can be dealt with.

Before a preset model is used to predict malicious code, it first needs to be trained. The training is performed on RGB images that are known to include malicious code, and it is complete when the evaluation indexes of the prediction model reach a preset standard. The evaluation indexes can be an accuracy and recall curve and a loss value, where the loss value is the difference between the true probability distribution and the predicted probability distribution, and the accuracy can be calculated from the following four counts:

wherein TP is the number of positive samples correctly predicted as positive, TN is the number of negative samples correctly predicted as negative, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples wrongly predicted as negative.
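
Given the four counts defined above, the usual evaluation metrics can be computed as in the sketch below. It assumes the standard definitions, accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP) and recall = TP / (TP + FN); these are the conventional formulas and are not quoted from the source.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that the model found."""
    if tp + fn == 0:
        raise ValueError("no positive samples")
    return tp / (tp + fn)
```

Sweeping the decision threshold and plotting precision against recall at each setting gives a P-R curve of the kind used for the comparison.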

According to the evaluation indexes of the model, comparing the recall (P-R curve) and the loss value of ResNet-SENet with those of ResNet shows that ResNet-SENet achieves higher accuracy and a lower loss value. Experiments with ResNet-SENet on the grayscale image and the RGB color image obtained through the B2M algorithm show an improvement in accuracy and a reduction in the loss value. Compared with classification algorithms commonly used in industrial control networks (SVM, random forest and the like), the overall model improves the accuracy and the loss value on both the training data set and the test set to a certain degree.

Fig. 6 shows the overall identification process for malicious code covered by the embodiments described above, where the PE file is the executable file. The entire process includes: converting executable files containing malicious code into RGB images; preprocessing and augmenting the RGB data; using one part of the data to train the model and another part to test the trained model; and finally predicting the executable file with the trained model.
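
The division of the data into a training part and a testing part can be sketched as a seeded random split. The 80/20 ratio, the seed, and the function name are assumptions; the text only states that one part of the data is used for training and another for testing.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples reproducibly and split them into train and test."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)       # local RNG, global state untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed keeps the split identical between runs, so that evaluation results for different models remain comparable.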

With this method, the file information of the executable file is obtained, wherein the file information comprises multiple types of characteristic information of the executable file, so that the executable file can be accurately described by the file information. The characteristic information in the executable file is then completely converted into a two-dimensional matrix through a preset algorithm, and the two-dimensional matrix is further converted into a multichannel RGB image for visual display. Each characteristic of the RGB image is predicted through a preset model, which solves the problem of incomplete characteristic extraction, and finally whether malicious code exists in the executable file can be accurately judged according to the prediction result.

The above describes embodiments of the malicious code detection method provided by the present application. Other embodiments of malicious code detection provided by the present application are described below.

The embodiment of the invention also discloses a malicious code detection device, as shown in fig. 7, the device comprises:

an obtaining module 701, configured to obtain file information of an executable file;

a first conversion module 702, configured to convert the file information into a two-dimensional matrix through a preset algorithm;

a second conversion module 703, configured to convert the two-dimensional matrix into an RGB image;

the prediction module 704 is configured to predict the RGB image by using a preset model to obtain a prediction result;

a determining module 705, configured to determine whether the executable file includes malicious code according to the prediction result.

In an optional embodiment, the prediction result includes a category of each first feature in the RGB image, and the determining module specifically includes:

In an alternative embodiment, the first conversion module is specifically configured to:

obtaining a G channel data matrix according to the code information;

In an optional embodiment, the first conversion module is further specifically configured to:

extracting first feature information from the code information;

In an optional embodiment, the preset model includes a first preset model and a second preset model, and the prediction module is specifically configured to:

In an alternative embodiment, the categories include a first category and a second category, and when the determination module determines that the executable file includes malicious code, the apparatus is further configured to:

The functions executed by each component in the malicious code detection device provided by the embodiment of the present invention have been described in detail in any of the above method embodiments, and therefore, are not described herein again.

With this device, the file information of the executable file is obtained, wherein the file information comprises multiple types of characteristic information of the executable file, so that the executable file can be accurately described by the file information. The characteristic information in the executable file is then completely converted into a two-dimensional matrix through a preset algorithm, and the two-dimensional matrix is further converted into a multichannel RGB image for visual display. Each characteristic of the RGB image is predicted through a preset model, which solves the problem of incomplete characteristic extraction, and finally whether malicious code exists in the executable file can be accurately judged according to the prediction result.

An embodiment of the present invention further provides a computer device. As shown in fig. 8, the computer device may include a processor 801 and a memory 802, where the processor 801 and the memory 802 may be connected through a bus or in another manner; fig. 8 takes connection through a bus as an example.

The processor 801 may be a Central Processing Unit (CPU). The processor 801 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.

The memory 802, as a non-transitory computer-readable storage medium, may be used for storing non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the malicious code detection method in the embodiments of the present invention. The processor 801 executes various functional applications and data processing by running the non-transitory software programs, instructions and modules stored in the memory 802, so as to implement the malicious code detection method in the above method embodiments.

The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 801, and the like. Further, the memory 802 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the processor 801 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more modules are stored in the memory 802 and, when executed by the processor 801, perform the malicious code detection method of the embodiment shown in fig. 2.

The details of the computer device may be understood with reference to the corresponding related description and effects in the embodiment shown in fig. 2, and are not described herein again.

It will be understood by those skilled in the art that all or part of the processes of the method embodiments described above can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kinds described above.

Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A malicious code detection method, comprising:

acquiring file information of an executable file;

converting the file information into a two-dimensional matrix through a preset algorithm;

converting the two-dimensional matrix into an RGB image;

predicting the RGB image by using a preset model to obtain a prediction result;

and determining whether the executable file comprises malicious code according to the prediction result.
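The steps of claim 1 can be sketched as a short pipeline. This is only an illustrative sketch under stated assumptions, not the patented implementation: the raw bytes of the file stand in for the file information, the "preset algorithm" is assumed to be a row-by-row layout with zero padding, the RGB conversion is assumed to group each row into consecutive (R, G, B) triples, and `predict` is a hypothetical stand-in for the preset model. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Matrix = List[List[int]]
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def bytes_to_matrix(data: bytes, width: int) -> Matrix:
    """Lay the bytes out row by row, zero-padding the final row (assumed layout)."""
    if width <= 0 or width % 3 != 0:
        raise ValueError("width must be a positive multiple of 3")
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def matrix_to_rgb(matrix: Matrix) -> Image:
    """Group each row into consecutive (R, G, B) triples (assumed encoding)."""
    return [
        [(row[i], row[i + 1], row[i + 2]) for i in range(0, len(row), 3)]
        for row in matrix
    ]


def detect(data: bytes, predict: Callable[[Image], bool], width: int = 6) -> bool:
    """Run the claimed steps in order; `predict` stands in for the preset model."""
    image = matrix_to_rgb(bytes_to_matrix(data, width))
    return predict(image)
```

In this sketch the model is passed in as a callable so that the conversion steps can be exercised independently of any particular trained model.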

2. The method as claimed in claim 1, wherein the prediction result includes a category of each first feature in the RGB image, and the determining whether the executable file includes malicious code according to the prediction result includes:

determining the probability of the category by using a preset algorithm;

and determining that the executable file comprises malicious code when the probability meets a preset condition.
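One possible reading of the decision step in claim 2 is sketched below. The claim does not name the algorithm or the condition, so both are assumptions here: softmax is used as the probability algorithm, and the preset condition is taken to be "at least one feature has a malicious-class probability at or above a threshold". The names `softmax`, `is_malicious`, `threshold` and `malicious_index` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_malicious(
    feature_scores: Sequence[Sequence[float]],
    threshold: float = 0.5,
    malicious_index: int = 1,
) -> bool:
    """Flag the file if any feature's malicious probability reaches the threshold.

    Each entry of `feature_scores` holds the raw class scores for one feature;
    the threshold rule is an assumed form of the preset condition.
    """
    return any(
        softmax(scores)[malicious_index] >= threshold
        for scores in feature_scores
    )
```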

3. The method according to claim 1, wherein the converting the file information into a two-dimensional matrix through a preset algorithm specifically includes:

obtaining a G channel data matrix according to the code information;

and obtaining the two-dimensional matrix according to the R channel data matrix, the G channel data matrix and the B channel data matrix.
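As a minimal sketch of the last step of claim 3, the three channel matrices could be combined by interleaving their values along each row. The claim does not state how the matrices are combined, so the interleaved layout and the name `merge_channels` are assumptions made only for illustration.

```python
from typing import List

Matrix = List[List[int]]


def merge_channels(r: Matrix, g: Matrix, b: Matrix) -> Matrix:
    """Interleave three equally shaped channel matrices into one 2-D matrix.

    Each output row reads r0, g0, b0, r1, g1, b1, ... (assumed layout).
    """
    if not (len(r) == len(g) == len(b)):
        raise ValueError("channel matrices must have the same number of rows")
    merged: Matrix = []
    for r_row, g_row, b_row in zip(r, g, b):
        if not (len(r_row) == len(g_row) == len(b_row)):
            raise ValueError("channel rows must have the same length")
        row: List[int] = []
        for triple in zip(r_row, g_row, b_row):
            row.extend(triple)
        merged.append(row)
    return merged
```

With this layout the result stays a plain two-dimensional matrix, and every group of three consecutive values in a row corresponds to one pixel of the later RGB image.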

4. The method according to claim 3, wherein obtaining the G-channel data matrix according to the code information specifically includes:

extracting first feature information from the code information;

and determining the G channel data matrix according to the first feature information.

5. The method according to any one of claims 1 to 4, wherein the preset model comprises a first preset model and a second preset model,

and the predicting the RGB image by using a preset model to obtain a prediction result specifically comprises:

performing feature extraction on the RGB image by using the first preset model to obtain at least one piece of second feature information;

and identifying each piece of second feature information by using the second preset model to obtain a category corresponding to each piece of second feature information.
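The two-stage prediction of claim 5 can be sketched as an extractor followed by a classifier. The claim does not identify either model, so both are represented by callables, and the two stand-ins below (a per-row channel mean and a fixed green-channel threshold) are hypothetical placeholders chosen only to make the sketch runnable; they are not the models of the invention.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Feature = List[float]


def predict(
    image: Image,
    extract: Callable[[Image], List[Feature]],
    classify: Callable[[Feature], str],
) -> List[str]:
    """Stage 1 extracts features; stage 2 assigns one category per feature."""
    return [classify(feature) for feature in extract(image)]


def row_mean_extractor(image: Image) -> List[Feature]:
    """Stand-in first model: one feature per non-empty row (mean of each channel)."""
    features: List[Feature] = []
    for row in image:
        if not row:
            continue
        features.append([sum(p[c] for p in row) / len(row) for c in range(3)])
    return features


def threshold_classifier(feature: Feature) -> str:
    """Stand-in second model: label by the mean green value (arbitrary rule)."""
    return "abnormal" if feature[1] > 128 else "normal"
```

Keeping the two stages as separate callables mirrors the split between the first and second preset models, so either stage can be replaced without touching the other.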

6. The method of claim 2, wherein the categories include a first category and a second category, and wherein when the executable file is determined to include malicious code according to the prediction result, the method further comprises:

and determining the position of the malicious code in the executable file according to the second feature information corresponding to the second category, wherein the first category is used for indicating that the second feature information is normal feature information, and the second category is used for indicating that the second feature information is abnormal feature information.
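The localization step of claim 6 could look like the following sketch. It assumes, purely for illustration, that the image was built by laying the file bytes out row by row with a fixed number of bytes per image row, and that each classified feature is reported as a tuple of category, first row and last row. The tuple shape, the category label `"abnormal"` and the name `locate` are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (category, first_row, last_row), rows inclusive


def locate(
    regions: List[Region],
    row_bytes: int,
    file_size: int,
) -> List[Tuple[int, int]]:
    """Map abnormal image regions back to [start, end) byte ranges in the file.

    Assumes a row-major byte layout with `row_bytes` bytes per image row;
    ranges are clipped to `file_size` so padding is never reported.
    """
    ranges: List[Tuple[int, int]] = []
    for category, first_row, last_row in regions:
        if category != "abnormal":
            continue
        start = first_row * row_bytes
        end = min((last_row + 1) * row_bytes, file_size)
        if start < end:
            ranges.append((start, end))
    return ranges
```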

7. An apparatus for malicious code detection, the apparatus comprising:

a determining module, configured to determine whether the executable file includes malicious code according to the prediction result.

8. The apparatus according to claim 7, wherein the prediction result includes a category of each first feature in the RGB image, and the determining module specifically includes:

the probability determining module is used for determining the probability of the category by using a preset algorithm;

and the judging module is used for determining that the executable file comprises malicious code when the probability meets a preset condition.

9. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the malicious code detection method as claimed in any one of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the malicious code detection method according to any one of claims 1 to 6.