CN112966267A

CN112966267A - Malicious file detection method and system based on machine learning

Info

Publication number: CN112966267A
Application number: CN202110231625.9A
Authority: CN
Inventors: 王卓超; 于金龙; 王智民; 王高杰
Original assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Current assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2021-06-15

Abstract

The invention provides a malicious file detection method and system based on machine learning, belonging to the technical field of information security. The method comprises the following steps: identifying the file type of a file to be detected; extracting the characteristics of the file to be detected; and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation, to obtain a classification result of the file to be detected. With this method, a malicious file can still be identified even when the file has been mutated or obfuscated, few resources are occupied, and the detection result can be obtained quickly.

Description

Malicious file detection method and system based on machine learning

Technical Field

The invention relates to the technical field of information security, in particular to a malicious file detection method based on machine learning and a malicious file detection system based on machine learning.

Background

Malicious file detection is an important subject in the field of network security. In recent years, the number of malicious files has increased exponentially, and traditional processing methods cannot process and identify such massive volumes of data in a timely and effective manner.

Traditional malicious file detection technologies are mainly of two kinds. One is detection and analysis based on static characteristics; for example, a detection method based on application programming interface sequences extracts the programming interface characteristics of a file and builds a characteristic library for detecting malicious files. Although this type of detection technique analyzes software code relatively quickly, malicious files can counter it by code obfuscation, mutation, and the like. Static detection therefore suffers from a high false alarm rate, an inability to identify obfuscated and variant files, and ease of bypass.

The other is detection based on dynamic characteristics, for example sandbox-based malicious file detection, in which whether a file is malicious is determined by observing the behavior characteristics of the file in a sandbox environment. Although this approach can address the insufficient accuracy of static detection, it consumes a large amount of resources and takes more time. Dynamic detection therefore suffers from low detection efficiency and heavy resource consumption.

Disclosure of Invention

The embodiments of the invention aim to provide a malicious file detection method and system based on machine learning that can still identify a malicious file when the file has been mutated or obfuscated, occupy few resources, and obtain the detection result quickly.

In order to achieve the above object, a first aspect of the present invention provides a malicious file detection method based on machine learning, including:

identifying the file type of a file to be detected;

extracting the characteristics of the file to be detected;

and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.

Further, the identifying the file type of the file to be tested includes:

acquiring file header data of a file to be detected;

and identifying the file type of the file to be detected according to the file header data. The header data indicates the actual usage of the file and is not easily modified manually, so the file type determined from the header data is more accurate.

Optionally, the features include: the statistical characteristics of the entropy sequence of the file to be tested, the character proportion of each character in the file to be tested and the number of https fields in the file to be tested.

Further, the extracting the features of the file to be tested includes:

converting the file to be tested into binary data;

dividing the binary data into data blocks with preset lengths;

calculating the information entropy of each data block to obtain an entropy sequence of the file to be detected;

calculating the statistical characteristics of the entropy sequence;

calculating the character ratio of each character in the file to be tested;

and calculating the number of the https fields in the file to be tested. By dividing the file to be detected into a plurality of blocks, calculating the information entropy of each data block and then counting the information entropy of all the data blocks of the whole file, shorter malicious segments in the file to be detected can be effectively detected, and the accuracy is higher.
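The benefit of block-wise entropy described above can be illustrated with a small sketch. It assumes Shannon entropy with a base-2 logarithm over byte values and a 256-byte block length; the function name `shannon_entropy` and the synthetic sample are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (the base-2 logarithm is an assumption)."""
    total = len(data)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(data).values())


# Synthetic file: 15 blocks of one repeated byte, followed by a single 256-byte
# block in which every byte value occurs once. The last block stands in for a
# short compressed or encrypted segment.
sample = b"A" * (256 * 15) + bytes(range(256))

whole_file_entropy = shannon_entropy(sample)
block_entropies = [shannon_entropy(sample[i:i + 256]) for i in range(0, len(sample), 256)]

print(f"whole file: {whole_file_entropy:.3f} bits/byte")
print(f"max block:  {max(block_entropies):.3f} bits/byte")
```

In this sample the whole-file entropy stays below 1 bit per byte, while the final block reaches the 8-bit maximum, so the short segment is visible only in the per-block sequence.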

Optionally, the statistical characteristics include: mean, variance, maximum, and minimum.

Optionally, the training process of the trained classifier includes:

collecting a certain number of training data files;

identifying a file type of a training data file;

classifying the training data files according to the file types;

for each class of training data file:

analyzing and determining malicious files and non-malicious files in the training data files, and marking;

extracting the characteristics of the marked training data files of the type;

inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.

Optionally, the classifier includes: a GBDT classifier, a random forest classifier, and an SVM classifier.

A second aspect of the present invention provides a malicious file detection system based on machine learning, the system including:

the file type identification unit is used for identifying the file type of the file to be detected;

the characteristic extraction unit is used for extracting the characteristics of the file to be detected; and

the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected. The system has a simple structure, can still identify malicious files when the files have been mutated or obfuscated, occupies few resources, and can obtain a detection result quickly.

Further, the feature extraction unit includes:

the file conversion module is used for converting the file to be tested into binary data;

the data dividing module is used for dividing the binary data into data blocks with preset lengths;

the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;

the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;

and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.
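The three units of the system can be sketched as one small object that routes each file to the classifier trained for its type. This is a minimal sketch under stated assumptions, not the implementation of the invention: the class names `MaliciousFileDetector` and `HttpsCountRule`, the `predict` interface, and the toy wiring at the end are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Classifier(Protocol):
    """Anything with a predict method that returns one label per feature vector."""

    def predict(self, X: List[List[float]]) -> List[int]: ...


@dataclass
class MaliciousFileDetector:
    identify_type: Callable[[bytes], str]             # file type identification unit
    extract_features: Callable[[bytes], List[float]]  # feature extraction unit
    classifiers: Dict[str, Classifier]                # one trained classifier per file type

    def detect(self, data: bytes) -> int:
        """File classification unit: route the features to the matching classifier."""
        file_type = self.identify_type(data)
        if file_type not in self.classifiers:
            raise KeyError(f"no trained classifier for file type {file_type!r}")
        features = self.extract_features(data)
        return int(self.classifiers[file_type].predict([features])[0])


# Toy wiring with hypothetical stand-ins for the three units.
class HttpsCountRule:
    """Stand-in classifier: flags a file when its single feature exceeds 3."""

    def predict(self, X: List[List[float]]) -> List[int]:
        return [1 if row[0] > 3 else 0 for row in X]


detector = MaliciousFileDetector(
    identify_type=lambda data: "pe" if data.startswith(b"MZ") else "other",
    extract_features=lambda data: [float(data.count(b"https"))],
    classifiers={"pe": HttpsCountRule()},
)
print(detector.detect(b"MZ" + b"https://example.invalid " * 5))
```

Keeping the units as injected callables means a type identifier, a feature extractor, or a classifier can be swapped without touching the dispatch logic.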

In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the method for machine learning-based malicious file detection described herein.

With this technical scheme, a malicious file can still be identified even when the file has been mutated or obfuscated, few resources are occupied, and the detection result can be obtained quickly.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a flowchart of a malicious file detection method based on machine learning according to an embodiment of the present invention;

FIG. 2 is a block diagram of a malicious file detection system based on machine learning according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

Fig. 1 is a flowchart of a malicious file detection method based on machine learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:

identifying the file type of a file to be detected;

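extracting the characteristics of the file to be detected;

and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.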

File extensions are typically used to identify file types, but an extension can easily be modified manually. All files are stored in binary form, and the beginning of each file contains a field that indicates the actual usage of the file: the file header flag, which is usually expressed in hexadecimal. The correct file type can be determined by reading the header data. For example, a file beginning with "4D 5A 90" is a dll file.

In some embodiments of the present invention, the identifying the file type of the file to be tested includes:

acquiring file header data of a file to be detected;

and identifying the file type of the file to be detected according to the file header data.

The types identified from header data are very fine-grained; pictures alone include png, JPEG, JPG, gif, bmp, etc. To reduce the number of classifiers appropriately, some embodiments group the specific file types and merge similar ones; for example, png, JPEG, JPG, gif, and bmp are all classified as the picture class. In these embodiments, the same file type may include at least one header data value; after the header data of a file is read, it is compared with the header data values belonging to each type to determine the file type of the file.
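Header-based identification with merged classes can be sketched as a lookup over known signatures. The table below is a hypothetical, partial list of well-known file signatures, and the class names ("picture", "executable", "pdf") and the function name `identify_file_class` are assumptions made for illustration.

```python
# Hypothetical signature table: header bytes -> merged file class.
MAGIC_SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "picture"),  # png
    (bytes.fromhex("FFD8FF"), "picture"),            # JPEG / JPG
    (b"GIF87a", "picture"),                          # gif
    (b"GIF89a", "picture"),                          # gif
    (b"BM", "picture"),                              # bmp
    (bytes.fromhex("4D5A"), "executable"),           # "MZ" header, e.g. "4D 5A 90"
    (b"%PDF-", "pdf"),
]


def identify_file_class(data: bytes, default: str = "unknown") -> str:
    """Return the merged file class whose signature matches the start of data."""
    # Try longer signatures first so a short one cannot shadow a longer one.
    for signature, file_class in sorted(MAGIC_SIGNATURES, key=lambda s: -len(s[0])):
        if data.startswith(signature):
            return file_class
    return default


print(identify_file_class(bytes.fromhex("4D5A90000300")))
print(identify_file_class(b"GIF89a" + b"\x00" * 10))
```

Several signatures map to the same class, which is how one classifier can serve a whole group of similar formats.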

Entropy is a concept used to describe the uncertainty of an information source, i.e., the degree of disorder of the elements in a data set. If one element occurs with very high frequency in the data set, the entropy tends to 0; if all elements occur with the same or similar frequency, the entropy is at its maximum. Reverse engineers pay attention to entropy because some malicious code is compressed or encrypted, and its entropy is therefore usually high. Thus, to identify potential encryption constants or keys, and even the encrypted content itself, a local entropy calculation is performed on the file to be tested.
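As a worked illustration of the two extremes, assume Shannon entropy with a base-2 logarithm computed over the byte values of one 256-byte block (the base and the block length are assumptions made for this example):

```latex
% Block in which a single byte value fills all 256 positions:
H = -1 \cdot \log_2 1 = 0

% Block in which each of the 256 byte values occurs exactly once:
H = -\sum_{i=1}^{256} \frac{1}{256} \log_2 \frac{1}{256} = \log_2 256 = 8 \text{ bits}
```

Under these assumptions every block scores between 0 and 8, and compressed or encrypted content tends toward the upper end.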

In some embodiments of the present invention, the extracting the feature of the file to be tested includes:

converting the file to be tested into binary data;

dividing the binary data into data blocks of a preset length, wherein in one embodiment of the invention the preset length is 256 bytes, and a data block shorter than 256 bytes is padded with 0 at the end;

calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected, wherein the information entropy is calculated by the following formula:

$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$

wherein X represents the data sequence consisting of the binary data in each data block, x represents each binary data in the data sequence, and p(x) represents the probability of x in X;

calculating the statistical characteristics of the entropy sequence;

calculating the character proportion of each character in the file to be tested, wherein the character proportion is the number of each character in the file to be tested divided by the total number of the characters in the file to be tested;

and calculating the number of the https fields in the file to be tested.
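The feature extraction steps above can be sketched as follows. Several points are assumptions rather than statements from the text: the logarithm is taken to base 2, a "character" is treated as one byte value (0 to 255), an "https field" is counted as an occurrence of the literal byte string `https`, and the variance is the population variance. All function names are hypothetical.

```python
import math
from collections import Counter

BLOCK_SIZE = 256  # preset block length used in the described embodiment


def block_entropy(block: bytes) -> float:
    """Shannon entropy of one block in bits per byte (base 2 is an assumption)."""
    total = len(block)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(block).values())


def entropy_sequence(data: bytes, block_size: int = BLOCK_SIZE) -> list[float]:
    """Split data into fixed-size blocks, zero-pad the last one, score each block."""
    sequence = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size].ljust(block_size, b"\x00")
        sequence.append(block_entropy(block))
    return sequence


def entropy_statistics(sequence: list[float]) -> list[float]:
    """Mean, population variance, maximum and minimum of the entropy sequence."""
    if not sequence:
        return [0.0, 0.0, 0.0, 0.0]
    mean = sum(sequence) / len(sequence)
    variance = sum((h - mean) ** 2 for h in sequence) / len(sequence)
    return [mean, variance, max(sequence), min(sequence)]


def character_ratios(data: bytes) -> list[float]:
    """Share of the file taken up by each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for an empty file
    return [counts.get(value, 0) / total for value in range(256)]


def count_https_fields(data: bytes) -> int:
    """Number of non-overlapping occurrences of the byte string b"https"."""
    return data.count(b"https")


def extract_features(data: bytes) -> list[float]:
    """Concatenate all features into one vector of 4 + 256 + 1 values."""
    return (entropy_statistics(entropy_sequence(data))
            + character_ratios(data)
            + [float(count_https_fields(data))])
```

A fixed-length vector of 261 values is convenient because every file, whatever its size, then presents the same input shape to the classifier.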

Optionally, the training process of the trained classifier includes:

collecting a certain number of training data files;

identifying a file type of a training data file;

classifying the training data files according to the file types;

for each class of training data file:

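analyzing and determining malicious files and non-malicious files in the training data files, and marking;

extracting the characteristics of the marked training data files of the type;

inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.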

In a specific embodiment of the present invention, the classifier is a GBDT classifier; after the malicious files and non-malicious files in the training data files are analyzed and determined, the non-malicious files are marked as 0 and the malicious files are marked as 1. The GBDT classifier can discover a variety of distinguishing features and feature combinations.
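The per-type training procedure can be sketched as grouping the labelled samples by file type and fitting one classifier per group. To keep the sketch free of dependencies, a trivial stand-in classifier is used here instead of a GBDT; this substitution is an assumption made for illustration. Any object with `fit` and `predict` methods could be passed as `make_classifier`, for example scikit-learn's `GradientBoostingClassifier`. The names `train_per_type` and `MeanThresholdClassifier` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# One labelled sample: (file type, feature vector, label),
# with 0 = non-malicious and 1 = malicious as in the embodiment above.
Sample = Tuple[str, List[float], int]


class MeanThresholdClassifier:
    """Toy stand-in, NOT a GBDT: thresholds the first feature only.

    It assumes both labels are present and that malicious samples
    have the larger first feature.
    """

    def fit(self, X: List[List[float]], y: List[int]) -> "MeanThresholdClassifier":
        pos = [row[0] for row, label in zip(X, y) if label == 1]
        neg = [row[0] for row, label in zip(X, y) if label == 0]
        # Threshold is the midpoint between the two class means.
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X: List[List[float]]) -> List[int]:
        return [1 if row[0] > self.threshold else 0 for row in X]


def train_per_type(samples: List[Sample],
                   make_classifier: Callable[[], object] = MeanThresholdClassifier) -> Dict[str, object]:
    """Group samples by file type and fit one classifier for each group."""
    grouped = defaultdict(lambda: ([], []))
    for file_type, features, label in samples:
        grouped[file_type][0].append(features)
        grouped[file_type][1].append(label)
    return {file_type: make_classifier().fit(X, y) for file_type, (X, y) in grouped.items()}


training_samples = [
    ("pe", [7.5], 1), ("pe", [2.0], 0),
    ("pdf", [6.0], 1), ("pdf", [1.0], 0),
]
models = train_per_type(training_samples)
print(sorted(models))
```

The returned dictionary is keyed by file type, so at detection time the identified type selects the classifier directly.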

Fig. 2 is a block diagram of a malicious file detection system based on machine learning according to an embodiment of the present invention. As shown in fig. 2, the system includes:

the file type identification unit is used for identifying the file type of the file to be detected;

the characteristic extraction unit is used for extracting the characteristics of the file to be detected; and

the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected. The system has a simple structure, can still identify malicious files when the files have been mutated or obfuscated, occupies few resources, and can obtain a detection result quickly. The detection system can be deployed in various environments, such as a personal host computer or a detection platform.

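Further, the feature extraction unit includes:

the file conversion module is used for converting the file to be tested into binary data;

the data dividing module is used for dividing the binary data into data blocks with preset lengths;

the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;

the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;

and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.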

The embodiment of the invention also provides a machine-readable storage medium, which stores instructions for enabling a machine to execute the malicious file detection method based on machine learning.

Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions to enable a single chip, a chip, or a processor (processor) to execute all or part of the steps in the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.

In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.

Claims

1. A malicious file detection method based on machine learning, which is characterized by comprising the following steps:

identifying the file type of a file to be detected;

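extracting the characteristics of the file to be detected;

and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.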

2. The machine learning-based malicious file detection method according to claim 1, wherein the identifying the file type of the file to be detected comprises:

acquiring file header data of a file to be detected;

and identifying the file type of the file to be detected according to the file header data.

3. The machine-learning based malicious file detection method according to claim 1, wherein the features comprise: the statistical characteristics of the entropy sequence of the file to be tested, the character proportion of each character in the file to be tested and the number of https fields in the file to be tested.

4. The machine learning-based malicious file detection method according to claim 3, wherein the extracting the feature of the file to be detected comprises:

converting the file to be tested into binary data;

dividing the binary data into data blocks with preset lengths;

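calculating the information entropy of each data block to obtain an entropy sequence of the file to be detected;

calculating the statistical characteristics of the entropy sequence;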

calculating the character ratio of each character in the file to be tested;

and calculating the number of the https fields in the file to be tested.

5. The machine-learning based malicious file detection method according to claim 4, wherein the statistical features comprise: mean, variance, maximum, and minimum.

6. The machine-learning-based malicious file detection method according to claim 1, wherein the training process of the trained classifier comprises:

collecting a certain number of training data files;

identifying a file type of a training data file;

classifying the training data files according to the file types;

for each class of training data file:

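analyzing and determining malicious files and non-malicious files in the training data files, and marking;

extracting the characteristics of the marked training data files of the type;

inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.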

7. The machine-learning based malicious file detection method according to claim 1, wherein the classifier comprises: a GBDT classifier, a random forest classifier, and an SVM classifier.

8. A machine learning based malicious file detection system, the system comprising:

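the file type identification unit is used for identifying the file type of the file to be detected;

the characteristic extraction unit is used for extracting the characteristics of the file to be detected;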

and the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.

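9. The machine-learning based malicious file detection system according to claim 8, wherein the feature extraction unit includes:

the file conversion module is used for converting the file to be tested into binary data;

the data dividing module is used for dividing the binary data into data blocks with preset lengths;

the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;

the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;

and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.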

10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the method for machine learning based malicious file detection as claimed in any of claims 1-7 of the present application.