CN112966267A - Malicious file detection method and system based on machine learning - Google Patents

Malicious file detection method and system based on machine learning

Info

Publication number
CN112966267A
Authority
CN
China
Prior art keywords
file
detected
tested
data
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110231625.9A
Other languages
Chinese (zh)
Inventor
王卓超
于金龙
王智民
王高杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Original Assignee
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing 6Cloud Technology Co Ltd, Beijing 6Cloud Information Technology Co Ltd filed Critical Beijing 6Cloud Technology Co Ltd
Priority to CN202110231625.9A priority Critical patent/CN112966267A/en
Publication of CN112966267A publication Critical patent/CN112966267A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Virology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a malicious file detection method and system based on machine learning, and belongs to the technical field of information security. The method comprises the following steps: identifying the file type of a file to be detected; extracting the features of the file to be detected; and inputting the features of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected. With this method, a malicious file can still be identified after it has been mutated or obfuscated, few resources are occupied, and the detection result is obtained quickly.

Description

Malicious file detection method and system based on machine learning
Technical Field
The invention relates to the technical field of information security, in particular to a malicious file detection method based on machine learning and a malicious file detection system based on machine learning.
Background
Malicious file detection is an important subject in the field of network security. In recent years the number of malicious files has grown exponentially, and traditional processing approaches cannot process and identify such massive volumes of data in a timely and effective manner.
Traditional malicious file detection technologies fall mainly into two categories. The first is detection and analysis based on static features. For example, a detection method based on application programming interface (API) sequences extracts the API features of a file and builds a feature library for detecting malicious files. Although this type of technique analyzes software code relatively quickly, malicious files can evade it through code obfuscation, mutation, and similar means. Static detection therefore suffers from a high false alarm rate, an inability to identify obfuscated or variant files, and ease of bypass.
The second is detection based on dynamic features, for example sandbox-based malicious file detection, which determines whether a file is malicious by observing its behavior in a sandbox environment. Although this approach addresses the insufficient accuracy of static detection, it consumes a large amount of resources and takes more time. Dynamic detection therefore suffers from low detection efficiency and heavy resource consumption.
Disclosure of Invention
The embodiments of the invention aim to provide a malicious file detection method and system based on machine learning that can still identify a malicious file after it has been mutated or obfuscated, that occupy few resources, and that obtain a detection result quickly.
In order to achieve the above object, a first aspect of the present invention provides a malicious file detection method based on machine learning, including:
identifying the file type of a file to be detected;
extracting the characteristics of the file to be detected;
and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.
Further, the identifying the file type of the file to be tested includes:
acquiring file header data of a file to be detected;
and identifying the file type of the file to be detected according to the file header data. The file header data indicates the actual purpose of the file and is not easily altered by hand, so a file type determined from the header data is more accurate.
Optionally, the features include: the statistical characteristics of the entropy sequence of the file to be tested, the character proportion of each character in the file to be tested and the number of https fields in the file to be tested.
Further, the extracting the features of the file to be tested includes:
converting the file to be tested into binary data;
dividing the binary data into data blocks with preset lengths;
calculating the information entropy of each data block to obtain an entropy sequence of the file to be detected;
calculating the statistical characteristics of the entropy sequence;
calculating the character ratio of each character in the file to be tested;
and calculating the number of the https fields in the file to be tested. Dividing the file to be detected into a plurality of blocks, calculating the information entropy of each data block, and then computing statistics over the entropy values of all data blocks of the file makes it possible to detect short malicious segments in the file effectively and with higher accuracy.
Optionally, the statistical characteristics include: mean, variance, maximum, and minimum.
Optionally, the training process of the trained classifier includes:
collecting a certain number of training data files;
identifying a file type of a training data file;
classifying the training data files according to the file types;
for each class of training data file:
analyzing and determining malicious files and non-malicious files in the training data files, and marking;
extracting the characteristics of the marked training data files of the type;
inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.
Optionally, the classifier may be a GBDT classifier, a random forest classifier, or an SVM classifier.
A second aspect of the present invention provides a malicious file detection system based on machine learning, the system including:
the file type identification unit is used for identifying the file type of the file to be detected;
the characteristic extraction unit is used for extracting the characteristics of the file to be detected; and
and the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected. The system has a simple structure, can still identify a malicious file after it has been mutated or obfuscated, occupies few resources, and obtains a detection result quickly.
Further, the feature extraction unit includes:
the file conversion module is used for converting the file to be tested into binary data;
the data dividing module is used for dividing the binary data into data blocks with preset lengths;
the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;
the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;
and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.
In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the method for machine learning-based malicious file detection described herein.
With the above technical solution, a malicious file can still be identified after it has been mutated or obfuscated, few resources are occupied, and the detection result is obtained quickly.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flowchart of a malicious file detection method based on machine learning according to an embodiment of the present invention;
FIG. 2 is a block diagram of a malicious file detection system based on machine learning according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of a malicious file detection method based on machine learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:
identifying the file type of a file to be detected;
extracting the characteristics of the file to be detected;
and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.
File extensions are typically used to identify file types, but an extension can easily be changed by a user. All files are stored in binary form, and the beginning of each file contains a field that indicates the actual purpose of the file. This field is the file header signature, which is usually expressed in hexadecimal. The correct file type can be determined by reading the header data. For example, a file beginning with "4D 5A 90" is typically a Windows PE file, such as a dll or exe file; the first two bytes, 4D 5A, are the ASCII characters "MZ".
In some embodiments of the present invention, the identifying the file type of the file to be tested includes:
acquiring file header data of a file to be detected;
and identifying the file type of the file to be detected according to the file header data. The file header data indicates the actual purpose of the file and is not easily altered by hand, so a file type determined from the header data is more accurate.
The types identified from header data are very fine-grained; pictures alone include png, jpeg, jpg, gif, bmp, and so on. To keep the number of classifiers manageable, some embodiments group the specific file types and merge similar ones; for example, png, jpeg, jpg, gif, and bmp are all assigned to a picture class. In these embodiments, one file type may correspond to more than one header signature. After the header data of a file is read, it is compared with the header signatures belonging to each type to determine the file type of the file.
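The header comparison with merged categories can be sketched as follows. This is a minimal illustration, not the patented implementation: the table `SIGNATURES`, the category names, and the function `identify_file_type` are hypothetical, and the table lists only a few widely documented signatures.

```python
# Hypothetical signature table: merged category -> known leading byte patterns.
# Category names and the choice of formats are assumptions for illustration.
SIGNATURES = {
    "picture": [
        b"\x89PNG\r\n\x1a\n",  # png
        b"\xff\xd8\xff",       # jpeg / jpg
        b"GIF87a",             # gif (1987 variant)
        b"GIF89a",             # gif (1989 variant)
        b"BM",                 # bmp
    ],
    "pe": [b"MZ"],             # Windows PE (exe, dll): bytes 4D 5A
    "pdf": [b"%PDF-"],
}


def identify_file_type(data: bytes) -> str:
    """Return the merged category whose signature the data starts with."""
    for category, magics in SIGNATURES.items():
        for magic in magics:
            if data.startswith(magic):
                return category
    return "unknown"  # no signature matched
```

Because several signatures map to one category, a single picture classifier can serve png, jpeg, gif, and bmp files alike.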
Optionally, the features include: the statistical characteristics of the entropy sequence of the file to be tested, the character proportion of each character in the file to be tested and the number of https fields in the file to be tested.
Entropy describes the uncertainty of an information source, that is, the degree of disorder of the elements in a data set. If one element occurs with very high frequency in the data set, the entropy tends to 0; if all elements occur with the same or similar frequency, the entropy is at its maximum. Reverse engineers pay attention to entropy because some malicious code contains compressed or encrypted information, whose entropy is usually high. Therefore, to identify potential encryption constants or keys, or even the encrypted content itself, a local entropy calculation is performed on the file to be tested.
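The two extremes described above can be checked with a short sketch. It assumes that the elements are byte values and that the logarithm is taken to base 2, so the result is in bits per byte; the function name `shannon_entropy` is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block in bits per byte (base-2 log assumed)."""
    if not block:
        return 0.0
    n = len(block)
    # p * log2(1/p) equals -p * log2(p); p = c / n for each distinct byte value.
    return sum((c / n) * math.log2(n / c) for c in Counter(block).values())


low = shannon_entropy(bytes(256))          # one value repeated: entropy 0
high = shannon_entropy(bytes(range(256)))  # all 256 values once: entropy 8
```

A block of identical bytes gives 0 and a block in which every byte value occurs equally often gives 8, the maximum for byte data. Compressed or encrypted regions tend toward the upper end of this range.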
In some embodiments of the present invention, the extracting the feature of the file to be tested includes:
converting the file to be tested into binary data;
dividing the binary data into data blocks with preset length, wherein in one embodiment of the invention the preset length is 256 bytes, and a data block shorter than 256 bytes is padded with zeros at the end;
calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected, wherein the information entropy is calculated by the following formula:
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
wherein X represents the data sequence consisting of the binary data in a data block, x represents each binary datum in the data sequence, and p(x) represents the probability of x in X;
calculating the statistical characteristics of the entropy sequence;
calculating the character proportion of each character in the file to be tested, wherein the character proportion is the number of each character in the file to be tested divided by the total number of the characters in the file to be tested;
and calculating the number of the https fields in the file to be tested. Dividing the file to be detected into a plurality of blocks, calculating the information entropy of each data block, and then computing statistics over the entropy values of all data blocks of the file makes it possible to detect short malicious segments in the file effectively and with higher accuracy.
Optionally, the statistical characteristics include: mean, variance, maximum, and minimum.
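The extraction steps above can be sketched as one function that returns a fixed-length feature vector. Several details are assumptions made for the sketch and are not stated in the text: a "character" is taken to be a byte value (giving 256 ratios), the logarithm is base 2, the variance is the population variance, and an https field is counted as an occurrence of the byte string `https`. All function names are hypothetical.

```python
import math
import statistics
from collections import Counter

BLOCK_SIZE = 256  # preset block length from the embodiment


def block_entropy(block: bytes) -> float:
    """Entropy of one block in bits per byte (base-2 log assumed)."""
    n = len(block)
    return sum((c / n) * math.log2(n / c) for c in Counter(block).values())


def entropy_sequence(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and compute one entropy per block."""
    seq = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        block = block.ljust(block_size, b"\x00")  # zero-pad a short final block
        seq.append(block_entropy(block))
    return seq


def extract_features(data: bytes) -> list:
    """Return [mean, variance, max, min] + 256 byte ratios + https count."""
    seq = entropy_sequence(data) or [0.0]  # empty file: single zero entry
    stats = [statistics.fmean(seq), statistics.pvariance(seq), max(seq), min(seq)]
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for an empty file
    ratios = [counts.get(value, 0) / total for value in range(256)]
    https_count = float(data.count(b"https"))
    return stats + ratios + [https_count]
```

With these assumptions every file maps to a vector of 261 numbers regardless of its size, which is the shape a tabular classifier expects.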
Optionally, the training process of the trained classifier includes:
collecting a certain number of training data files;
identifying a file type of a training data file;
classifying the training data files according to the file types;
for each class of training data file:
analyzing and determining malicious files and non-malicious files in the training data files, and marking;
extracting the characteristics of the marked training data files of the type;
inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.
Optionally, the classifier may be a GBDT classifier, a random forest classifier, or an SVM classifier.
In a specific embodiment of the present invention, the classifier is a GBDT classifier. After the malicious and non-malicious files in the training data files are analyzed and determined, non-malicious files are labeled 0 and malicious files are labeled 1. A GBDT classifier can discover a variety of discriminative features and feature combinations.
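The per-type training loop can be sketched as follows. To keep the example runnable with the standard library only, it uses a deliberately trivial stand-in classifier, `ThresholdStub`, which is not a GBDT; `train_per_type` and the sample layout are likewise hypothetical. Any object with `fit` and `predict` methods can be supplied through `make_classifier`, for example scikit-learn's `GradientBoostingClassifier`, although the text does not name a library.

```python
from collections import defaultdict


class ThresholdStub:
    """Stand-in classifier (NOT a GBDT): midpoint threshold on feature 0."""

    def fit(self, X, y):
        malicious = [x[0] for x, label in zip(X, y) if label == 1]
        benign = [x[0] for x, label in zip(X, y) if label == 0]
        # Midpoint between the class means; needs both labels in the data.
        self.threshold = (sum(malicious) / len(malicious)
                          + sum(benign) / len(benign)) / 2
        return self

    def predict(self, X):
        return [1 if x[0] > self.threshold else 0 for x in X]


def train_per_type(samples, make_classifier):
    """Group (file_type, features, label) samples and train one model per type."""
    grouped = defaultdict(lambda: ([], []))
    for file_type, features, label in samples:
        X, y = grouped[file_type]
        X.append(features)
        y.append(label)  # 0 = non-malicious, 1 = malicious
    models = {}
    for file_type, (X, y) in grouped.items():
        models[file_type] = make_classifier().fit(X, y)
    return models


samples = [
    ("pe", [2.0], 0), ("pe", [7.0], 1),
    ("picture", [7.0], 0), ("picture", [7.8], 1),
]
models = train_per_type(samples, ThresholdStub)
```

In this toy data the same feature value, 6.5, is classified as malicious by the `pe` model and as non-malicious by the `picture` model, which illustrates why a separate classifier per file type can be useful: a value that is normal for compressed image data may be suspicious in an executable.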
Fig. 2 is a block diagram of a malicious file detection system based on machine learning according to an embodiment of the present invention. As shown in fig. 2, the system includes:
the file type identification unit is used for identifying the file type of the file to be detected;
the characteristic extraction unit is used for extracting the characteristics of the file to be detected; and
and the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected. The system has a simple structure, can still identify a malicious file after it has been mutated or obfuscated, occupies few resources, and obtains a detection result quickly. The detection system can be deployed in various environments, such as a personal host computer or a detection platform.
Further, the feature extraction unit includes:
the file conversion module is used for converting the file to be tested into binary data;
the data dividing module is used for dividing the binary data into data blocks with preset lengths;
the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;
the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;
and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.
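One way to picture how the three units cooperate is a small class that holds them and routes each file to the classifier for its type. The class `MaliciousFileDetector`, its constructor arguments, and the choice to raise an error when no classifier exists for a file type are all assumptions of this sketch; the text does not specify how unknown types are handled.

```python
class MaliciousFileDetector:
    """Wires the three units together; all names here are illustrative."""

    def __init__(self, identify_type, extract_features, classifiers):
        self.identify_type = identify_type        # file type identification unit
        self.extract_features = extract_features  # feature extraction unit
        self.classifiers = classifiers            # file type -> trained classifier

    def detect(self, data: bytes) -> int:
        """Return the classification result: 1 = malicious, 0 = non-malicious."""
        file_type = self.identify_type(data)
        if file_type not in self.classifiers:
            # Assumption: a file type without a trained classifier is an error.
            raise LookupError(f"no trained classifier for file type {file_type!r}")
        features = self.extract_features(data)
        # The file classification unit: predict on a single-sample batch.
        return self.classifiers[file_type].predict([features])[0]
```

Passing the units in as callables keeps the routing logic independent of how type identification, feature extraction, and training are implemented.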
The embodiment of the invention also provides a machine-readable storage medium, which stores instructions for enabling a machine to execute the malicious file detection method based on machine learning.
Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions that enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.
In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.

Claims (10)

1. A malicious file detection method based on machine learning, which is characterized by comprising the following steps:
identifying the file type of a file to be detected;
extracting the characteristics of the file to be detected;
and inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.
2. The machine learning-based malicious file detection method according to claim 1, wherein the identifying the file type of the file to be detected comprises:
acquiring file header data of a file to be detected;
and identifying the file type of the file to be detected according to the file header data.
3. The machine-learning based malicious file detection method according to claim 1, wherein the features comprise: the statistical characteristics of the entropy sequence of the file to be tested, the character proportion of each character in the file to be tested and the number of https fields in the file to be tested.
4. The machine learning-based malicious file detection method according to claim 3, wherein the extracting the feature of the file to be detected comprises:
converting the file to be tested into binary data;
dividing the binary data into data blocks with preset lengths;
calculating the information entropy of each data block to obtain an entropy sequence of the file to be detected;
calculating the statistical characteristics of the entropy sequence;
calculating the character ratio of each character in the file to be tested;
and calculating the number of the https fields in the file to be tested.
5. The machine-learning based malicious file detection method according to claim 4, wherein the statistical features comprise: mean, variance, maximum, and minimum.
6. The machine-learning-based malicious file detection method according to claim 1, wherein the training process of the trained classifier comprises:
collecting a certain number of training data files;
identifying a file type of a training data file;
classifying the training data files according to the file types;
for each class of training data file:
analyzing and determining malicious files and non-malicious files in the training data files, and marking;
extracting the characteristics of the marked training data files of the type;
inputting the characteristics of the training data file into a classifier for training to obtain a trained classifier corresponding to the training data file.
7. The machine-learning based malicious file detection method according to claim 1, wherein the classifier comprises: a GBDT classifier, a random forest classifier, and a SVM classifier.
8. A machine learning based malicious file detection system, the system comprising:
the file type identification unit is used for identifying the file type of the file to be detected;
the characteristic extraction unit is used for extracting the characteristics of the file to be detected;
and the file classification unit is used for inputting the characteristics of the file to be detected into a trained classifier corresponding to the file type of the file to be detected for classification calculation to obtain a classification result of the file to be detected.
9. The machine-learning based malicious file detection system according to claim 8, wherein the feature extraction unit includes:
the file conversion module is used for converting the file to be tested into binary data;
the data dividing module is used for dividing the binary data into data blocks with preset lengths;
the statistical characteristic calculation module is used for calculating the information entropy of each data block to obtain the entropy sequence of the file to be detected and calculating the statistical characteristics of the entropy sequence;
the character ratio calculation module is used for calculating the character ratio of each character in the file to be tested;
and the https field number calculating module is used for calculating the number of the https fields in the file to be tested.
10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the method for machine learning based malicious file detection as claimed in any of claims 1-7 of the present application.
CN202110231625.9A 2021-03-02 2021-03-02 Malicious file detection method and system based on machine learning Pending CN112966267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110231625.9A CN112966267A (en) 2021-03-02 2021-03-02 Malicious file detection method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110231625.9A CN112966267A (en) 2021-03-02 2021-03-02 Malicious file detection method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN112966267A true CN112966267A (en) 2021-06-15

Family

ID=76277500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110231625.9A Pending CN112966267A (en) 2021-03-02 2021-03-02 Malicious file detection method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN112966267A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295337A (en) * 2015-06-30 2017-01-04 安恒通(北京)科技有限公司 For detecting the malice method of leak file, device and terminal
CN108710797A (en) * 2018-06-15 2018-10-26 四川大学 A kind of malice document detection method based on entropy information distribution
CN109299609A (en) * 2018-08-08 2019-02-01 北京奇虎科技有限公司 A kind of ELF file test method and device
CN110826062A (en) * 2019-10-18 2020-02-21 北京天融信网络安全技术有限公司 Malicious software detection method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295337A (en) * 2015-06-30 2017-01-04 安恒通(北京)科技有限公司 For detecting the malice method of leak file, device and terminal
US20170004306A1 (en) * 2015-06-30 2017-01-05 Iyuntian Co., Ltd. Method, apparatus and terminal for detecting a malware file
CN108710797A (en) * 2018-06-15 2018-10-26 四川大学 A kind of malice document detection method based on entropy information distribution
CN109299609A (en) * 2018-08-08 2019-02-01 北京奇虎科技有限公司 A kind of ELF file test method and device
CN110826062A (en) * 2019-10-18 2020-02-21 北京天融信网络安全技术有限公司 Malicious software detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
罗森林 等: ""3.4.6 拒绝服务攻击技术"", 《网络信息安全与对抗 第2版》 *
齐晓霞: ""基于特征集合的XSS漏洞安全研究"", 《西华大学学报(自然科学版)》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210615
