WO2019242441A1 - Method, system, and related device for identifying malware based on dynamic features - Google Patents

Info

Publication number
WO2019242441A1
WO2019242441A1 PCT/CN2019/087560 CN2019087560W WO2019242441A1 WO 2019242441 A1 WO2019242441 A1 WO 2019242441A1 CN 2019087560 W CN2019087560 W CN 2019087560W WO 2019242441 A1 WO2019242441 A1 WO 2019242441A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
malicious
malicious file
risk
file operation
Prior art date
Application number
PCT/CN2019/087560
Other languages
English (en)
French (fr)
Inventor
章明星
Original Assignee
深信服科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深信服科技股份有限公司
Publication of WO2019242441A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G06F21/565 Static detection by checking file integrity
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Definitions

  • the present application relates to the field of malware identification, and in particular, to a method, system, device, and computer-readable storage medium for identifying malware based on dynamic characteristics.
  • the purpose of this application is to provide a method for identifying malware based on dynamic features.
  • on the basis of retaining the generalization ability of the machine learning algorithm for fresh samples, HOOK technology is also used to obtain the file operation to be performed by a file that the machine learning algorithm has determined to be a preliminary high-risk file, and it is determined whether that file operation matches the file operations normally performed by malicious files. This method not only retains the ability to identify fresh samples brought by the generalization ability, but also monitors the dynamic characteristics of the file operation to be performed by the preliminary high-risk file and uses them for a secondary determination of the malicious file, which significantly reduces the probability of misjudging fresh samples and identifies malicious files more accurately.
  • Another object of the present application is to provide a malware identification system, device, and computer-readable storage medium based on dynamic characteristics.
  • the present application provides a method for identifying malware based on dynamic characteristics, which method includes:
  • if they match, determining that the preliminary high-risk file is a malicious file, isolating the malicious file, and sending alarm information through a preset path.
  • using a malicious file recognition model constructed based on a machine learning algorithm to identify the software under test to obtain preliminary high-risk files includes:
  • determining whether the file operation matches any malicious file operation included in a preset malicious file operation set includes:
  • determining whether the file operation matches any malicious file operation included in a preset malicious file operation set includes:
  • the method further includes:
  • the corresponding preliminary high-risk file is determined as the malicious file.
  • determining whether the file operation matches any malicious file operation included in a preset malicious file operation set includes:
  • determining whether the preset malicious IP address set contains an IP address identical to the data communication IP; wherein the malicious IP address set is an item in the malicious file operation set.
  • constructing a malicious file classification model based on the machine learning algorithm includes:
  • the malicious file classification model is constructed based on a clustering algorithm.
  • the method further includes:
  • the present application also provides a malware identification system based on dynamic characteristics.
  • the system includes:
  • a machine learning recognition unit configured to identify the software under test by using a malicious file recognition model constructed based on a machine learning algorithm, to obtain preliminary high-risk files;
  • a to-be-executed file operation obtaining unit configured to use HOOK technology to obtain the file operation to be performed by the preliminary high-risk file;
  • an operation matching unit configured to determine whether the file operation matches any malicious file operation included in a preset malicious file operation set;
  • a malicious file determination and processing unit configured to determine that the preliminary high-risk file is a malicious file when the file operation matches the malicious file operation, isolate the malicious file, and send alarm information through a preset path.
  • the machine learning recognition unit includes:
  • a classification model construction subunit configured to construct a malicious file classification model based on the machine learning algorithm
  • a generalization threshold setting subunit configured to set a generalization threshold of a preset size for the malicious file classification model to obtain a generalization classification model
  • a malicious file classifier and determination unit configured to use the generalization classification model to classify the files included in the software under test, and to determine the files classified as malicious to be the preliminary high-risk files.
  • the operation matching unit includes:
  • a time feature extraction subunit configured to obtain the order time and execution time of a corresponding preliminary high-risk file from the file operation; wherein the order time is located before the execution time on a time axis;
  • a difference calculation subunit configured to calculate a time difference between the execution time and the order time
  • the time characteristic judging subunit is configured to determine whether the time difference is within a preset time difference range of the malicious file; wherein the time difference range of the malicious file is an item in the malicious file operation set.
  • the operation matching unit includes:
  • a history file modification feature extraction subunit configured to obtain, from the file operation, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
  • a historical file modification feature judging subunit configured to determine whether the number of modification operations exceeds a preset number of malicious file modification operations; wherein the number of malicious file modification operations is an item in the malicious file operation set.
  • the system also includes:
  • a decoy file distribution and modification times obtaining unit configured to randomly distribute a preset number of bait files and to obtain, according to the file operations, the number of modification operations on the bait files; wherein the bait files have a low lexicographic order and a low probability of being accessed by normal software;
  • a malicious file determination unit based on a bait file is configured to determine a corresponding preliminary high-risk file as the malicious file when the number of modification operations to the bait file exceeds a malicious modification threshold.
  • the operation matching unit includes:
  • a data communication IP extraction subunit configured to extract a data communication IP of a corresponding preliminary high-risk file from the file operation
  • the malicious IP address judging subunit is configured to determine whether a preset malicious IP address set includes an IP address identical to the data communication IP; wherein the malicious IP address set is an item in the malicious file operation set.
  • the machine learning recognition unit further includes:
  • the monitoring mark appending subunit is configured to add a monitoring mark to the preliminary high-risk file to determine a target monitoring file according to the monitoring mark.
  • classification model construction subunit includes:
  • a clustering algorithm model construction module is configured to obtain the malicious file classification model based on the clustering algorithm.
  • the system also includes:
  • a new malicious file operation collection unit configured to collect a new malicious file operation that the malicious file exhibits in an isolation environment after isolating the malicious file
  • the malicious file operation set update unit is configured to update the malicious file operation set by using the new malicious file operation.
  • the present application also provides a malware identification device based on dynamic characteristics, the device includes:
  • a processor configured to implement the steps of the malware identification method as described above when the computer program is executed.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the malware identification method as described above are implemented.
  • the method for identifying malware based on dynamic features provided by this application is to: use a malicious file recognition model based on a machine learning algorithm to identify the software under test to obtain a preliminary high-risk file; use HOOK technology to obtain the file operation to be performed by the preliminary high-risk file; determine whether the file operation matches any of the malicious file operations included in the preset malicious file operation set; and, if it matches, determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
  • the solution provided in the present application retains the generalization ability of the machine learning algorithm for recognizing fresh samples, uses HOOK technology to obtain the file operation to be performed by a file that the machine learning algorithm has determined to be a preliminary high-risk file, and determines whether that file operation matches the file operations normally performed by malicious files.
  • This method not only retains the ability to identify fresh samples brought by the generalization ability, but also monitors the dynamic characteristics of the file operation to be performed by the preliminary high-risk file and uses them for a secondary determination of the malicious file, which significantly reduces the chance of misjudging fresh samples and makes malicious file identification more accurate.
  • This application also provides a malware identification system, device, and computer-readable storage medium based on dynamic characteristics, which have the above-mentioned beneficial effects, and are not repeated here.
  • FIG. 1 is a flowchart of a method for identifying malware based on dynamic features according to an embodiment of the present application
  • FIG. 2 is a flowchart of determining a file operation in a method for identifying malware based on dynamic features according to an embodiment of the present application
  • FIG. 3 is a flowchart of another method for discriminating a file operation in a method for identifying malware based on dynamic features according to an embodiment of the present application
  • FIG. 4 is a flowchart of another method for discriminating a file operation in a method for identifying malware based on dynamic features according to an embodiment of the present application
  • FIG. 5 is a structural block diagram of a malware identification system based on dynamic features provided by an embodiment of the present application.
  • the core of the present application is to provide a method, system, device and computer-readable storage medium for identifying malware based on dynamic features. On the basis of retaining the generalization ability of the machine learning algorithm for recognizing fresh samples, HOOK technology is used to obtain the file operation to be performed by a file that the machine learning algorithm has determined to be a preliminary high-risk file, and it is determined whether that file operation matches the file operations normally performed by malicious files. This method not only retains the ability to identify fresh samples brought by the generalization ability, but also makes a secondary determination of the malicious file by monitoring the dynamic characteristics of the file operation to be performed by the preliminary high-risk file, which significantly reduces the chance of misjudging fresh samples and makes malicious file identification more accurate.
  • FIG. 1 is a flowchart of a method for identifying malware based on dynamic features provided by an embodiment of the present application.
  • S101 Use a malicious file recognition model based on a machine learning algorithm to identify the software under test to obtain preliminary high-risk files;
  • This step first builds a malicious file recognition model based on a machine learning algorithm, and treats each file that the model identifies as malicious as a preliminary high-risk file. On this basis, the subsequent discriminating steps refine the identification, with the aim of reducing the false positive rate.
  • a specific implementation step is as follows:
  • machine learning algorithms include regression algorithms (Regression Algorithms), instance-based algorithms (Instance-based Algorithms), decision tree algorithms (Decision Tree Algorithms), clustering algorithms (Clustering Algorithms) and other types of specific algorithms.
  • Each class has its own characteristics, and each class also has different algorithms that are more finely divided.
  • a monitoring mark may be added to the preliminary high-risk file to determine a target monitoring file according to the monitoring mark, which is convenient for subsequent monitoring of the target monitoring file.
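  • As a minimal sketch of the pre-screening idea in S101, the example below flags a file as preliminary high-risk when its feature vector lies within a generalization threshold of a malicious cluster centroid, and returns the flagged names as the monitoring mark. The centroid, the feature values, the threshold, and all function names are assumptions made for illustration; the source does not specify the model or its features.

```python
from math import dist

# Hypothetical centroid of known-malicious feature vectors, for example the
# output of a clustering step. The values are made up for illustration.
MALICIOUS_CENTROID = (0.9, 0.8, 0.7)

# Hypothetical generalization threshold: a larger radius flags more unseen
# samples, which helps with fresh samples but raises the false positive rate.
GENERALIZATION_THRESHOLD = 0.5


def is_preliminary_high_risk(features, centroid=MALICIOUS_CENTROID,
                             threshold=GENERALIZATION_THRESHOLD):
    """Return True if the sample lies within `threshold` of the centroid."""
    return dist(features, centroid) <= threshold


def mark_for_monitoring(files):
    """Return the names of files flagged as preliminary high-risk."""
    return {name for name, features in files.items()
            if is_preliminary_high_risk(features)}


# Example input: file name -> feature vector (illustrative values only).
software_under_test = {
    "installer.exe": (0.85, 0.75, 0.9),  # close to the malicious centroid
    "readme.txt": (0.1, 0.0, 0.05),      # far from it
}
flagged = mark_for_monitoring(software_under_test)
```

  • Widening the threshold in this sketch corresponds to accepting more false positives at the pre-screening stage, which the later operation-matching steps are meant to filter out.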
  • this step aims to use HOOK technology to obtain the file operation to be performed by a file that the malicious file recognition model based on a machine learning algorithm has identified as a preliminary high-risk file.
  • Regarding HOOK technology: in programming under the Windows system, message passing runs throughout. A message can be simply understood as an integer with a specific meaning, much like the radio call sign "Yangtze River, Yangtze River, I am the Yellow River". To beginners, the messages defined in Windows seem innumerable; some common messages are defined in the winuser.h header file. Hooks are very closely related to messages, and the term literally means "hook". It is therefore not difficult to understand that "a hook is a link in message processing: it is used to monitor the transmission of messages in the system and to process certain specific messages when they arrive, before the final message processing procedure handles them." This is also the reason why hooks are divided into different types, including API hooks, IAT hooks, inline hooks, SSDT hooks, etc. The specific content of this technology is well known to those skilled in the art and will not be repeated here.
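  • The interception idea can be illustrated with a language-level analogy. The sketch below is not a Windows API, IAT, inline, or SSDT hook; it only shows the general pattern of recording a call before the original routine runs. `FakeFileSystem` and `install_hook` are hypothetical names introduced for this example.

```python
class FakeFileSystem:
    """Hypothetical stand-in for the file APIs a monitored program calls."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data


def install_hook(obj, method_name, log):
    """Replace obj.method_name with a wrapper that logs each call first."""
    original = getattr(obj, method_name)

    def hooked(*args, **kwargs):
        # The hook sees the operation before the original routine executes.
        log.append((method_name, args))
        return original(*args, **kwargs)

    setattr(obj, method_name, hooked)
    return original  # returned so the caller can restore it later


fs = FakeFileSystem()
observed = []
install_hook(fs, "write", observed)
fs.write("notes.txt", "hello")
```

  • After the call, `observed` holds the intercepted operation and the write has still taken place, which mirrors how a hook can observe an operation without preventing it.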
  • this step aims to determine whether the obtained file operation matches any of the malicious file operations included in the preset malicious file operation set. The malicious file operation set contains malicious file operations extracted from files that have already been identified as malicious.
  • these malicious file operations include time characteristics, which describe the timing relationship between the various operations a file performs; action characteristics, which describe what operations the file performs; and other system call characteristics, such as whether the file communicates with a malicious IP, whether it performs certain special operations, and whether it calls certain special functions.
  • the purpose of this step is to determine, through analysis of file operations, whether the preliminary high-risk file identified by the malicious file recognition model in S101 has been misjudged as a malicious file. That is, a second determination is made on the preliminary high-risk file, and only files still determined to be malicious after this second determination are identified as malicious files. This can greatly reduce the high false positive rate caused by relying on the machine learning algorithm alone, and yields more accurate malicious file identification results.
  • S104 Determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
  • This step is based on the judgment result of S103 that the file operation matches a malicious file operation contained in the malicious file operation set. The preliminary high-risk file can therefore be determined to be a true malicious file, and it is subsequently processed on that basis to prevent it from harming the user.
  • malicious files can be isolated. Specifically, a malicious file can be placed in a sandbox, so that it can be further verified according to the file operations it performs there. At the same time, the file operation characteristics of the malicious file can be continuously collected, and newly discovered malicious file operations can be added to the preset malicious file operation set.
  • other same or similar methods can also be used to isolate malicious files, such as specific virtual machines, specific virtualized containers, or non-networked computers and computer hardware; an appropriate one can be chosen according to the usage scenario. The series of subsequent operations performed by the malicious file in the isolated environment is observed in order to obtain new malicious file operations and supplement the malicious file operation set.
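  • The feedback loop described here, in which operations observed in the isolated environment supplement the malicious file operation set, can be sketched as follows. Representing operations as plain strings and the function name `update_operation_set` are assumptions made for illustration.

```python
def update_operation_set(malicious_operations, sandbox_observations):
    """Add operations seen in isolation that are not yet in the preset set.

    Returns the newly added operations so they can be reviewed or logged.
    """
    new_operations = set(sandbox_observations) - malicious_operations
    malicious_operations |= new_operations  # update the preset set in place
    return new_operations


# Illustrative operation labels; real entries would be richer records.
preset_set = {"encrypt_documents", "delete_shadow_copies"}
added = update_operation_set(
    preset_set, ["encrypt_documents", "disable_backup_service"]
)
```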
  • the preset path for sending alarm information may include email, various instant messaging software and other channels, which are not specifically limited here.
  • the method for identifying malware based on dynamic features provided by this embodiment retains the generalization ability of the machine learning algorithm for recognizing fresh samples, uses HOOK technology to obtain the file operation to be performed by a file that the machine learning algorithm has determined to be a preliminary high-risk file, and determines whether that file operation matches the file operations normally performed by malicious files. This method not only retains the ability to identify fresh samples brought by the generalization ability, but also makes a secondary determination of the malicious file by monitoring the dynamic characteristics of the file operation to be performed by the preliminary high-risk file, which significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.
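  • The overall flow of S101 to S104 can be summarized in a short sketch, under the assumption that the machine learning model yields a score and that hooked operations are represented as strings. The names `Verdict` and `identify`, the score scale, and the threshold value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    malicious: bool
    matched: list = field(default_factory=list)


def identify(ml_score, pending_operations, malicious_operations, threshold=0.5):
    """Two-stage check: ML pre-screen, then matching of pending operations."""
    # S101: files scoring below the threshold are not preliminary high-risk.
    if ml_score < threshold:
        return Verdict(malicious=False)
    # S102 and S103: compare the hooked, to-be-performed operations with the
    # preset malicious file operation set.
    matched = [op for op in pending_operations if op in malicious_operations]
    # S104: any match confirms the file as malicious; the caller would then
    # isolate the file and send alarm information.
    return Verdict(malicious=bool(matched), matched=matched)


known_bad = {"encrypt_documents"}
confirmed = identify(0.9, ["read_config", "encrypt_documents"], known_bad)
cleared = identify(0.9, ["read_config"], known_bad)
```

  • In this sketch, `cleared` represents a file that the model flagged but whose observed operations did not match, which is the false positive case the secondary determination is meant to catch.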
  • FIG. 2 is a flowchart of determining a file operation in a method for identifying malware based on dynamic features provided by an embodiment of the present application.
  • malware typically has the following time characteristics: (1) it is executed shortly after it lands on the machine (its order time), and it tries to access other files that already existed locally before it landed; (2) after execution, it reads and writes document files and performs file traversal operations at a high frequency. Therefore, this embodiment starts from the time characteristics and takes the order time and execution time extracted from the file operation as an example to explain the specific determination steps.
  • the order time refers to the time when the file reaches the machine through downloading or external copying, and the execution time refers to the time when the file is executed. Under normal circumstances, the execution time is located after the order time on the time axis.
  • S203 Calculate the time difference between the execution time and the order time
  • the preset malicious file time difference range is calculated from the differences between the execution time and the order time of files that have been identified as malicious, and is an item in the preset malicious file operation set.
  • S205 Determine the preliminary high-risk file as a malicious file, isolate the malicious file, and send the alarm information through a preset path.
  • This step is based on the judgment result of S204 that the time difference is within the preset time difference range of the malicious file, and the preliminary high-risk file can be determined to be a malicious file.
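  • The time-difference check of S203 and S204 amounts to a subtraction followed by a range test. The sketch below assumes a range of 0 to 30 seconds purely for illustration; the source only states that the range is derived from files already identified as malicious.

```python
from datetime import datetime, timedelta

# Hypothetical preset range; not a value given in the source.
MALICIOUS_TIME_DIFF_RANGE = (timedelta(seconds=0), timedelta(seconds=30))


def time_difference_is_suspicious(order_time, execution_time,
                                  diff_range=MALICIOUS_TIME_DIFF_RANGE):
    """Return True if execution follows the order time within the range."""
    diff = execution_time - order_time  # S203: time difference
    low, high = diff_range
    return low <= diff <= high          # S204: within the malicious range?


landed = datetime(2019, 5, 20, 10, 0, 0)
quick_run = time_difference_is_suspicious(landed, landed + timedelta(seconds=5))
late_run = time_difference_is_suspicious(landed, landed + timedelta(hours=2))
```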
  • FIG. 3 is a flowchart of another method for determining a file operation in a method for identifying malware based on dynamic features provided by an embodiment of the present application.
  • in order to cause enough damage to users, ransomware modifies or deletes a sufficient number of historical files: it usually uses a specific encryption algorithm to encrypt a large number of historical files, and the encrypted files cannot be decrypted by conventional means. This process therefore involves a large number of modification operations on historical files. Accordingly, this embodiment starts from the file access mode characteristics and takes the number of modification operations on historical files as an example to explain the specific determination steps.
  • the preset number of malicious file modification operations is calculated from the file access mode characteristics of files that have been identified as malicious, and is an item in the preset malicious file operation set.
  • S304 Determine the preliminary high-risk file as a malicious file, isolate the malicious file, and send the alarm information through a preset path.
  • This step is based on the determination result of S303 that the number of modification operations exceeds the preset number of malicious file modification operations, and the preliminary high-risk file can be determined to be a malicious file.
  • bait files with a low lexicographic order and a low probability of being accessed by normal software can also be distributed locally, so that after its file traversal operation the ransomware performs its modification operations on these bait files first. When such modification is detected on a bait file, the malicious file determination can be completed, which effectively protects the other, normal historical files.
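  • Both checks in this embodiment, the modification count on historical files and the modification count on bait files, can be sketched together. The operation format `(action, path)`, the set of modifying actions, the limits, and the bait file name are all assumptions chosen for illustration, and every non-bait path is treated as a historical file for simplicity.

```python
# Hypothetical set of actions that count as a modification.
MODIFYING_ACTIONS = {"write", "delete", "rename"}


def exceeds_modification_limits(operations, decoy_paths,
                                max_history_mods=100, max_decoy_mods=0):
    """Return True if either modification count exceeds its preset limit."""
    history_mods = 0
    decoy_mods = 0
    for action, path in operations:
        if action not in MODIFYING_ACTIONS:
            continue
        if path in decoy_paths:
            decoy_mods += 1    # normal software rarely touches bait files
        else:
            history_mods += 1  # simplification: any other file is historical
    return history_mods > max_history_mods or decoy_mods > max_decoy_mods


# A name starting with "!" sorts early, so a traversal tends to reach it first.
decoys = {"C:/!aaa_decoy.docx"}
bait_touched = exceeds_modification_limits(
    [("read", "C:/notes.txt"), ("write", "C:/!aaa_decoy.docx")], decoys
)
normal_edit = exceeds_modification_limits([("write", "C:/notes.txt")], decoys)
```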
  • FIG. 4 is a flowchart of another method for discriminating a file operation in a method for identifying malware based on dynamic features according to an embodiment of the present application.
  • this embodiment describes the other system call characteristics mentioned in S103, such as the data communication IP, mailbox, special system ports, and special system functions, which differ from the time characteristics and the file access mode characteristics.
  • the data communication IP is taken as an example to explain the specific determination steps, that is, starting from the other system call characteristics.
  • S403 Determine whether the preset malicious IP address set contains the same IP address as the data communication IP;
  • the preset malicious IP address set is obtained by aggregating the IP addresses with which files already identified as malicious perform data communication, and is an item in the preset malicious file operation set.
  • S404 Determine the preliminary high-risk file as a malicious file, isolate the malicious file, and send the alarm information through a preset path.
  • This step is based on the determination result of S403 that the malicious IP address set contains the same IP address as the data communication IP, so that the preliminary high-risk file can be determined as a malicious file.
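  • The membership test of S403 can be sketched with Python's standard `ipaddress` module, which normalizes equivalent textual forms of the same address before comparison. The addresses below come from ranges reserved for documentation, and the function name is hypothetical.

```python
import ipaddress


def communicates_with_malicious_ip(data_communication_ips, malicious_ips):
    """Return True if any contacted address is in the malicious address set."""
    malicious = {ipaddress.ip_address(ip) for ip in malicious_ips}
    return any(ipaddress.ip_address(ip) in malicious
               for ip in data_communication_ips)


# Documentation-only addresses, used here as placeholders.
malicious_set = {"203.0.113.7", "2001:db8::1"}
hit = communicates_with_malicious_ip(["192.0.2.10", "203.0.113.7"], malicious_set)
miss = communicates_with_malicious_ip(["192.0.2.10"], malicious_set)
```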
  • Embodiments two, three, and four each start from a different type of file operation characteristic, and use three different examples to illustrate the steps for making a determination on a preliminary high-risk file. New malicious file operations will also gradually appear over time.
  • in actual situations, only one type of characteristic may be used for matching; alternatively, multiple types may be used together to improve the accuracy of the matching conclusion, as the situation requires.
  • the specific implementation can be parallel or serial. The ultimate purpose is to make multiple judgments using multiple characteristics: as long as a preliminary high-risk file matches at least one of the above file operation characteristics, it can be identified as a true malicious file. If all types of file operation characteristics yield a mismatch, the possibility that the file is malicious can be gradually ruled out after a long period of file operation monitoring.
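  • A serial combination of several characteristics, where a single match is enough to confirm the file, can be sketched as follows. The observation fields, the thresholds, and the check names are hypothetical; a parallel variant would evaluate all checks and combine the results in the same way.

```python
def first_matching_check(observation, checks):
    """Run checks in order; return the first matching name, or None."""
    for name, check in checks:
        if check(observation):
            return name
    return None


# Illustrative checks, one per type of file operation characteristic.
CHECKS = [
    ("time_difference", lambda obs: obs["seconds_from_order_to_execution"] <= 30),
    ("modification_count", lambda obs: obs["history_file_modifications"] > 100),
    ("malicious_ip", lambda obs: bool(obs["contacted_ips"] & obs["malicious_ips"])),
]

suspicious = {
    "seconds_from_order_to_execution": 3600,
    "history_file_modifications": 500,
    "contacted_ips": set(),
    "malicious_ips": {"203.0.113.7"},
}
benign = dict(suspicious, history_file_modifications=2)

matched_by = first_matching_check(suspicious, CHECKS)
unmatched = first_matching_check(benign, CHECKS)
```

  • A result of `None` in this sketch does not clear the file immediately; per the text, the possibility of maliciousness is only ruled out gradually through continued monitoring.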
  • FIG. 5 is a structural block diagram of a malware identification system based on dynamic features provided by an embodiment of the present application.
  • the malware identification system can include:
  • the machine learning recognition unit 100 is configured to use a malicious file recognition model constructed based on a machine learning algorithm to identify the software under test to obtain a preliminary high-risk file;
  • the to-be-executed file operation obtaining unit 200 is configured to obtain a file operation to be performed by a preliminary high-risk file using HOOK technology;
  • An operation matching unit 300 configured to determine whether a file operation matches any malicious file operation included in a preset malicious file operation set
  • the malicious file determination and processing unit 400 is configured to determine that the preliminary high-risk file is a malicious file when the file operation matches the malicious file operation, isolate the malicious file, and send alarm information through a preset path.
  • the machine learning recognition unit 100 includes:
  • Classification model construction subunit used to build a malicious file classification model based on machine learning algorithms
  • the generalization threshold setting subunit is used to set a preset generalization threshold for a malicious file classification model to obtain a generalization classification model;
  • the malicious file classifier and determination unit is used to classify the files contained in the software under test using the generalization classification model, and to identify the files classified as malicious as preliminary high-risk files.
  • One manifestation of the operation matching unit 300 includes:
  • the time feature extraction subunit is used to obtain the order time and execution time of the corresponding preliminary high-risk file from the file operation; where the order time is before the execution time on the time axis;
  • Difference calculation subunit for calculating the time difference between the execution time and the order time
  • the time feature judging subunit is configured to determine whether the time difference is within a preset time difference range of the malicious file; wherein the time difference range of the malicious file is an item in the malicious file operation set.
  • Another manifestation of the operation matching unit 300 includes:
  • History file modification feature extraction subunit used to obtain, from the file operation, the number of modification operations that the corresponding preliminary high-risk file performs on historical files
  • the historical file modification feature judging subunit is used to determine whether the number of modification operations exceeds a preset number of malicious file modification operations; wherein the number of malicious file modification operations is an item in the malicious file operation set.
  • system may further include:
  • the decoy file distribution and modification times acquisition unit is used to randomly distribute a preset number of decoy files and to obtain, according to the file operations, the number of modification operations on the decoy files; wherein the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
  • a malicious file determination unit based on a bait file is used to determine a corresponding preliminary high-risk file as a malicious file when the number of modification operations to the bait file exceeds a malicious modification threshold.
  • Another expression of the operation matching unit 300 includes:
  • the data communication IP extraction subunit is used to extract the data communication IP of the corresponding preliminary high-risk file from the file operation
  • the malicious IP address judging subunit is configured to determine whether a preset set of malicious IP addresses includes an IP address identical to the data communication IP; wherein the malicious IP address set is an item in the malicious file operation set.
  • machine learning recognition unit 100 may further include:
  • the monitoring mark appending subunit is used to attach a monitoring mark to the preliminary high-risk file to determine a target monitoring file based on the monitoring mark.
  • the classification model construction subunit may include:
  • the clustering algorithm model building module is used to build a malicious file classification model based on the clustering algorithm.
  • system may further include:
  • a new malicious file operation collection unit configured to collect new malicious file operations that the malicious file exhibits in the quarantine environment after the malicious file is quarantined;
  • the malicious file operation set update unit is used to update the malicious file operation set with a new malicious file operation.
  • the present application also provides a malware identification device based on dynamic characteristics.
  • the malware identification device may include a memory and a processor, wherein a computer program is stored in the memory, and the processor calls the memory.
  • the steps provided in the foregoing embodiments can be implemented.
  • the malware identification device may also include various necessary network interfaces, power supplies, and other components.
  • the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the storage medium may include: a U disk, a mobile hard disk, a read-only memory (Read-Only Memory (ROM)), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, which can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

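A dynamic-feature-based malware identification method, system, device, and computer-readable storage medium. The method keeps the results that the generalization ability of a machine learning algorithm produces for fresh (previously unseen) samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.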

Description

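A dynamic-feature-based malware identification method, system, and related apparatus
This application claims priority to Chinese patent application No. 201810638966.6, entitled "A dynamic-feature-based malware identification method, system, and related apparatus", filed with the Chinese Patent Office on June 20, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of malware identification, and in particular to a dynamic-feature-based malware identification method, system, apparatus, and computer-readable storage medium.
Background
As computer programming technology has developed, software written in various programming languages has made it easier for people to complete all kinds of tasks on computers. Malware carrying malicious content has appeared alongside it, attacking normal data files or stealing the results of other people's work. Identifying whether software under test is malware is therefore very important.
One existing method of identifying malware uses a machine learning algorithm to build a malicious file identification model from a large number of malicious files. Unlike a traditional signature-based identification model, a model built by a machine learning algorithm has a degree of generalization ability (the ability of the algorithm to adapt to fresh samples: by mining the regularities hidden in the data, it can give reasonably correct results even for fresh samples it was not trained on), so it can discover new kinds of malicious content. At the present stage, however, generalization also has a drawback: content that the model identifies as malicious through generalization is often not actually malicious, so misjudgments occur with high probability. If generalization is suppressed, on the other hand, the machine learning algorithm becomes essentially no different from a traditional signature-based identification model.
How to overcome the current technical shortcomings of generalization, and provide a malware identification method that both retains the generalization ability of machine learning and reduces its false positive rate, is therefore a problem that those skilled in the art urgently need to solve.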
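Summary
The purpose of this application is to provide a dynamic-feature-based malware identification method. The method keeps the results that the generalization ability of a machine learning algorithm produces for fresh samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.
Another purpose of this application is to provide a dynamic-feature-based malware identification system, apparatus, and computer-readable storage medium.
To achieve the above purpose, this application provides a dynamic-feature-based malware identification method, comprising:
performing malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file;
using HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute;
determining whether the file operations match any malicious file operation contained in a preset malicious file operation set;
if they match, determining that the preliminary high-risk file is a malicious file, isolating the malicious file, and sending alarm information through a preset path.
Optionally, performing malware identification on the software under test using the malicious file identification model built on a machine learning algorithm, to obtain the preliminary high-risk file, comprises:
building a malicious file classification model based on the machine learning algorithm;
setting a generalization threshold of a preset size for the malicious file classification model, to obtain a generalized classification model;
using the generalized classification model to classify the files contained in the software under test, and treating the files classified as malicious as the preliminary high-risk files.
Optionally, determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
obtaining, from the file operations, the landing time (the time the file was written to disk) and the execution time of the corresponding preliminary high-risk file, where the landing time precedes the execution time on the time axis;
calculating the time difference between the execution time and the landing time;
determining whether the time difference falls within a preset malicious-file time difference range, where the malicious-file time difference range is one item in the malicious file operation set.
Optionally, determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
obtaining, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
determining whether the number of modification operations exceeds a preset malicious-file modification count, where the malicious-file modification count is one item in the malicious file operation set.
Optionally, the method further comprises:
randomly distributing a preset number of decoy files, and obtaining, from the file operations, the number of modification operations on the decoy files, where the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
when the number of modification operations on the decoy files exceeds a malicious modification threshold, determining the corresponding preliminary high-risk file to be the malicious file.
Optionally, determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
obtaining, from the file operations, the data communication IP of the corresponding preliminary high-risk file;
determining whether a preset malicious IP address set contains an IP address identical to the data communication IP, where the malicious IP address set is one item in the malicious file operation set.
Optionally, after the preliminary high-risk file is obtained, the method further comprises:
attaching a monitoring mark to the preliminary high-risk file, so that the target monitoring file can be determined from the monitoring mark.
Optionally, building the malicious file classification model based on the machine learning algorithm comprises:
building the malicious file classification model based on a clustering algorithm.
Optionally, after the malicious file is isolated, the method further comprises:
collecting the new malicious file operations that the malicious file exhibits in the isolation environment;
updating the malicious file operation set with the new malicious file operations.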
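To achieve the above purpose, this application also provides a dynamic-feature-based malware identification system, comprising:
a machine learning recognition unit, configured to perform malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file;
a pending file operation acquisition unit, configured to use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute;
an operation matching unit, configured to determine whether the file operations match any malicious file operation contained in a preset malicious file operation set;
a malicious file determination and processing unit, configured to determine, when the file operations match a malicious file operation, that the preliminary high-risk file is a malicious file, to isolate the malicious file, and to send alarm information through a preset path.
Optionally, the machine learning recognition unit comprises:
a classification model construction subunit, configured to build a malicious file classification model based on the machine learning algorithm;
a generalization threshold setting subunit, configured to set a generalization threshold of a preset size for the malicious file classification model, to obtain a generalized classification model;
a malicious file classification and determination subunit, configured to use the generalized classification model to classify the files contained in the software under test, and to treat the files classified as malicious as the preliminary high-risk files.
Optionally, the operation matching unit comprises:
a time feature extraction subunit, configured to obtain, from the file operations, the landing time and the execution time of the corresponding preliminary high-risk file, where the landing time precedes the execution time on the time axis;
a difference calculation subunit, configured to calculate the time difference between the execution time and the landing time;
a time feature judging subunit, configured to determine whether the time difference falls within a preset malicious-file time difference range, where the malicious-file time difference range is one item in the malicious file operation set.
Optionally, the operation matching unit comprises:
a historical file modification feature extraction subunit, configured to obtain, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
a historical file modification feature judging subunit, configured to determine whether the number of modification operations exceeds a preset malicious-file modification count, where the malicious-file modification count is one item in the malicious file operation set.
Optionally, the system further comprises:
a decoy file distribution and modification count acquisition unit, configured to randomly distribute a preset number of decoy files and to obtain, from the file operations, the number of modification operations on the decoy files, where the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
a decoy-based malicious file determination unit, configured to determine the corresponding preliminary high-risk file to be the malicious file when the number of modification operations on the decoy files exceeds a malicious modification threshold.
Optionally, the operation matching unit comprises:
a data communication IP extraction subunit, configured to extract, from the file operations, the data communication IP of the corresponding preliminary high-risk file;
a malicious IP address judging subunit, configured to determine whether a preset malicious IP address set contains an IP address identical to the data communication IP, where the malicious IP address set is one item in the malicious file operation set.
Optionally, the machine learning recognition unit further comprises:
a monitoring mark appending subunit, configured to attach a monitoring mark to the preliminary high-risk file, so that the target monitoring file can be determined from the monitoring mark.
Optionally, the classification model construction subunit comprises:
a clustering algorithm model building module, configured to build the malicious file classification model based on a clustering algorithm.
Optionally, the system further comprises:
a new malicious file operation collection unit, configured to collect, after the malicious file is isolated, the new malicious file operations that the malicious file exhibits in the isolation environment;
a malicious file operation set update unit, configured to update the malicious file operation set with the new malicious file operations.
To achieve the above purpose, this application also provides a dynamic-feature-based malware identification apparatus, comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the malware identification method described above when executing the computer program.
To achieve the above purpose, this application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the malware identification method described above.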
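The dynamic-feature-based malware identification method provided by this application performs malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file; uses HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute; determines whether the file operations match any malicious file operation contained in a preset malicious file operation set; and, if they match, determines that the preliminary high-risk file is a malicious file, isolates the malicious file, and sends alarm information through a preset path.
The technical solution provided by this application keeps the results that the generalization ability of a machine learning algorithm produces for fresh samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate. This application also provides a dynamic-feature-based malware identification system, apparatus, and computer-readable storage medium, which have the same beneficial effects; these are not repeated here.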
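Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
Figure 1 is a flowchart of a dynamic-feature-based malware identification method provided by an embodiment of this application;
Figure 2 is a flowchart of one way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application;
Figure 3 is a flowchart of another way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application;
Figure 4 is a flowchart of yet another way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application;
Figure 5 is a structural block diagram of a dynamic-feature-based malware identification system provided by an embodiment of this application.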
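Detailed Description
The core of this application is to provide a dynamic-feature-based malware identification method, system, apparatus, and computer-readable storage medium. The method keeps the results that the generalization ability of a machine learning algorithm produces for fresh samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without inventive effort fall within the scope of protection of this application.
Embodiment 1
Refer to Figure 1, a flowchart of a dynamic-feature-based malware identification method provided by an embodiment of this application.
The method comprises the following steps:
S101: Perform malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file.
This step first builds a malicious file identification model based on a machine learning algorithm, and treats the files that the model identifies as malicious as preliminary high-risk files. The later judgment steps then build on this result to identify malicious files accurately and reduce the misjudgment rate.
One concrete implementation is as follows:
Build a malicious file classification model based on a machine learning algorithm; set a generalization threshold of a preset size for the malicious file classification model to obtain a generalized classification model; use the generalized classification model to classify the files contained in the software under test, and treat the files classified as malicious as preliminary high-risk files.
Machine learning algorithms include several broad families, such as regression algorithms, instance-based algorithms, decision tree algorithms, and clustering algorithms. Each family has its own characteristics, and each contains more finely divided individual algorithms.
In general, whichever specific machine learning algorithm the malicious file identification model is built on, the aim is to discover common features hidden in the data, analyze how they correlate, and ultimately extract the shared characteristics of the target content (malicious content). On that basis the model may discover fresh samples that share those characteristics but present them in a newer form; this is the generalization ability that machine learning brings. The identification model is usually given a fairly loose threshold, and a loose threshold leads to a high misjudgment rate, but simply raising the threshold makes the generalization ability essentially useless. To keep the generalization ability, adjusting the threshold alone is therefore not enough: a further malicious file identification mechanism must be added on top, so that the misjudgment rate falls while generalization is retained.
Further, a monitoring mark can be attached to the preliminary high-risk file so that the target monitoring file can be determined from the mark, which makes subsequent monitoring of the target file easier.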
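The thresholding and marking in S101 can be sketched as follows. This is only a minimal illustration: the `ScannedFile` type, the `score` field (standing in for the output of some already-trained classification model), and the threshold value 0.3 are all assumptions made for the example and are not specified in the application.

```python
from dataclasses import dataclass


@dataclass
class ScannedFile:
    path: str
    score: float             # hypothetical malicious score from a trained model, in [0, 1]
    monitored: bool = False  # the "monitoring mark" attached to preliminary high-risk files


def flag_preliminary_high_risk(files, generalization_threshold=0.3):
    """Return the files whose score reaches the (deliberately loose) threshold.

    Each returned file also gets its monitoring mark set, so that later
    stages know which files to watch.
    """
    flagged = []
    for f in files:
        if f.score >= generalization_threshold:
            f.monitored = True
            flagged.append(f)
    return flagged
```

A loose threshold keeps borderline files in the flagged set on purpose; the later file-operation checks are what decide whether they are really malicious.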
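S102: Use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute.
Building on S101, this step uses HOOK technology to obtain the file operations about to be executed by the files that the machine-learning-based malicious file identification model has identified as preliminary high-risk files.
HOOK technology: in programming for Windows systems, message passing runs through everything. A message can be understood simply as an integer with a specific meaning, much like a code phrase such as "Yangtze, Yangtze, this is Yellow River". To a beginner the messages defined in Windows seem countless; some of the common ones are defined in the winuser.h header file. Hooks are closely tied to messages. With that in mind, a hook can be described as a stage in message processing that monitors messages as they pass through the system and handles certain specific messages before they reach their final message handler. This is also why hooks come in different kinds, including API hooks, IAT hooks, inline hooks, SSDT hooks, and so on. The details of this technology are well known to those skilled in the art and are not repeated here.
In the systems commonly used today, the process spaces of ordinary user programs are independent, and programs run without interfering with one another. A hook, however, can "blend" its own code into the process of the hooked program and become part of the target process. This is what makes it possible to obtain the target program's file operations with this technology.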
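The interception idea can be illustrated inside a single Python process. This is an analogy only, not the Windows API, IAT, inline, or SSDT hooking that the application refers to: the `FileApi` class and `install_hook` helper are hypothetical names invented for the sketch, and the hook here simply wraps a method so that every call is observed before it is forwarded.

```python
class FileApi:
    """Hypothetical stand-in for a file-operation entry point."""

    def write(self, path, data):
        # A real implementation would write to disk; here we just report the size.
        return len(data)


def install_hook(obj, method_name, on_call):
    """Replace obj.<method_name> with a wrapper that reports each call first.

    Returns the original callable so the caller can restore it later.
    """
    original = getattr(obj, method_name)

    def hooked(*args, **kwargs):
        on_call(method_name, args, kwargs)  # observe the operation before it runs
        return original(*args, **kwargs)    # then let it proceed unchanged

    setattr(obj, method_name, hooked)
    return original


api = FileApi()
observed = []
install_hook(api, "write", lambda name, args, kwargs: observed.append((name, args)))
written = api.write("report.docx", b"abc")
```

Because the callback runs before the original call, a monitor built this way sees an operation that is about to execute, which is the property S102 relies on.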
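S103: Determine whether the file operations match any malicious file operation contained in the preset malicious file operation set.
Building on S102, this step matches the file operations obtained for a piece of software under test against the malicious file operations contained in the preset malicious file operation set. The set contains malicious file operations extracted from files already confirmed to be malicious. These include time features, which describe the timing relationship between the operations the file performs; action features, which describe what operations the file performed; and other system call features, such as whether the file communicates with a malicious IP, whether it performs certain special operations, and whether it calls certain special functions.
The purpose of this step is to analyze the file operations in order to decide whether a preliminary high-risk file produced by the identification model in S101 was misjudged as malicious. The preliminary high-risk file is judged a second time by comparing its file operations against the dynamic features above, and only files still judged malicious after this second determination are treated as malicious files. This greatly reduces the high misjudgment rate that results from relying on the machine learning algorithm alone, and yields a more accurate identification result.
S104: Determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
This step applies when the result of S103 is that the file operations match a malicious file operation contained in the malicious file operation set. The preliminary high-risk file can then be determined to be a genuine malicious file and handled accordingly, to prevent it from harming the user.
The malicious file can be isolated, and specifically it can be placed in a sandbox, so that the file operations it performs there can be used for further verification. The file operation features of the malicious file can also be captured, so that newly discovered malicious file operations are continually added to the preset malicious file operation set. Other identical or similar means of isolation can also be used, such as a dedicated virtual machine, a dedicated virtualization container, or a computer and hardware that are not connected to a network. Depending on the means chosen, a corresponding way of observing the subsequent operations of the malicious file in the isolation environment is used to obtain new malicious file operations and supplement the malicious file operation set.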
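The feedback loop from the isolation environment back into the operation set can be sketched as a simple set update. Representing each malicious file operation as a string label, and the function name `update_operation_set`, are assumptions made for the example; the application does not specify how operations are encoded.

```python
def update_operation_set(operation_set, sandbox_observations):
    """Add operations seen in the isolation environment that are not yet known.

    `operation_set` is modified in place; the newly added operations are returned.
    """
    new_ops = set(sandbox_observations) - operation_set
    operation_set |= new_ops
    return new_ops
```

Returning only the new operations makes it easy to log or review what the isolation run actually contributed.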
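The preset path for sending alarm information can include channels such as email and various instant messaging applications; no specific limitation is placed on it here.
Based on the above technical solution, the dynamic-feature-based malware identification method provided by this embodiment keeps the results that the generalization ability of a machine learning algorithm produces for fresh samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.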
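Embodiment 2
Refer to Figure 2, a flowchart of one way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application.
Some malware (ransomware) has the following time features: (1) it is executed shortly after landing on disk, and tries to access other files that already existed locally before it landed; (2) after execution it reads and writes document files and traverses files at a high frequency. This embodiment therefore uses the landing time and execution time extracted from the file operations as an example to explain the concrete judgment steps, that is, it starts from time features.
S201: Use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute.
S202: Obtain, from the file operations, the landing time and execution time of the corresponding preliminary high-risk file.
The landing time is the time at which the file arrived on the machine, for example through a download or a copy from an external device. The execution time is the time at which the file, already on disk, is executed. Under normal circumstances the execution time follows the landing time on the time axis.
S203: Calculate the time difference between the execution time and the landing time.
S204: Determine whether the time difference falls within the preset malicious-file time difference range.
The preset malicious-file time difference range is calculated from the differences between execution time and landing time of files already confirmed to be malicious, and is one item in the preset malicious file operation set.
S205: Determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
This step applies when the result of S204 is that the time difference falls within the preset malicious-file time difference range, in which case the preliminary high-risk file can be determined to be a malicious file.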
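Steps S203 and S204 amount to a subtraction and a range check. The sketch below assumes the two timestamps have already been extracted from the hooked file operations; the function name and the default range of 0 to 60 seconds are illustrative assumptions, since the application only says the range is derived from known malicious files.

```python
from datetime import datetime, timedelta


def matches_time_feature(landing_time, execution_time,
                         malicious_range=(timedelta(0), timedelta(seconds=60))):
    """Return True if (execution_time - landing_time) lies inside the malicious range."""
    delta = execution_time - landing_time  # S203: time difference
    low, high = malicious_range
    return low <= delta <= high            # S204: range check
```

A file executed a few seconds after landing would match under these assumed bounds, while one executed hours later would not.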
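Embodiment 3
Refer to Figure 3, a flowchart of another way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application.
To cause enough damage to the user, ransomware modifies or deletes a large number of historical files. It typically encrypts many historical files with a specific encryption algorithm, and the encrypted files cannot be decrypted by conventional means, so this process produces modification operations on a large number of historical files. This embodiment therefore uses a feature of the file access pattern (the number of modification operations on historical files) as an example to explain the concrete judgment steps, that is, it starts from file access pattern features.
S301: Use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute.
S302: Obtain, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files.
S303: Determine whether the number of modification operations exceeds the preset malicious-file modification count.
The preset malicious-file modification count is calculated from the file access pattern features of files already confirmed to be malicious, and is one item in the preset malicious file operation set.
S304: Determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
This step applies when the result of S303 is that the number of modification operations exceeds the preset malicious-file modification count, in which case the preliminary high-risk file can be determined to be a malicious file.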
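Steps S302 and S303 can be sketched as a count over the recorded operations. The record layout (dictionaries with `kind` and `target_created` keys), the rule that a "historical file" is one created before the suspect file landed, and the default limit of 100 are all assumptions made for illustration; the application does not define them.

```python
def count_history_modifications(operations, landing_time):
    """Count modify/delete operations on files that existed before `landing_time`."""
    return sum(
        1
        for op in operations
        if op["kind"] in {"modify", "delete"} and op["target_created"] < landing_time
    )


def exceeds_modification_limit(operations, landing_time, limit=100):
    """S303: compare the count against the preset malicious-file modification count."""
    return count_history_modifications(operations, landing_time) > limit
```

Timestamps are plain numbers here so the comparison stays simple; any comparable time type would work the same way.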
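Further, to reduce the damage that ransomware does to normal local historical files, decoy files that have a low lexicographic order and a low probability of being accessed by normal software can be scattered locally. After traversing the files, the ransomware will then perform its modification operations on these decoy files first; once a certain number of decoy files are detected in this state, the malicious file determination can be completed, which effectively protects the other, normal historical files.
One concrete procedure is as follows:
Randomly distribute a preset number of decoy files, and obtain from the file operations the number of modification operations on the decoy files; when the number of modification operations on the decoy files exceeds the malicious modification threshold, determine the corresponding preliminary high-risk file to be a malicious file.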
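The decoy idea can be sketched as below. Two things here are assumptions rather than the application's method: the file name pattern (a leading `!`, chosen because it sorts before digits and letters in ASCII order) and the detection mechanism. The application counts modifications from the hooked file operations; this sketch swaps in a simpler stand-in, comparing each decoy's current SHA-256 digest with the one recorded when it was created.

```python
import hashlib
import os
import random
import tempfile


def scatter_decoys(directories, count, rng=random):
    """Write `count` decoy files into randomly chosen directories.

    Returns a dict mapping each decoy path to the SHA-256 digest of its content.
    """
    decoys = {}
    for i in range(count):
        folder = rng.choice(directories)
        # "!" sorts before digits and letters, so a name-ordered traversal reaches it early.
        path = os.path.join(folder, f"!decoy_{i:04d}.docx")
        content = os.urandom(64)
        with open(path, "wb") as fh:
            fh.write(content)
        decoys[path] = hashlib.sha256(content).hexdigest()
    return decoys


def count_modified_decoys(decoys):
    """Count decoys whose content changed or that were deleted since creation."""
    modified = 0
    for path, digest in decoys.items():
        try:
            with open(path, "rb") as fh:
                current = hashlib.sha256(fh.read()).hexdigest()
        except FileNotFoundError:
            current = None  # a deleted decoy also counts as tampered with
        if current != digest:
            modified += 1
    return modified
```

Comparing the returned count with a malicious modification threshold then gives the decoy-based determination described above.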
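Embodiment 4
Refer to Figure 4, a flowchart of yet another way of judging file operations in the dynamic-feature-based malware identification method provided by an embodiment of this application.
S103 described other system call features that are distinct from time features and file access pattern features, such as the data communication IP, mailboxes, special system ports, and special system functions. Some malware also performs special operations, such as blanking the screen, to conceal what it is doing. This embodiment uses one of these features, the data communication IP, as an example to explain the concrete judgment steps, that is, it starts from other system call features.
S401: Use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute.
S402: Obtain, from the file operations, the data communication IP of the corresponding preliminary high-risk file.
S403: Determine whether the preset malicious IP address set contains an IP address identical to the data communication IP.
The preset malicious IP address set is compiled from the malicious IPs with which files already confirmed to be malicious have communicated, and is one item in the preset malicious file operation set.
S404: Determine that the preliminary high-risk file is a malicious file, isolate the malicious file, and send alarm information through a preset path.
This step applies when the result of S403 is that the malicious IP address set contains an IP address identical to the data communication IP, in which case the preliminary high-risk file can be determined to be a malicious file.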
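Step S403 is a set membership test. The sketch below uses Python's standard `ipaddress` module so that differently written forms of the same address compare equal; the function name is an assumption, and the addresses used in any example call should come from documentation ranges such as 203.0.113.0/24 rather than real hosts.

```python
import ipaddress


def contacts_malicious_ip(communication_ips, malicious_ips):
    """Return True if any data communication IP appears in the malicious IP set."""
    bad = {ipaddress.ip_address(ip) for ip in malicious_ips}
    return any(ipaddress.ip_address(ip) in bad for ip in communication_ips)
```

Parsing both sides before comparing avoids false negatives from formatting differences, for example between compressed and fully expanded IPv6 notation.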
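Based on the above technical solution, the dynamic-feature-based malware identification method provided by this embodiment keeps the results that the generalization ability of a machine learning algorithm produces for fresh samples, and additionally uses HOOK technology to capture the file operations of files that the machine learning algorithm has judged to be preliminary high-risk files, then determines whether the file operations about to be executed match those that malicious files typically execute. The method thus retains the ability to recognize fresh samples that generalization provides, while using a dynamic feature (the file operations the preliminary high-risk file is about to execute) to confirm malicious files a second time. This significantly reduces the probability of misjudging fresh samples and makes malicious file identification more accurate.
Embodiments 2, 3, and 4 each start from a different kind of file operation feature and use three different examples to explain the steps for judging a preliminary high-risk file. The method is not limited to these three kinds; as computer technology develops, new malicious file operations will gradually appear. In practice a single feature can be used for matching, or several can be used together, depending on how accurate the matching result needs to be in the actual scenario, and the matching can be carried out in parallel or in series. The aim is multiple determination across several features: as soon as a preliminary high-risk file satisfies at least one of these file operation features, it can be confirmed as a malicious file. If the checks against all of the file operation features find no match, the possibility that the file is malicious can be ruled out gradually through long-term monitoring of its file operations.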
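The "at least one feature matches" rule can be sketched by running a set of named checks over one observation, in series. The observation keys, the check names, and all threshold values below are hypothetical and chosen only to make the example self-contained.

```python
def matched_features(observation, matchers):
    """Return the names of all checks that report a match for this observation."""
    return [name for name, check in matchers.items() if check(observation)]


def confirm_malicious(observation, matchers):
    """A preliminary high-risk file is confirmed if at least one feature matches."""
    return len(matched_features(observation, matchers)) > 0


# Hypothetical checks, one per embodiment; thresholds are illustrative only.
matchers = {
    "time": lambda o: o["seconds_from_landing_to_execution"] <= 60,
    "modifications": lambda o: o["history_modifications"] > 100,
    "ip": lambda o: bool(set(o["contacted_ips"]) & {"203.0.113.7"}),
}
```

Keeping the names of the matched checks, rather than just a boolean, makes it possible to say in the alarm information which feature triggered the determination. The checks are independent of one another, so they could equally be evaluated in parallel.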
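Because the situations are complex, they cannot all be listed and explained one by one. Those skilled in the art should recognize that many examples can exist when the basic principles of the method provided by this application are combined with actual circumstances, and that all of those obtained without inventive effort fall within the scope of protection of this application.
Refer now to Figure 5, a structural block diagram of a dynamic-feature-based malware identification system provided by an embodiment of this application.
The malware identification system may include:
a machine learning recognition unit 100, configured to perform malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file;
a pending file operation acquisition unit 200, configured to use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute;
an operation matching unit 300, configured to determine whether the file operations match any malicious file operation contained in the preset malicious file operation set;
a malicious file determination and processing unit 400, configured to determine, when the file operations match a malicious file operation, that the preliminary high-risk file is a malicious file, to isolate the malicious file, and to send alarm information through a preset path.
The machine learning recognition unit 100 includes:
a classification model construction subunit, configured to build a malicious file classification model based on a machine learning algorithm;
a generalization threshold setting subunit, configured to set a generalization threshold of a preset size for the malicious file classification model, to obtain a generalized classification model;
a malicious file classification and determination subunit, configured to use the generalized classification model to classify the files contained in the software under test, and to treat the files classified as malicious as preliminary high-risk files.
One form of the operation matching unit 300 includes:
a time feature extraction subunit, configured to obtain, from the file operations, the landing time and execution time of the corresponding preliminary high-risk file, where the landing time precedes the execution time on the time axis;
a difference calculation subunit, configured to calculate the time difference between the execution time and the landing time;
a time feature judging subunit, configured to determine whether the time difference falls within the preset malicious-file time difference range, where the malicious-file time difference range is one item in the malicious file operation set.
Another form of the operation matching unit 300 includes:
a historical file modification feature extraction subunit, configured to obtain, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
a historical file modification feature judging subunit, configured to determine whether the number of modification operations exceeds the preset malicious-file modification count, where the malicious-file modification count is one item in the malicious file operation set.
Further, the system may also include:
a decoy file distribution and modification count acquisition unit, configured to randomly distribute a preset number of decoy files and to obtain, from the file operations, the number of modification operations on the decoy files, where the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
a decoy-based malicious file determination unit, configured to determine the corresponding preliminary high-risk file to be a malicious file when the number of modification operations on the decoy files exceeds the malicious modification threshold.
Yet another form of the operation matching unit 300 includes:
a data communication IP extraction subunit, configured to extract, from the file operations, the data communication IP of the corresponding preliminary high-risk file;
a malicious IP address judging subunit, configured to determine whether the preset malicious IP address set contains an IP address identical to the data communication IP, where the malicious IP address set is one item in the malicious file operation set.
Further, the machine learning recognition unit 100 may also include:
a monitoring mark appending subunit, configured to attach a monitoring mark to the preliminary high-risk file, so that the target monitoring file can be determined from the monitoring mark.
The classification model construction subunit may include:
a clustering algorithm model building module, configured to build the malicious file classification model based on a clustering algorithm.
Further, the system may also include:
a new malicious file operation collection unit, configured to collect, after the malicious file is isolated, the new malicious file operations that the malicious file exhibits in the isolation environment;
a malicious file operation set update unit, configured to update the malicious file operation set with the new malicious file operations.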
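Based on the above embodiments, this application also provides a dynamic-feature-based malware identification apparatus. The apparatus may include a memory and a processor, where a computer program is stored in the memory; when the processor calls the computer program in the memory, the steps provided in the above embodiments can be implemented. The apparatus may also include the necessary network interfaces, power supplies, and other components.
This application also provides a computer-readable storage medium on which a computer program is stored; when executed by an execution terminal or a processor, the computer program can implement the steps provided in the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
The embodiments in this specification are described progressively: each embodiment focuses on how it differs from the others, and the parts that are the same or similar across embodiments can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for the relevant details, refer to the description of the method.
Professionals will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are carried out in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each specific application, but such implementations should not be considered to go beyond the scope of this application.
Specific examples have been used herein to explain the principles and implementations of this application; the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. It should be noted that a person of ordinary skill in the art can make a number of improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of this application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.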

Claims (20)

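  1. A dynamic-feature-based malware identification method, characterized by comprising:
    performing malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file;
    using HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute;
    determining whether the file operations match any malicious file operation contained in a preset malicious file operation set;
    if they match, determining that the preliminary high-risk file is a malicious file, isolating the malicious file, and sending alarm information through a preset path.
  2. The method according to claim 1, characterized in that performing malware identification on the software under test using the malicious file identification model built on a machine learning algorithm, to obtain the preliminary high-risk file, comprises:
    building a malicious file classification model based on the machine learning algorithm;
    setting a generalization threshold of a preset size for the malicious file classification model, to obtain a generalized classification model;
    using the generalized classification model to classify the files contained in the software under test, and treating the files classified as malicious as the preliminary high-risk files.
  3. The method according to claim 1, characterized in that determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
    obtaining, from the file operations, the landing time and the execution time of the corresponding preliminary high-risk file, where the landing time precedes the execution time on the time axis;
    calculating the time difference between the execution time and the landing time;
    determining whether the time difference falls within a preset malicious-file time difference range, where the malicious-file time difference range is one item in the malicious file operation set.
  4. The method according to claim 1, characterized in that determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
    obtaining, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
    determining whether the number of modification operations exceeds a preset malicious-file modification count, where the malicious-file modification count is one item in the malicious file operation set.
  5. The method according to claim 4, characterized by further comprising:
    randomly distributing a preset number of decoy files, and obtaining, from the file operations, the number of modification operations on the decoy files, where the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
    when the number of modification operations on the decoy files exceeds a malicious modification threshold, determining the corresponding preliminary high-risk file to be the malicious file.
  6. The method according to claim 1, characterized in that determining whether the file operations match any malicious file operation contained in the preset malicious file operation set comprises:
    obtaining, from the file operations, the data communication IP of the corresponding preliminary high-risk file;
    determining whether a preset malicious IP address set contains an IP address identical to the data communication IP, where the malicious IP address set is one item in the malicious file operation set.
  7. The method according to any one of claims 1 to 6, characterized in that, after the preliminary high-risk file is obtained, the method further comprises:
    attaching a monitoring mark to the preliminary high-risk file, so that the target monitoring file can be determined from the monitoring mark.
  8. The method according to claim 1, characterized in that building the malicious file classification model based on the machine learning algorithm comprises:
    building the malicious file classification model based on a clustering algorithm.
  9. The method according to claim 1, characterized in that, after the malicious file is isolated, the method further comprises:
    collecting the new malicious file operations that the malicious file exhibits in the isolation environment;
    updating the malicious file operation set with the new malicious file operations.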
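  10. A dynamic-feature-based malware identification system, characterized by comprising:
    a machine learning recognition unit, configured to perform malware identification on the software under test using a malicious file identification model built on a machine learning algorithm, to obtain a preliminary high-risk file;
    a pending file operation acquisition unit, configured to use HOOK technology to obtain the file operations that the preliminary high-risk file is about to execute;
    an operation matching unit, configured to determine whether the file operations match any malicious file operation contained in a preset malicious file operation set;
    a malicious file determination and processing unit, configured to determine, when the file operations match a malicious file operation, that the preliminary high-risk file is a malicious file, to isolate the malicious file, and to send alarm information through a preset path.
  11. The system according to claim 10, characterized in that the machine learning recognition unit comprises:
    a classification model construction subunit, configured to build a malicious file classification model based on the machine learning algorithm;
    a generalization threshold setting subunit, configured to set a generalization threshold of a preset size for the malicious file classification model, to obtain a generalized classification model;
    a malicious file classification and determination subunit, configured to use the generalized classification model to classify the files contained in the software under test, and to treat the files classified as malicious as the preliminary high-risk files.
  12. The system according to claim 10, characterized in that the operation matching unit comprises:
    a time feature extraction subunit, configured to obtain, from the file operations, the landing time and the execution time of the corresponding preliminary high-risk file, where the landing time precedes the execution time on the time axis;
    a difference calculation subunit, configured to calculate the time difference between the execution time and the landing time;
    a time feature judging subunit, configured to determine whether the time difference falls within a preset malicious-file time difference range, where the malicious-file time difference range is one item in the malicious file operation set.
  13. The system according to claim 10, characterized in that the operation matching unit comprises:
    a historical file modification feature extraction subunit, configured to obtain, from the file operations, the number of modification operations that the corresponding preliminary high-risk file performs on historical files;
    a historical file modification feature judging subunit, configured to determine whether the number of modification operations exceeds a preset malicious-file modification count, where the malicious-file modification count is one item in the malicious file operation set.
  14. The system according to claim 13, characterized by further comprising:
    a decoy file distribution and modification count acquisition unit, configured to randomly distribute a preset number of decoy files and to obtain, from the file operations, the number of modification operations on the decoy files, where the decoy files have a low lexicographic order and a low probability of being accessed by normal software;
    a decoy-based malicious file determination unit, configured to determine the corresponding preliminary high-risk file to be the malicious file when the number of modification operations on the decoy files exceeds a malicious modification threshold.
  15. The system according to claim 10, characterized in that the operation matching unit comprises:
    a data communication IP extraction subunit, configured to extract, from the file operations, the data communication IP of the corresponding preliminary high-risk file;
    a malicious IP address judging subunit, configured to determine whether a preset malicious IP address set contains an IP address identical to the data communication IP, where the malicious IP address set is one item in the malicious file operation set.
  16. The system according to any one of claims 10 to 15, characterized in that the machine learning recognition unit further comprises:
    a monitoring mark appending subunit, configured to attach a monitoring mark to the preliminary high-risk file, so that the target monitoring file can be determined from the monitoring mark.
  17. The system according to claim 10, characterized in that the classification model construction subunit comprises:
    a clustering algorithm model building module, configured to build the malicious file classification model based on a clustering algorithm.
  18. The system according to claim 10, characterized by further comprising:
    a new malicious file operation collection unit, configured to collect, after the malicious file is isolated, the new malicious file operations that the malicious file exhibits in the isolation environment;
    a malicious file operation set update unit, configured to update the malicious file operation set with the new malicious file operations.
  19. A dynamic-feature-based malware identification apparatus, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the malware identification method according to any one of claims 1 to 9 when executing the computer program.
  20. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when executed by a processor the computer program implements the steps of the malware identification method according to any one of claims 1 to 9.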
PCT/CN2019/087560 2018-06-20 2019-05-20 一种基于动态特征的恶意软件识别方法、系统及相关装置 WO2019242441A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810638966.6A CN110619211A (zh) 2018-06-20 2018-06-20 一种基于动态特征的恶意软件识别方法、系统及相关装置
CN201810638966.6 2018-06-20

Publications (1)

Publication Number Publication Date
WO2019242441A1 true WO2019242441A1 (zh) 2019-12-26

Family

ID=68920802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087560 WO2019242441A1 (zh) 2018-06-20 2019-05-20 一种基于动态特征的恶意软件识别方法、系统及相关装置

Country Status (2)

Country Link
CN (1) CN110619211A (zh)
WO (1) WO2019242441A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523588A (zh) * 2020-04-20 2020-08-11 电子科技大学 基于改进的lstm对apt攻击恶意软件流量进行分类的方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926054B (zh) * 2021-02-22 2023-10-03 亚信科技(成都)有限公司 一种恶意文件的检测方法、装置、设备及存储介质
CN113282928B (zh) * 2021-06-11 2022-12-20 杭州安恒信息技术股份有限公司 恶意文件的处理方法、装置、系统、电子装置和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984450A (zh) * 2010-12-15 2011-03-09 北京安天电子设备有限公司 恶意代码检测方法和系统
CN103761481A (zh) * 2014-01-23 2014-04-30 北京奇虎科技有限公司 一种恶意代码样本自动处理的方法及装置
CN104598824A (zh) * 2015-01-28 2015-05-06 国家计算机网络与信息安全管理中心 一种恶意程序检测方法及其装置
CN107659570A (zh) * 2017-09-29 2018-02-02 杭州安恒信息技术有限公司 基于机器学习与动静态分析的Webshell检测方法及系统
CN108009425A (zh) * 2017-11-29 2018-05-08 四川无声信息技术有限公司 文件检测及威胁等级判定方法、装置及系统

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756693B2 (en) * 2011-04-05 2014-06-17 The United States Of America As Represented By The Secretary Of The Air Force Malware target recognition
US20170068816A1 (en) * 2015-09-04 2017-03-09 University Of Delaware Malware analysis and detection using graph-based characterization and machine learning
CN106778241B (zh) * 2016-11-28 2020-12-25 东软集团股份有限公司 恶意文件的识别方法及装置
CN107742079B (zh) * 2017-10-18 2020-02-21 杭州安恒信息技术股份有限公司 恶意软件识别方法及系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523588A (zh) * 2020-04-20 2020-08-11 电子科技大学 基于改进的lstm对apt攻击恶意软件流量进行分类的方法
CN111523588B (zh) * 2020-04-20 2022-04-29 电子科技大学 基于改进的lstm对apt攻击恶意软件流量进行分类的方法

Also Published As

Publication number Publication date
CN110619211A (zh) 2019-12-27

Similar Documents

Publication Publication Date Title
US10430586B1 (en) Methods of identifying heap spray attacks using memory anomaly detection
JP2020505707A (ja) 侵入検出のための継続的な学習
US10819720B2 (en) Information processing device, information processing system, information processing method, and storage medium
WO2019242441A1 (zh) 一种基于动态特征的恶意软件识别方法、系统及相关装置
CN108337153B (zh) 一种邮件的监控方法、系统与装置
CN111460445B (zh) 样本程序恶意程度自动识别方法及装置
US10783239B2 (en) System, method, and apparatus for computer security
US20170185785A1 (en) System, method and apparatus for detecting vulnerabilities in electronic devices
US20140195793A1 (en) Remotely Establishing Device Platform Integrity
JP2015513133A (ja) キャラクター・ヒストグラムを用いるスパム検出のシステムおよび方法
EP3455773A1 (en) Inferential exploit attempt detection
CN110149319B (zh) Apt组织的追踪方法及装置、存储介质、电子装置
WO2020134311A1 (zh) 一种恶意软件检测方法和装置
US11487868B2 (en) System, method, and apparatus for computer security
CN112511517A (zh) 一种邮件检测方法、装置、设备及介质
US11297083B1 (en) Identifying and protecting against an attack against an anomaly detector machine learning classifier
CN113378161A (zh) 一种安全检测方法、装置、设备及存储介质
JP2022089132A (ja) 情報セキュリティ装置及びその方法
WO2022267084A1 (zh) 基于大数据的网络安全检测方法及系统
WO2021130897A1 (ja) 分析装置、分析方法及び分析プログラムが格納された非一時的なコンピュータ可読媒体
CN103001848B (zh) 垃圾邮件过滤方法及装置
CN109067764A (zh) 一种建立设备表项的方法及装置
US20220182260A1 (en) Detecting anomalies on a controller area network bus
US10826923B2 (en) Network security tool
CN113849813A (zh) 数据检测方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19822187

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19822187

Country of ref document: EP

Kind code of ref document: A1