CN113761531A - Malicious software detection system and method based on distributed API (application program interface) feature analysis - Google Patents
Malicious software detection system and method based on distributed API (application program interface) feature analysis
- Publication number
- CN113761531A (application number CN202110951731.4A)
- Authority
- CN
- China
- Prior art keywords
- software
- sandbox
- malicious
- detection
- api
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/52—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow
- G06F21/53—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a malware detection system and method based on distributed API (application programming interface) feature analysis. It addresses two shortcomings of conventional static, single-machine malware detection: packed malware goes undetected, and a single-machine sandbox processes samples inefficiently. The basic idea is as follows. Combining the advantages of a distributed sandbox and dynamic API feature analysis, a distributed sandbox system with load balancing and health-state management is built to obtain the API execution sequences of many software samples efficiently. Dynamic and static features are then extracted from those API execution sequences. Finally, the extracted features are fed to convolutional neural networks with different receptive fields, which learn the sequence features of malicious function execution, and to a recurrent neural network, which learns the temporal behavior patterns of malicious function execution. The invention can dynamically detect the execution and behavior patterns of malicious functions and can effectively detect packed and obfuscated malicious programs.
Description
Technical Field
The invention relates to the field of network security, and in particular to a malware detection system and method based on distributed application programming interface (API) feature analysis.
Background
Malware is software designed to harm a computer, server, or computer network. Once implanted in or otherwise introduced into a target computer, it causes varying degrees of damage. Malware is installed and run without clearly notifying the user or obtaining the user's permission, and it exhibits malicious behaviors such as forced installation, browser hijacking, data theft, collection of sensitive user information, and malicious software bundling. Malware is a tool with which hackers commit cybercrime: the attacker deceives the user into downloading and running it, and thereby gains control of the user's host or steals private information. In recent years the open-sourcing of hacking tools has steadily lowered the barrier to attack, since anyone can easily obtain their source code from the network. Malware authors can therefore produce new malware at low time, technical, and economic cost, and this new malware inflicts heavy economic and security losses on individuals, society, and countries. Efficient malware detection is thus of great significance for protecting network security, personal property, and national stability.
To reduce the impact of malware on users and the network environment, many malware detection methods and patents have been proposed.
The invention patent with application number CN201610996935.9 discloses a sample type determination method for malware detection, comprising the following steps: 1) collecting sets of sample programs to form sample libraries, submitting the programs in the sample libraries to a virtual sandbox environment to run, and generating the corresponding sample analysis reports; 2) parsing the sample analysis reports, extracting specific feature-combination information, generating a feature vector set, and inputting the feature vector set into a classifier for training to obtain an optimal model; 3) inputting the program under test into the optimal model to determine whether it is a malicious program or a normal program. That invention is stated to improve the efficiency and accuracy of malware detection, to avoid the complex operations and high energy consumption of dynamic detection techniques, and to greatly increase detection speed while maintaining accuracy. However, it can detect only conventional malware samples and cannot effectively detect carefully disguised malware.
The invention patent with application number CN201810299726.8 discloses a method and system for detecting malware, the method comprising the following steps: 1) determining, from the installation package of the software under test, the functions corresponding to the permissions that the software requests; 2) installing and running the software under test in a test environment from its installation package, and monitoring its actions and characteristics in real time while it runs; 3) if the software under test acquires preset private information while performing the corresponding functions, and has non-functional characteristics that let it keep running at all times and/or automatically resume running after being forcibly terminated, preliminarily determining that it is malware. Detection accuracy is said to improve greatly because both conditions are checked: whether the software acquires the preset private information, and whether it has non-functional characteristics that keep it running or resume it after termination. However, that invention cannot dynamically detect the internal function call relationships of malware, and malware that uses a dynamic attack strategy is difficult to detect from static features alone.
At present, malware detection methods fall roughly into two categories, static analysis and dynamic analysis, according to whether the software is executed. Static analysis does not run the software sample under test; instead, an analysis tool extracts information from the sample, such as function call names, file structure information, import tables, strings, and control flow, and the sample is judged malicious or not from the extracted features. Static analysis is convenient and fast, but it has difficulty detecting metamorphic, polymorphic, packed, and obfuscated malware. Dynamic analysis records the actual execution flow of a software sample in a sandbox or virtual machine environment, monitors the run-time characteristics of the application during execution, and analyzes the recorded logs to find malicious behavior in the application. In summary, existing malware detection methods have the following major drawbacks:
(1) Traditional malware detection methods depend heavily on expert knowledge and cannot detect the ever-changing packed, obfuscated, and variant malware. Most traditional methods mark malicious samples or features with a signature mechanism: for example, the hash value of a software sample is looked up in a malicious signature library, or rules are matched against information such as the bytes and strings of the software. Signature rules are written manually by security experts from the salient features of known malware families. The signature library therefore cannot be updated and extended in real time and lags noticeably behind new threats; it can detect only software that security workers have already identified as malicious and added to the library. In addition, the signature library keeps growing as new samples appear, so the cost of querying and matching gradually increases.
(2) High-quality dynamic behavior training samples are lacking. Dynamic analysis needs high-quality software execution samples, and to collect them each software sample must be submitted to a sandbox and run for minutes or even tens of minutes, so acquiring dynamic behavior data is costly in time and resources. Training today's popular machine learning and deep learning models places high demands on the quality and quantity of data, yet a large-scale, complete dynamic training data set is currently lacking. Moreover, most data in existing public data sets has already been preprocessed by the publisher, so analysis is limited to the information provided and further information about the original files cannot be recovered. Some data sets are labeled only as malicious or benign, so malware cannot be divided more finely into specific types.
(3) Detection based on dynamic behavior alone has limitations. Some software hides itself after being placed in a sandbox by delaying its execution, so that it does not generate sufficient behavior data; other software performs its malicious behavior only after a certain condition is triggered. When samples are run at large scale, the running state of each individual sample cannot be examined. Detection that relies solely on dynamic behavior can therefore miss malware, and packed, variant, and obfuscated malware in particular can easily evade a detection system that uses a single method.
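The brittleness of hash-based signature matching described in drawback (1) can be illustrated with a minimal sketch. The signature library contents and the function name `is_known_malware` are hypothetical and chosen only for illustration; the source does not specify which hash function a signature library uses, so SHA-256 here is an assumption.

```python
import hashlib

# Hypothetical signature library: SHA-256 digests of files already known to be malicious.
KNOWN_BAD = {hashlib.sha256(b"malicious payload v1").hexdigest()}

def is_known_malware(data: bytes) -> bool:
    """Return True only if this exact file content is already in the library."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD

original = b"malicious payload v1"
variant = original + b"\x00"  # one appended byte; the digest no longer matches

print(is_known_malware(original))
print(is_known_malware(variant))
```

Because any change to the file content changes the digest, a packed or slightly modified variant is missed until its own digest is added to the library, which is the lag the drawback describes.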
Taking into account the advantages and disadvantages of multiple malware detection algorithms, the invention provides a malware detection system and method based on distributed API feature analysis.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a malware detection system and method based on distributed API feature analysis. It addresses two shortcomings of conventional static, single-machine malware detection: packed malware goes undetected, and a single-machine sandbox processes samples inefficiently. The basic idea is as follows. Combining the advantages of a distributed sandbox and dynamic API feature analysis, a distributed sandbox system with load balancing and health-state management is built to obtain the API execution sequences of many software samples efficiently. Dynamic and static features are then extracted from those API execution sequences. Finally, the extracted features are fed to convolutional neural networks with different receptive fields, which learn the sequence features of malicious function execution, and to a recurrent neural network, which learns the temporal behavior patterns of malicious function execution. The invention can dynamically detect the execution and behavior patterns of malicious functions and can effectively detect packed and obfuscated malicious programs.
To achieve this purpose, the invention provides the following technical solution:
A malware detection system based on distributed API feature analysis comprises a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module. The sample downloading module uses publicly available, commonly used data crawling techniques to crawl a large number of malware samples carrying specific type labels from well-known public software analysis service websites, and obtains benign software samples from public software download repositories; the task issuing module issues the collected software samples to suitable sandbox nodes according to a custom load balancing strategy in order to obtain the execution trace of each sample; the sandbox system state monitoring module monitors the running state of every sandbox node in real time and sends alarm information to the central server when a node's state deteriorates; the sandbox system task scheduling module issues each sample execution task to the sandbox node with the best performance and highest efficiency according to a load balancing and resource utilization optimization strategy; the feature extraction module extracts the behavior features of each software sample, such as function calls, file operations, process execution, and network requests, from the sandbox run report; and the detection report generation module trains an automated, integrated malware detection model to detect malware and generate the corresponding detection report.
Further, in the malware detection system based on distributed API feature analysis, the sample downloading module on the one hand uses publicly available, commonly used data crawling techniques to crawl a large number of malware samples carrying specific type labels from well-known public software analysis service websites, and on the other hand uses a purpose-written automated crawler to obtain benign software from public software download repositories. Malicious and benign software are then drawn in equal proportion from the collected samples as training samples and input into the distributed sandbox to obtain the API execution sequence of each sample.
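Drawing malicious and benign samples in equal proportion can be sketched as follows. This is only an illustration under stated assumptions: the function name `balanced_training_set`, the label encoding (1 for malicious, 0 for benign), and the use of random down-sampling of the larger class are not taken from the source.

```python
import random

def balanced_training_set(malicious, benign, seed=0):
    """Return a shuffled list of (sample, label) pairs with equal class counts."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    n = min(len(malicious), len(benign))
    samples = [(s, 1) for s in rng.sample(malicious, n)]   # label 1: malicious
    samples += [(s, 0) for s in rng.sample(benign, n)]     # label 0: benign
    rng.shuffle(samples)
    return samples

data = balanced_training_set(["m1", "m2", "m3"], ["b1", "b2"])
print(len(data), sum(label for _, label in data))
```

Down-sampling discards data from the larger class; other balancing schemes are possible, and the source does not say which one is used.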
Further, in the malware detection system based on distributed API feature analysis, the task issuing module issues the samples to be detected according to the load balancing state of the distributed sandbox. The task issuing module issues the collected software samples in batches and also receives software samples uploaded by users. For each sample it adds a record to the task database and selects an optimal sandbox node through the scheduling module. It then periodically tracks the status of the task and updates the task database as the API behavior execution record is produced, until the behavior data of the software sample is finally input into the detection model and the final detection result and detection report are obtained.
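The task record and its status updates could look roughly like the sketch below. The table layout, the four status names, and the helper names (`open_task_db`, `submit_task`, `advance`) are all hypothetical; the source only states that a record is added to a task database and updated as the task progresses. An in-memory SQLite database stands in for whatever database the system actually uses.

```python
import sqlite3

# Hypothetical task life cycle, in order.
STATES = ("pending", "running", "reported", "detected")

def open_task_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, sample TEXT, node TEXT, status TEXT)"
    )
    return db

def submit_task(db, sample, node):
    """Add one record for a sample assigned to a sandbox node; return the task id."""
    cur = db.execute(
        "INSERT INTO tasks (sample, node, status) VALUES (?, ?, ?)",
        (sample, node, STATES[0]),
    )
    return cur.lastrowid

def advance(db, task_id):
    """Move a task to the next state (the final state is absorbing)."""
    (status,) = db.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    nxt = STATES[min(STATES.index(status) + 1, len(STATES) - 1)]
    db.execute("UPDATE tasks SET status = ? WHERE id = ?", (nxt, task_id))
    return nxt

db = open_task_db()
task = submit_task(db, "sample.exe", "node-1")
print(advance(db, task))
```

In a real deployment the periodic tracking would poll the sandbox node for the task's state instead of stepping through a fixed list.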
Further, in the malware detection system based on distributed API feature analysis, the sandbox system state monitoring module monitors the running health of every sandbox node in real time, collects data from the database, integrates the information, and passes it through an HTTP (Hypertext Transfer Protocol) interface to a front-end page for display; the same data serves as a reference for the scheduling module's decisions. The monitoring module is responsible for task status statistics, historical node load statistics, current node task status statistics, node hardware status statistics (such as disk and memory), sample detection results, and so on, and it ensures that each call to the interface returns the latest state of the system. The module monitors along two dimensions, tasks and nodes. The monitored state can be used as input to the subsequent task scheduling algorithm, which helps quantify the performance differences between nodes at the current moment and adjust node loads in real time. The monitoring module is also configured with an automatic alarm function: when the utilization of any hardware resource on a node becomes too high, alarm information is automatically sent to the administrator's mailbox.
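The automatic alarm can be sketched as a threshold check over per-node hardware utilization. The 0.9 threshold, the `NodeStatus` fields (the source names disk and memory; CPU is added here as an assumption), and the function names are illustrative. The sketch only builds the alert messages; sending them by e-mail is left out.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    cpu: float      # utilization in [0, 1]; assumed metric, not named in the source
    memory: float   # utilization in [0, 1]
    disk: float     # utilization in [0, 1]

def overloaded_resources(status, threshold=0.9):
    """List the resources of one node whose utilization exceeds the threshold."""
    metrics = {"cpu": status.cpu, "memory": status.memory, "disk": status.disk}
    return [name for name, value in metrics.items() if value > threshold]

def build_alerts(statuses, threshold=0.9):
    """Return one alert message per node that has at least one overloaded resource."""
    alerts = []
    for status in statuses:
        hot = overloaded_resources(status, threshold)
        if hot:
            alerts.append(f"{status.name}: high utilisation of {', '.join(hot)}")
    return alerts

alerts = build_alerts([
    NodeStatus("node-1", cpu=0.50, memory=0.95, disk=0.20),
    NodeStatus("node-2", cpu=0.10, memory=0.10, disk=0.10),
])
print(alerts)
```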
Further, in the malware detection system based on distributed API feature analysis, the sandbox system task scheduling module arranges the tasks of the different sandbox nodes with a customized load balancing strategy. The task scheduling module uses a classic client/server architecture and monitors the working health of the sandbox cluster in real time. Specifically, when the server is in administrator task mode, it issues software samples in batches to the corresponding sandbox nodes according to the load capacity and resource utilization of each node and an optimization strategy, which raises the utilization of every sandbox node and ensures the stability and efficiency of the whole sandbox cluster. When the server is in ordinary user task mode, a task submitted by a user through the client is added to a waiting queue; the server polls the queue, takes the earliest submitted task from its head, and assigns an idle virtual machine to execute it.
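The two working modes can be sketched as follows. The source does not disclose its optimization strategy, so the batch mode here uses a simple assumed rule (send each sample to the node with the lowest ratio of running tasks to capacity); the function names and data shapes are likewise hypothetical. The user mode follows the first-in, first-out queue described in the text.

```python
from collections import deque

def dispatch_batch(samples, node_load, capacity):
    """Administrator mode: assign each sample to the least relatively loaded node."""
    load = dict(node_load)  # copy so the caller's view is not modified
    plan = {}
    for sample in samples:
        node = min(load, key=lambda n: load[n] / capacity[n])
        plan[sample] = node
        load[node] += 1     # account for the task just assigned
    return plan

def serve_user_queue(queue, idle_vms):
    """User mode: take tasks from the head of the queue while idle VMs remain."""
    assignments = []
    while queue and idle_vms:
        assignments.append((queue.popleft(), idle_vms.pop(0)))
    return assignments

plan = dispatch_batch(["a", "b", "c", "d"], {"n1": 0, "n2": 2}, {"n1": 4, "n2": 4})
pending = deque(["t1", "t2", "t3"])
served = serve_user_queue(pending, ["vm-1", "vm-2"])
print(plan, served, list(pending))
```

A production scheduler would likely weigh the monitored hardware metrics as well as the task count; the ratio used here is the simplest stand-in.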
Further, in the malware detection system based on distributed API feature analysis, the feature extraction module extracts the execution sequence features of each sample as it runs in the sandbox. Specifically, each software sample issued to the sandbox cluster is run in a sandbox for three minutes. The function calls, file operations, process executions, network requests, and other actions of the sample during execution are recorded and stored in the corresponding database for backup. At the same time, the data is analyzed, dynamic and static features are extracted, and the actions are mapped to computable numerical vectors. The dynamic and static numerical vectors are then input into the malware detection model provided by the invention.
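Mapping recorded actions to a computable numerical vector can be sketched with an integer vocabulary over API names. The source does not describe its encoding, so the vocabulary scheme, the `<pad>` and `<unk>` tokens, the fixed length, and the function names are assumptions; the API names are ordinary Windows API names used only as sample data.

```python
def build_vocab(sequences):
    """Assign each distinct API name an integer id; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for api in seq:
            vocab.setdefault(api, len(vocab))
    return vocab

def encode(seq, vocab, length):
    """Truncate or pad one API call sequence to a fixed-length list of ids."""
    ids = [vocab.get(api, vocab["<unk>"]) for api in seq[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))

training_traces = [
    ["CreateFileW", "WriteFile"],
    ["CreateFileW", "connect"],
]
vocab = build_vocab(training_traces)
vector = encode(["CreateFileW", "connect", "RegSetValueExW"], vocab, length=5)
print(vector)  # [2, 4, 1, 0, 0]
```

API names unseen during vocabulary construction map to `<unk>`, so new samples can still be encoded.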
Further, in the malware detection system based on distributed API feature analysis, the detection report generation module combines multiple efficient classification models through a stacking model to detect malware efficiently. Specifically, a fusion detection model based on dynamic and static features is designed. Static information such as strings, referenced dynamic link libraries, and assembly sequences is extracted from each sample with an analysis tool, and the dynamic API function call sequence is extracted by running the software in the sandbox. A convolutional neural network learns the static malware features, and a recurrent neural network learns the temporal behavior patterns of dynamic malicious API calls. Finally, an attention mechanism and a stacking algorithm fuse the multiple base models, so that packed, variant, and obfuscated malware can be detected effectively. A detailed detection report is also generated automatically, giving the user a clear visual presentation of the detection result.
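The notion of convolutions with different receptive fields can be illustrated without a deep learning framework. The sketch below slides fixed kernels of width 2 and 3 over a numeric sequence and keeps the maximum response of each (max-over-time pooling). It is not the patented model: in a trained network the kernel weights are learned, whereas here they are hand-set, and the function names are hypothetical.

```python
def conv1d_max(seq, kernel):
    """Slide one kernel over the sequence and keep its maximum response."""
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence shorter than kernel")
    responses = [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]
    return max(responses)

def multi_scale_features(seq, kernels):
    """One pooled feature per kernel; wider kernels see longer call patterns."""
    return [conv1d_max(seq, kernel) for kernel in kernels]

# 1 marks a call of interest in a toy trace; kernels of width 2 and 3 count runs.
trace = [0, 1, 0, 1, 1, 1, 0]
features = multi_scale_features(trace, kernels=[[1, 1], [1, 1, 1]])
print(features)  # [2, 3]
```

The width-3 kernel responds most strongly to the run of three consecutive calls, which the width-2 kernel cannot fully capture; this is the motivation for combining several receptive fields.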
A malware detection method based on distributed API (application programming interface) feature analysis, characterized by comprising the following steps:
Step (1), collecting software samples: crawling a large number of malware samples carrying specific type labels from well-known public software analysis service websites with publicly available, commonly used data crawling techniques, then writing an automated crawler to obtain benign software from public software download repositories, mixing the collected malicious and benign software samples in equal proportion, and constructing a training sample data set;
Step (2), submitting software samples: submitting the collected malicious and benign software samples to a distributed sandbox cluster, issuing the samples to suitable sandbox nodes according to a load balancing and optimized resource allocation strategy, storing the run report of each software sample in the corresponding database for backup, monitoring the running state and health of every sandbox in real time, and returning the running status of the software samples and the health of the sandbox nodes to a central server in real time;
Step (3), constructing dynamic and static malware features: based on the software run reports of step (2), recording the function calls, file operations, process executions, network requests, network traffic, and other behaviors of each software sample during execution, storing them in the corresponding database for backup, at the same time analyzing the data and extracting dynamic and static features, and representing and mapping the behaviors as computable numerical vectors;
Step (4), training the basic malware detection models: designing a fusion detection model based on dynamic and static features, extracting static information such as strings, referenced dynamic link libraries, and assembly sequences from each sample with an analysis tool, extracting the dynamic API function call sequence by running the software in the sandbox, learning the static malware features with a convolutional neural network, learning the temporal behavior patterns of dynamic malicious API calls with a recurrent neural network, and thereby obtaining basic detection models based on the static and the dynamic features respectively;
Step (5), detecting malware: based on the basic malware detection models of step (4), learning the weight of each basic detection model with a stacking algorithm and an attention mechanism, and training a malware detection model that integrates the advantages of the multiple detection models, so as to detect unknown software and, in particular, packed and obfuscated malware.
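The weighting of base models in step (5) can be sketched as a softmax-weighted average of their predicted probabilities. This is a simplified stand-in, not the patented fusion: the attention scores are given by hand here, whereas in a stacking setup they (or a meta-learner) would be fitted on held-out predictions of the base models. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(base_probs, attention_scores):
    """Weighted average of the base models' malicious-class probabilities."""
    weights = softmax(attention_scores)
    return sum(w * p for w, p in zip(weights, base_probs))

# Static (CNN) model says 0.2, dynamic (RNN) model says 0.8.
equal = fuse([0.2, 0.8], attention_scores=[0.0, 0.0])
dynamic_favoured = fuse([0.2, 0.8], attention_scores=[0.0, 10.0])
print(round(equal, 3), round(dynamic_favoured, 3))
```

With equal scores the fused output is the plain mean; raising the score of the dynamic model moves the result toward its prediction, which is the behavior one would want for a packed sample whose static features are uninformative.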
The beneficial effects of the invention are as follows:
1) The invention departs from traditional malware detection methods based on signature mechanisms and static features, and designs and implements a malware detection method based on distributed API feature analysis. By deploying a distributed, self-scheduling, and self-monitoring sandbox cluster, the API execution sequences of a large number of malicious and benign software samples are obtained efficiently, providing data support for learning the execution patterns of malware.
2) The method provided by the invention effectively solves the problem that signature mechanisms and static feature detection methods cannot detect packed, variant, and obfuscated malware.
3) The invention describes malware with dynamic and static features together, so both the static statistical features and the temporal behavior features of malware are captured effectively. A real distributed sandbox cluster can capture the behavior of malware that delays its execution, and the detection model designed by the invention can learn the real execution behavior of malware, which improves the accuracy and efficiency of malware detection.
4) Practical application of a prototype system indicates that the invention can effectively detect carefully disguised malware, in particular packed, variant, and obfuscated malware. The scheme is easy to deploy in existing networks, simple to operate, safe, and reliable, and it offers significant economic and social benefits and broad prospects for market adoption.
Drawings
FIG. 1 is a block diagram of the overall architecture of the malware detection system and method based on distributed API feature analysis according to the invention.
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings so that those skilled in the art can understand them more clearly; the description does not limit the scope of the invention.
At present, malware detection methods fall roughly into two categories, static analysis and dynamic analysis, according to whether the software is executed. Static analysis does not run the software sample under test; instead, an analysis tool extracts information from the sample, such as function call names, file structure information, import tables, strings, and control flow, and the sample is judged malicious or not from the extracted features. Static analysis is convenient and fast, but it has difficulty detecting metamorphic, polymorphic, packed, and obfuscated malware. Dynamic analysis records the actual execution flow of a software sample in a sandbox or virtual machine environment, monitors the temporal run-time characteristics of the sample program during execution, and analyzes the log information to find malicious behavior in the sample program. Taking into account the advantages and disadvantages of multiple malware detection algorithms, the invention provides a malware detection system and method based on distributed API feature analysis.
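One of the static features named above, embedded strings, can be extracted without running the sample. The sketch below scans raw bytes for runs of printable ASCII characters, similar in spirit to the Unix `strings` utility. The minimum length of 4 and the function name are assumptions, and the input bytes are made-up sample data.

```python
import re

def printable_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII characters found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]

blob = b"\x00\x01kernel32.dll\x00ab\x00http://x.test\xff"
print(printable_strings(blob))  # ['kernel32.dll', 'http://x.test']
```

A packer that compresses or encrypts the program body removes most such strings from the file on disk, which is one reason static features alone struggle with packed samples.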
First, the innovative principles of the technology of the present invention are explained, and the basic ideas are as follows: combining the advantages of the distributed sandbox and the dynamic API characteristic analysis, and building a distributed sandbox system with load balancing and health state management to efficiently obtain an API execution sequence of a plurality of software samples; then extracting dynamic and static characteristics from the extracted API execution sequence; and finally, inputting the extracted dynamic and static characteristics into the sequence characteristics executed by the convolutional neural network learning malicious function of different receptive fields, and learning the time sequence behavior pattern executed by the malicious function by using the cyclic neural network. The invention can dynamically detect the malicious function execution mode and behavior mode, and can effectively detect the malicious program after the shell adding and the confusion.
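The idea of convolutions with different receptive fields can be sketched in plain Python. This is only an illustration under stated assumptions: the kernels below are hand-picked rather than learned, each API call is reduced to a single hypothetical scalar, and the function names are invented for this example. A real model would learn its kernels from embedded API sequences.

```python
from typing import List, Sequence


def conv1d_max(seq: Sequence[float], kernel: Sequence[float]) -> float:
    """Slide the kernel over seq (no padding) and return the largest response."""
    k = len(kernel)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the kernel")
    responses = [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]
    return max(responses)  # global max pooling over all window positions


def multi_receptive_field_features(
    seq: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Apply kernels of different widths and keep one pooled feature per kernel."""
    return [conv1d_max(seq, kernel) for kernel in kernels]


# Hypothetical signal: 1.0 marks a suspicious API call, 0.0 a neutral one.
api_signal = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
# Kernel widths 2, 3, and 4 stand in for three receptive fields.
kernels = [[1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
features = multi_receptive_field_features(api_signal, kernels)
print(features)
```

Each kernel width responds to call patterns of a different length, so the pooled features together describe short and long bursts of activity in the same sequence.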
The malware detection system based on distributed API feature analysis according to the present invention is shown in FIG. 1. Its workflow is as follows:
1) Software sample collection: a large number of malware samples carrying specific type labels are crawled from publicly known software analysis service websites using openly available, common data crawling techniques; an automated crawler is then written to obtain benign software from public software download repositories; the collected malicious and benign samples are mixed in equal proportion to construct a training data set.
2) Software sample submission: the collected malicious and benign samples are submitted to the distributed sandbox cluster and issued to appropriate sandbox nodes according to load balancing and an optimized resource allocation strategy; the run report of each sample is stored in the corresponding database as a backup; the running state and health of each sandbox are monitored in real time, and the running status of the samples and the health of the sandbox nodes are returned to the central server in real time.
3) Construction of dynamic and static malware features: based on the run reports from step 2), behaviors such as function calls, file operations, process execution, network requests, and network traffic that occur while each sample executes are recorded and stored in the corresponding database as a backup; at the same time, the data is parsed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors.
4) Training of base malware detection models: a fusion detection model based on dynamic and static features is designed; static information such as character strings, referenced dynamic link libraries, and assembly sequences is extracted from each sample with an analysis tool, and the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features and a recurrent neural network learns the dynamic time-series behavior patterns of malicious API calls, yielding base detection models built on the static and the dynamic features respectively.
5) Malware detection: building on the base detection models from step 4), a stacking algorithm and a self-attention mechanism are used to learn the weight of each base model, and a malware detection model that integrates the strengths of the individual models is trained to detect unknown software, in particular packed and obfuscated malware.
The invention departs from traditional malware detection methods based on signature mechanisms and static features, and designs and implements a malware detection method based on distributed API feature analysis. By deploying a distributed, self-scheduling, self-monitoring sandbox cluster, it efficiently obtains the API execution sequences of large numbers of malicious and benign software samples, providing data support for learning how malware executes. The method can effectively address the inability of signature-based and static-feature detection to detect packed, mutated, or obfuscated malware. The invention characterizes malware using dynamic and static features together, so that both its static statistical features and its time-series behavior features are captured; the real distributed sandbox cluster can capture the behavior of malware that delays its execution; and the detection model can learn the real characteristics of malware execution behavior, which improves the accuracy and efficiency of malware detection. Practical application of the prototype system shows that the invention can effectively detect carefully disguised malware, particularly malware that has been packed, mutated, or obfuscated. The scheme is easy to deploy in existing networks, simple to operate, safe and reliable, and offers notable economic and social benefits and broad prospects for market adoption.
The structure and working process of the malware detection system and method based on distributed API feature analysis are described in detail below with reference to the accompanying drawing, by way of the following preferred embodiments.
PREFERRED EMBODIMENTS FOR CARRYING OUT THE INVENTION
As shown in FIG. 1, in a preferred embodiment, the malware detection system based on distributed API feature analysis according to the present invention includes: a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module.
The sample downloading module, on the one hand, crawls a large number of malware samples carrying specific type labels from publicly known software analysis service websites using openly available, common data crawling techniques; on the other hand, it uses a purpose-written automated crawler to obtain benign software from public software download repositories. Equal proportions of malicious and benign software are drawn from the collected samples as training samples and input into the distributed sandbox to obtain the API execution sequence of each sample.
The task issuing module issues the samples to be tested according to the load balancing state of the distributed sandbox. It issues collected software samples in batches and receives software samples uploaded by users. For each sample, it adds a record to the task database and selects an optimal sandbox node through the scheduling module. After the API behavior execution has been recorded, it updates the task database by periodically tracking the task state until the behavior data of the sample is finally input into the detection model and the final detection result and detection report are obtained.
The sandbox system state monitoring module monitors the running health of each sandbox node in real time. It collects data from the database, consolidates the information, and exposes it through an HTTP interface to the front-end interface for display; the same data serves as a reference for the scheduling module's decisions. The module is responsible for task state statistics, node historical load statistics, node current task state statistics, node hardware state statistics (such as disk and memory), sample detection results, and so on, and ensures that every call to the interface returns the latest state of the system. Monitoring covers two dimensions, tasks and nodes. The monitored state can be used as input to the subsequent task scheduling algorithm, which helps quantify the performance differences between nodes at the current moment and adjust node loads in real time. The monitoring module is also configured with an automatic alarm function: when the utilization of any hardware resource on a node is too high, an alarm message is sent automatically to the administrator's mailbox.
The sandbox system task scheduling module monitors the working health of the sandbox cluster in real time using a classic client/server architecture. Specifically, when the server is in administrator task mode, it issues software samples in batches to the corresponding sandbox nodes according to an optimization strategy that accounts for the capacity and resource utilization of each node, which improves the utilization of each sandbox node and ensures the stability and efficiency of the whole cluster. When the server is in ordinary user task mode, a task submitted by a user through the client is added to a waiting queue; the server polls the queue, takes the earliest submitted task from its head, and assigns a virtual machine in the idle state to execute it.
The feature extraction module extracts the features of each sample's execution sequence while the sample runs in the sandbox. Specifically, each software sample issued to the sandbox cluster is run in a sandbox for three minutes. The function calls, file operations, process executions, network requests, and similar behaviors that occur during execution are recorded and stored in the corresponding database as a backup; at the same time, the data is parsed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors. The dynamic and static numerical vectors are then input into the malware detection model provided by the invention.
The detection report generation module uses a stacking model to combine several efficient classification models and thereby detect malware efficiently. Specifically, a fusion detection model based on dynamic and static features is designed: static information such as character strings, referenced dynamic link libraries, and assembly sequences is extracted from each sample with an analysis tool; the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features; a recurrent neural network learns the dynamic time-series behavior patterns of malicious API calls; finally, an attention mechanism and a stacking algorithm fuse the base models, enabling effective detection of packed, mutated, and obfuscated malware. A detailed detection report is generated automatically, giving the user a clear visual presentation of the detection result.
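The equal-proportion mixing of malicious and benign samples can be sketched as follows. This is a minimal illustration, assuming samples are identified by file names and that the larger class is randomly down-sampled to the size of the smaller one; the patent does not state how the equal proportion is reached, and the function name is hypothetical.

```python
import random
from typing import List, Tuple


def build_balanced_dataset(
    malicious: List[str], benign: List[str], seed: int = 0
) -> List[Tuple[str, int]]:
    """Mix malicious (label 1) and benign (label 0) samples in equal proportion."""
    n = min(len(malicious), len(benign))  # size of the smaller class
    rng = random.Random(seed)             # fixed seed keeps the split reproducible
    dataset = [(name, 1) for name in rng.sample(malicious, n)]
    dataset += [(name, 0) for name in rng.sample(benign, n)]
    rng.shuffle(dataset)
    return dataset


# Hypothetical sample identifiers.
training_set = build_balanced_dataset(["m1.exe", "m2.exe", "m3.exe"], ["b1.exe", "b2.exe"])
print(training_set)
```

Down-sampling is only one way to balance classes; over-sampling the smaller class or weighting the loss are alternatives that keep more data.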
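The automatic alarm can be sketched as a simple threshold check over a node's hardware metrics. The metric names, the 90% thresholds, and the function name are assumptions for illustration; the patent only says that an alarm is raised when a utilization index is "too high". Sending the message to the administrator's mailbox is left out and could be done with Python's standard `smtplib`.

```python
from typing import Dict, List


def check_node_health(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return one alarm message per resource whose utilization exceeds its limit."""
    alarms = []
    for resource, limit in thresholds.items():
        usage = metrics.get(resource, 0.0)  # a missing metric is treated as 0 here
        if usage > limit:
            alarms.append(f"{resource} utilization {usage:.0%} exceeds limit {limit:.0%}")
    return alarms


# Hypothetical snapshot of one sandbox node (fractions of total capacity).
node_metrics = {"cpu": 0.95, "memory": 0.40, "disk": 0.91}
limits = {"cpu": 0.90, "memory": 0.90, "disk": 0.90}
for message in check_node_health(node_metrics, limits):
    print(message)
```

The same snapshot could also be passed to the scheduling module, which matches the text's point that monitoring data doubles as scheduling input.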
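The two working modes described above can be sketched as follows. The sketch assumes a least-relative-load rule for administrator batches and a first-in, first-out queue for user tasks; the patent does not disclose its actual optimization strategy, and the class and function names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class SandboxNode:
    name: str
    capacity: int      # assumed metric: maximum number of concurrent samples
    running: int = 0   # samples currently executing on this node

    @property
    def load(self) -> float:
        return self.running / self.capacity


def dispatch_batch(samples: List[str], nodes: List[SandboxNode]) -> Dict[str, str]:
    """Administrator mode: send each sample to the least-loaded node with free capacity."""
    assignment: Dict[str, str] = {}
    for sample in samples:
        candidates = [node for node in nodes if node.running < node.capacity]
        if not candidates:
            break  # cluster is full; remaining samples stay unassigned
        target = min(candidates, key=lambda node: node.load)
        target.running += 1
        assignment[sample] = target.name
    return assignment


class UserTaskQueue:
    """User mode: tasks wait in FIFO order until a virtual machine is idle."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def submit(self, task: str) -> None:
        self._queue.append(task)

    def poll(self, idle_vms: List[str]) -> Optional[Tuple[str, str]]:
        """Pair the earliest task with an idle VM, or return None if either is missing."""
        if self._queue and idle_vms:
            return self._queue.popleft(), idle_vms.pop(0)
        return None


nodes = [SandboxNode("node-a", capacity=4), SandboxNode("node-b", capacity=2)]
print(dispatch_batch(["s1", "s2", "s3"], nodes))
```

Using relative load rather than raw task counts lets nodes with more capacity absorb more samples, which is one plausible reading of issuing samples "according to the capacity and resource utilization of each node".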
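Mapping recorded behavior into numerical vectors can be sketched as follows. The encoding shown here (an integer ID sequence for the time-ordered view and a per-API count vector for the statistical view) is an assumption; the patent does not specify its encoding. The API names in the trace are ordinary Windows and socket call names used only as sample data.

```python
from typing import Dict, List, Sequence

PAD, UNK = 0, 1  # reserved IDs for padding and for API names not seen in training


def build_vocab(traces: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct API name an integer ID in order of first appearance."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for trace in traces:
        for api in trace:
            vocab.setdefault(api, len(vocab))
    return vocab


def encode_sequence(trace: Sequence[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Time-ordered view: truncate or pad the trace to max_len integer IDs."""
    ids = [vocab.get(api, UNK) for api in trace[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def count_vector(trace: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Statistical view: how often each known API was called."""
    counts = [0] * len(vocab)
    for api in trace:
        counts[vocab.get(api, UNK)] += 1
    return counts


# Hypothetical trace recorded during one sandbox run.
trace = ["CreateFileW", "WriteFile", "WriteFile", "connect"]
vocab = build_vocab([trace])
print(encode_sequence(trace, vocab, max_len=6))
print(count_vector(trace, vocab))
```

The ID sequence preserves call order for a sequence model, while the count vector discards order and suits models that work on fixed-size statistics.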
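The fusion of base models can be illustrated with a softmax-weighted average of their output probabilities. This is a simplified stand-in, not the patent's model: in a trained system the attention logits would be learned by the stacking stage, whereas here they are fixed, hypothetical numbers, and the function names and the 0.5 decision threshold are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(
    base_probs: Sequence[float],
    attention_logits: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Weight each base model's malware probability and compare with the threshold."""
    weights = softmax(attention_logits)
    score = sum(w * p for w, p in zip(weights, base_probs))
    return score, score >= threshold


# Hypothetical outputs: a static (CNN) model says 0.30, a dynamic (RNN) model says 0.90.
score, is_malicious = fuse_predictions([0.30, 0.90], attention_logits=[0.0, 2.0])
print(round(score, 3), is_malicious)
```

Giving the dynamic model a larger weight lets its verdict dominate when a packed sample yields weak static evidence, which is the motivation the text gives for combining the two kinds of features.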
The invention further provides a malware detection method based on distributed API feature analysis, comprising the following steps:
step (1), software sample collection: a large number of malware samples carrying specific type labels are crawled from publicly known software analysis service websites using openly available, common data crawling techniques; an automated crawler is then written to obtain benign software from public software download repositories; the collected malicious and benign samples are mixed in equal proportion to construct a training data set;
step (2), software sample submission: the collected malicious and benign samples are submitted to the distributed sandbox cluster and issued to appropriate sandbox nodes according to load balancing and an optimized resource allocation strategy; the run report of each sample is stored in the corresponding database as a backup; the running state and health of each sandbox are monitored in real time, and the running status of the samples and the health of the sandbox nodes are returned to the central server in real time;
step (3), construction of dynamic and static malware features: based on the run reports from step (2), behaviors such as function calls, file operations, process execution, and network requests that occur while each sample executes are recorded and stored in the corresponding database as a backup; at the same time, the data is parsed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors;
step (4), training of base malware detection models: a fusion detection model based on dynamic and static features is designed; static information such as character strings, referenced dynamic link libraries, and assembly sequences is extracted from each sample with an analysis tool, and the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features and a recurrent neural network learns the dynamic time-series behavior patterns of malicious API calls, yielding base detection models built on the static and the dynamic features respectively;
step (5), malware detection: building on the base detection models from step (4), a stacking algorithm and an attention mechanism are used to learn the weight of each base model, and a malware detection model that integrates the strengths of the individual models is trained to detect unknown software, in particular packed and obfuscated malware.
The above describes only preferred embodiments of the present invention, and the technical solution of the invention is not limited to them. Any modification made by those skilled in the art on the basis of the main technical idea of the invention falls within its technical scope; the specific scope of protection is defined by the claims.
Claims (8)
1. A malware detection system based on distributed API (application programming interface) feature analysis, comprising: a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module.
2. The malware detection system based on distributed API feature analysis of claim 1, wherein the sample downloading module crawls a large number of malware samples carrying specific type labels from publicly known software analysis service websites using openly available, common data crawling techniques, and obtains benign software from public software download repositories.
3. The malware detection system based on distributed API feature analysis of claim 1, wherein the task issuing module allocates a sample to be tested to an optimal sandbox node according to the load balancing state of the distributed sandbox, and periodically monitors the execution state of the current task until the API execution of the sample is complete.
4. The malware detection system based on distributed API feature analysis of claim 1, wherein the sandbox system state monitoring module monitors the running health of each sandbox node in real time, is responsible for task state statistics, node historical load statistics, node current task state statistics, node hardware state statistics (including disk, CPU, memory, etc.), sample detection results, etc., and is configured with an automatic alarm function that raises an alarm automatically when the utilization of any hardware resource on a node is too high.
5. The malware detection system based on distributed API feature analysis of claim 1, wherein the sandbox system task scheduling module orchestrates the tasks of different sandbox nodes using an individualized load balancing strategy, monitors the working health of the sandbox cluster in real time using a client/server architecture, and issues software samples in batches to the corresponding sandbox nodes according to an optimization strategy that accounts for the capacity and resource utilization of each sandbox node, thereby improving the utilization of each sandbox node and ensuring the stability and efficiency of the whole sandbox cluster.
6. The malware detection system based on distributed API feature analysis of claim 1, wherein the feature extraction module extracts the API execution sequence features of a sample to be tested, the behaviors that occur during execution of the software sample, including but not limited to function calls, file operations, process executions, and network requests, being extracted and mapped into a computable numerical vector.
7. The malware detection system based on distributed API feature analysis of claim 1, wherein the detection report generation module learns the static features of malware using a convolutional neural network, learns the dynamic time-series behavior patterns of malware using an attention-based recurrent neural network, finally fuses a plurality of base models using a stacking algorithm, generates a detection report, and records the malicious execution sequences and the final detection result.
8. A malware detection method based on distributed API (application programming interface) feature analysis, characterized by comprising the following steps:
step (1), software sample collection: a large number of malware samples carrying specific type labels are crawled from publicly known software analysis service websites using openly available, common data crawling techniques; an automated crawler is then written to obtain benign software from public software download repositories; the collected malicious and benign samples are mixed in equal proportion to construct a training data set;
step (2), software sample submission: the collected malicious and benign samples are submitted to the distributed sandbox cluster and issued to appropriate sandbox nodes according to load balancing and an optimized resource allocation strategy; the run report of each sample is stored in the corresponding database as a backup; the running state and health of each sandbox are monitored in real time, and the running status of the samples and the health of the sandbox nodes are returned to the central server in real time;
step (3), construction of dynamic and static malware features: based on the run reports from step (2), behaviors such as function calls, file operations, process execution, and network requests that occur while each sample executes are recorded and stored in the corresponding database as a backup; at the same time, the data is parsed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors;
step (4), training of base malware detection models: a fusion detection model based on dynamic and static features is designed; static information such as character strings, referenced dynamic link libraries, and assembly sequences is extracted from each sample with an analysis tool, and the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features and a recurrent neural network learns the dynamic time-series behavior patterns of malicious API calls, yielding base detection models built on the static and the dynamic features respectively;
step (5), malware detection: building on the base detection models from step (4), a stacking algorithm and an attention mechanism are used to learn the weight of each base model, and a malware detection model that integrates the strengths of the individual models is trained to detect unknown software, in particular packed and obfuscated malware.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110951731.4A CN113761531A (en) | 2021-08-13 | 2021-08-13 | Malicious software detection system and method based on distributed API (application program interface) feature analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113761531A (en) | 2021-12-07 |
Family
ID=78790425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110951731.4A Withdrawn CN113761531A (en) | 2021-08-13 | 2021-08-13 | Malicious software detection system and method based on distributed API (application program interface) feature analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113761531A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115563614A (en) * | 2022-10-27 | 2023-01-03 | 任文欣 | Software abnormal behavior file tracing method applied to artificial intelligence |
CN115563614B (en) * | 2022-10-27 | 2023-08-04 | 艾德领客(上海)数字技术有限公司 | Software abnormal behavior file tracing method applied to artificial intelligence |
CN116028277A (en) * | 2023-03-27 | 2023-04-28 | 广州智算信息技术有限公司 | Database backup method and system based on CDC mode |
CN116226854A (en) * | 2023-05-06 | 2023-06-06 | 江西萤火虫微电子科技有限公司 | Malware detection method, system, readable storage medium and computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20211207 |