CN113761531A - Malicious software detection system and method based on distributed API (application program interface) feature analysis

Publication number
CN113761531A
Authority
CN
China
Prior art keywords
software
sandbox
malicious
detection
api
Legal status
Withdrawn
Application number
CN202110951731.4A
Other languages
Chinese (zh)
Inventor
张长河
林奇伟
闫翔宇
王剑辉
Current Assignee
Beijing Weida Information Technology Co ltd
Original Assignee
Beijing Weida Information Technology Co ltd
Application filed by Beijing Weida Information Technology Co ltd
Priority to CN202110951731.4A
Publication of CN113761531A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a malware detection system and method based on distributed API (application programming interface) feature analysis. It addresses two shortcomings of traditional static, single-machine malware detection: static detection cannot detect packed malware, and a single-machine sandbox runs samples inefficiently. The basic idea is as follows. First, combining the advantages of distributed sandboxes and dynamic API feature analysis, a distributed sandbox system with load balancing and health-state management is built to efficiently obtain the API execution sequences of many software samples. Next, dynamic and static features are extracted from the collected API execution sequences. Finally, the extracted features are fed to convolutional neural networks with different receptive fields, which learn the sequence features of malicious function execution, and to a recurrent neural network, which learns the temporal behavior patterns of malicious function execution. The invention can dynamically detect the execution patterns and behavior patterns of malicious functions and can effectively detect packed and obfuscated malicious programs.

Description

Malicious software detection system and method based on distributed API (application program interface) feature analysis
Technical Field
The invention relates to the field of network security, and in particular to a malware detection system and method based on distributed Application Programming Interface (API) feature analysis.
Background
Malware is software intended to harm a computer, server, or computer network. Once implanted on or otherwise introduced into a target computer, it causes varying degrees of damage. Malware is installed and run on a computer without explicitly prompting the user or obtaining the user's permission, and it manifests as malicious behaviors such as forced installation, browser hijacking, data theft, malicious collection of sensitive user information, and malicious software bundling. Malware is a tool with which hackers commit cybercrime: the attacker tricks the user into downloading and running the malware and thereby gains control of the user's host or steals private information. In recent years the barrier to attack has steadily fallen because hacking tools have been open-sourced, and anyone can easily obtain their source code from the network. Malware authors can therefore produce new malware at low cost in time, skill, and money, and the new malware causes great economic and security losses to individuals, society, and countries. Efficient malware detection is thus of great significance for protecting network security, people's property, and national stability.
In order to reduce the influence of malware on the network environment and users, a number of malware detection methods and patents have been proposed.
The invention patent with application number CN201610996935.9 discloses a sample type determination method for malware detection, comprising the following steps: 1) collecting sets of sample programs and forming sample libraries from them; submitting the programs in the sample libraries to a virtual sandbox environment to run, and then generating the corresponding sample analysis reports; 2) parsing the sample analysis reports, extracting specific feature combination information, and generating a set of feature vectors; inputting the feature vector set into a classifier for training to obtain an optimal model; 3) inputting the program under test into the optimal model to obtain a judgment of whether it is a malicious program or a normal program. According to that patent, the method improves the efficiency and accuracy of malware detection, avoids the complex operations and high energy consumption of dynamic detection techniques, and greatly increases detection speed while maintaining accuracy. However, it can only detect conventional malware samples and cannot effectively detect carefully disguised malware.
The invention patent with application number CN201810299726.8 discloses a method and system for detecting malware, the method comprising the following steps: 1) determining, from the installation package of the software under test, the functions corresponding to the permissions the software requests; 2) installing and running the software under test in a test environment from its installation package, and monitoring its actions and characteristics in real time while it runs; 3) preliminarily determining that the software under test is malware if, while performing the corresponding functions, it acquires preset private information and also has non-functional characteristics that allow it to run continuously and/or to resume automatically after being forcibly terminated. According to that patent, combining these two judgments (whether the software acquires preset private information, and whether it has the non-functional characteristics of running continuously and/or resuming after termination) greatly improves the accuracy of malware detection. However, that invention cannot dynamically detect the internal function call relationships of malware, and malware that uses a dynamic attack strategy is difficult to detect by relying on static features alone.
At present, malware detection methods fall roughly into two types, static analysis and dynamic analysis, depending on whether the malware is executed. Static analysis does not actually run the software sample under test. Instead, an analysis tool extracts information from the sample, such as function call names, file structure information, import tables, strings, and control flow, and the sample is judged malicious or not from the extracted features. Static analysis is convenient and fast, but it has difficulty detecting metamorphic, polymorphic, packed, and obfuscated malware. Dynamic analysis records the actual execution flow of a software sample in a sandbox or virtual machine environment, monitors the runtime characteristics of the application during execution, and analyzes the recorded logs to find malicious behavior in the application. In summary, existing malware detection methods have the following major drawbacks:
(1) Traditional malware detection methods depend heavily on expert knowledge and cannot detect malware that is packed, obfuscated, or mutated in ever-changing ways. Most traditional detection methods use a signature mechanism to mark malicious samples or features: for example, the hash value of a software sample is checked against a library of malicious signatures, or information such as the bytes and strings of the software is matched against rules. Signature rules are set manually by security experts according to the salient features of known malware families. This approach cannot update and extend the signature library in real time and lags noticeably behind new threats: it can only detect software that security workers have already found to be malicious and added to the signature library. In addition, the malicious signature library keeps growing as new samples appear, so the cost of querying and matching gradually increases.
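The limitation of hash-based signature matching described above can be illustrated with a minimal sketch. The signature library, the payload bytes, and the function name below are hypothetical and chosen only for illustration; they do not come from the patent.

```python
import hashlib

# Hypothetical signature library: SHA-256 digests of files already known to be malicious.
KNOWN_BAD = {hashlib.sha256(b"known malicious payload").hexdigest()}

def is_known_malware(data, signatures=KNOWN_BAD):
    """Return True only if this exact byte content is already in the library."""
    return hashlib.sha256(data).hexdigest() in signatures

original = b"known malicious payload"
# A one-byte change stands in for packing or obfuscation (an assumption of this sketch).
repacked = original + b"\x00"
```

Because any change to the bytes changes the digest, the repacked sample no longer matches, which is the lag the paragraph describes.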
(2) High-quality dynamic behavior training samples are lacking. Dynamic analysis needs high-quality records of software execution. To collect them, each software sample must be submitted to a sandbox and run for several minutes or even tens of minutes, so acquiring dynamic behavior data is costly in time and resources. Training today's popular machine learning and deep learning models places high demands on the quality and quantity of data, but a large-scale, complete dynamic training data set is currently lacking. In addition, most of the data in existing public data sets has already been preprocessed by the publisher, so analysis is limited to the information provided and further information about the original file behind each record cannot be obtained. Some data sets are labeled only as malicious or benign, so the specific types of malware cannot be distinguished more finely.
(3) Detection based on dynamic behavior alone has limitations. Some software, once placed in a sandbox, conceals itself by delaying execution and so does not generate sufficient behavior data; other software carries out its malicious behavior only after a certain condition triggers it. When samples are run at scale, the running status of each one cannot be checked individually, so detection that relies on dynamic behavior alone can miss samples. In particular, packed, mutated, and obfuscated malware can easily escape a detection system that uses a single method.
Weighing the advantages and disadvantages of a range of malware detection algorithms, the invention provides a malware detection system and method based on distributed API feature analysis.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a malware detection system and method based on distributed API feature analysis. It addresses two shortcomings of traditional static, single-machine malware detection: static detection cannot detect packed malware, and a single-machine sandbox runs samples inefficiently. The basic idea is as follows. First, combining the advantages of distributed sandboxes and dynamic API feature analysis, a distributed sandbox system with load balancing and health-state management is built to efficiently obtain the API execution sequences of many software samples. Next, dynamic and static features are extracted from the collected API execution sequences. Finally, the extracted features are fed to convolutional neural networks with different receptive fields, which learn the sequence features of malicious function execution, and to a recurrent neural network, which learns the temporal behavior patterns of malicious function execution. The invention can dynamically detect the execution patterns and behavior patterns of malicious functions and can effectively detect packed and obfuscated malicious programs.
To achieve this purpose, the invention provides the following technical solution:
A malware detection system based on distributed API feature analysis, comprising: a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module. The sample downloading module uses publicly available, commonly used data crawling techniques to crawl a large number of malware samples with specific type labels from publicly known software analysis service websites, and obtains benign software samples from public software download repositories. The task issuing module issues the collected software samples to suitable sandbox nodes according to a custom load balancing strategy in order to obtain the execution process of each sample. The sandbox system state monitoring module monitors the running state of every sandbox node in real time and sends an alarm to the central server when a node's state is poor. The sandbox system task scheduling module issues software sample execution tasks to the sandbox node with the best performance and highest efficiency according to a load balancing and resource utilization optimization strategy. The feature extraction module extracts the behavior features of each software sample, such as function calls, file operations, process execution, and network requests, from the sandbox run report. The detection report generation module trains an automated, integrated malware detection model to detect malware and generate the corresponding detection report.
Further, in the malware detection system based on distributed API feature analysis, the sample downloading module on the one hand uses publicly available, commonly used data crawling techniques to crawl a large number of malware samples with specific type labels from publicly known software analysis service websites, and on the other hand uses a purpose-written automated crawler to obtain benign software from public software download repositories. Malware and benign software are then drawn from the collected data in equal proportion as training samples and input into the distributed sandbox to obtain the API execution sequence of each sample.
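The equal-proportion sampling described above might look like the following minimal sketch. The function name, the label encoding (1 for malicious, 0 for benign), and the fixed seed are assumptions made for illustration, not details given in the patent.

```python
import random

def balanced_training_set(malicious, benign, seed=0):
    """Draw equal numbers of malicious and benign samples and shuffle them.

    Returns a list of (sample, label) pairs; label 1 = malicious, 0 = benign
    (a hypothetical encoding).
    """
    rng = random.Random(seed)
    n = min(len(malicious), len(benign))  # the smaller class limits the set size
    picked = [(s, 1) for s in rng.sample(malicious, n)]
    picked += [(s, 0) for s in rng.sample(benign, n)]
    rng.shuffle(picked)
    return picked
```

Downsampling the larger class is only one way to reach a 1:1 ratio; the patent does not say how the proportion is enforced.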
Further, in the malware detection system based on distributed API feature analysis, the task issuing module issues the samples to be detected according to the load balancing state of the distributed sandbox. The task issuing module issues collected software samples in batches and also receives software samples uploaded by users. For each sample it adds a record to the task database and has the scheduling module select the optimal sandbox node. After the API behavior execution record is obtained, it periodically tracks the task status and updates the task database until the behavior data of the software sample is finally input into the detection model, yielding the final detection result and detection report.
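The task-database bookkeeping in this paragraph can be sketched with an in-memory SQLite table. The table layout, column names, and status strings below are hypothetical; the patent does not specify a schema or a database engine.

```python
import sqlite3

def open_task_db():
    """Create an in-memory task database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, sample TEXT, node TEXT, status TEXT)"
    )
    return db

def submit(db, sample, node):
    """Add a record for a newly received sample and return its task id."""
    cur = db.execute(
        "INSERT INTO tasks (sample, node, status) VALUES (?, ?, 'pending')",
        (sample, node),
    )
    return cur.lastrowid

def update_status(db, task_id, status):
    """Record the latest state observed by the periodic status check."""
    db.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))

def status_of(db, task_id):
    return db.execute("SELECT status FROM tasks WHERE id = ?", (task_id,)).fetchone()[0]
```

A periodic job would call `update_status` as a task moves from pending through sandbox execution to the final detection result.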
Further, in the malware detection system based on distributed API feature analysis, the sandbox system state monitoring module monitors the health of every sandbox node in real time. It collects data from the database, consolidates it, and passes it through an HTTP (Hypertext Transfer Protocol) interface to a front-end page for display; the same data serves as a reference for the scheduling module's decisions. The module is responsible for task status statistics, historical node load statistics, current per-node task status statistics, node hardware status statistics (such as disk and memory), sample detection results, and so on, and it ensures that each call to the interface returns the latest state of the system. Monitoring is carried out along two dimensions, tasks and nodes. The monitored state can be used as input to the subsequent task scheduling algorithm, which makes it possible to quantify the performance differences between nodes at the current moment and adjust node loads in real time. The monitoring module is also configured with an automatic alarm function: when the utilization of any hardware resource on a node is too high, an alarm is automatically sent to the administrator's mailbox.
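A minimal sketch of the threshold check behind such an alarm follows. The `NodeStatus` fields, the inclusion of CPU alongside the disk and memory mentioned in the text, and the 0.9 threshold are all assumptions; sending the e-mail itself is left out.

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    cpu: float     # utilization in [0, 1]; CPU is an assumed metric
    memory: float  # utilization in [0, 1]
    disk: float    # utilization in [0, 1]

def alerts(nodes, threshold=0.9):
    """Return (node name, resource) pairs whose utilization exceeds the threshold."""
    out = []
    for node in nodes:
        for resource in ("cpu", "memory", "disk"):
            if getattr(node, resource) > threshold:
                out.append((node.name, resource))
    return out
```

The returned pairs could be formatted into the alarm message for the administrator, and the same statistics could feed the scheduler.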
Further, in the malware detection system based on distributed API feature analysis, the sandbox system task scheduling module arranges the tasks of the different sandbox nodes using a customized load balancing strategy. It uses a classic client/server architecture and monitors the working health of the sandbox cluster in real time. Specifically, in administrator task mode, the server issues software samples in batches to the appropriate sandbox nodes according to each node's load capacity and resource utilization and an optimization strategy, which raises the utilization of every sandbox node and keeps the whole cluster stable and efficient. In ordinary user task mode, a task submitted by a user through the client is added to a waiting queue; the server polls the queue, takes the earliest submitted task from its head, and assigns an idle virtual machine to execute it.
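The two working modes can be sketched as follows. The greedy least-loaded rule in `assign_batch` is a stand-in chosen for illustration, since the patent does not disclose its optimization strategy; the function names and the capacity dictionary are likewise hypothetical.

```python
from collections import deque

def assign_batch(samples, node_capacity):
    """Administrator mode (sketch): send each sample to the node whose
    load/capacity ratio would be lowest after receiving it."""
    load = {node: 0 for node in node_capacity}
    plan = {}
    for sample in samples:
        node = min(load, key=lambda n: (load[n] + 1) / node_capacity[n])
        load[node] += 1
        plan[sample] = node
    return plan

def serve_user_queue(queue, idle_vms):
    """User mode (sketch): pair the earliest waiting task with an idle
    virtual machine until tasks or idle machines run out."""
    started = []
    while queue and idle_vms:
        started.append((queue.popleft(), idle_vms.pop()))
    return started
```

With capacities `{"big": 2, "small": 1}`, three samples split two to `big` and one to `small`, so a larger node absorbs proportionally more work.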
Further, in the malware detection system based on distributed API feature analysis, the feature extraction module extracts the execution sequence features of each sample as it runs in the sandbox. Specifically, each software sample issued to the sandbox cluster is run in a sandbox for three minutes. The function calls, file operations, process execution, network requests, and other behaviors of the sample during execution are recorded and stored in the corresponding database as a backup. At the same time the data is parsed, dynamic and static features are extracted, and the behaviors are mapped to computable numerical vectors. These dynamic and static numerical vectors are the input to the malware detection model provided by the invention.
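One common way to map a recorded API call sequence to a numerical vector is integer indexing with padding, sketched below. The patent does not state which encoding it uses, so the vocabulary scheme, the `<pad>` and `<unk>` tokens, and the fixed length are assumptions; the API names are ordinary Windows calls used only as sample data.

```python
def build_vocab(sequences):
    """Assign an integer id to every API name seen in the training sequences."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for api in seq:
            vocab.setdefault(api, len(vocab))
    return vocab

def encode(seq, vocab, length):
    """Map API names to ids, truncating or zero-padding to a fixed length."""
    ids = [vocab.get(api, vocab["<unk>"]) for api in seq[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))

vocab = build_vocab([["NtCreateFile", "NtWriteFile"]])
vector = encode(["NtWriteFile", "RegSetValueExW", "NtCreateFile"], vocab, 5)
```

Calls that never appeared in training map to `<unk>`, and short traces are padded so that every sample yields a vector of the same length.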
Further, in the malware detection system based on distributed API feature analysis, the detection report generation module combines several efficient classification models through a stacking model to detect malware efficiently. Specifically, a fusion detection model based on dynamic and static features is designed. An analysis tool extracts static information from each sample, such as strings, referenced dynamic link libraries, and assembly sequences, and the dynamic API function call sequence is extracted by running the software in a sandbox. A convolutional neural network learns the static malware features, and a recurrent neural network learns the temporal behavior patterns of dynamic malicious API calls. Finally, an attention mechanism and a stacking algorithm fuse the base models, so that packed, mutated, and obfuscated malware can be detected effectively. A detailed detection report is generated automatically, giving the user a clear visual presentation of the detection result.
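The idea of convolutions with different receptive fields, mentioned in the abstract, can be illustrated with a small pure-Python sketch of the forward pass only. The kernel values here are made up, the input is a plain list of numbers rather than learned embeddings, and no training is shown; a real model would learn its kernels inside a deep learning framework.

```python
def conv1d_max(seq, kernel):
    """Valid 1D convolution (cross-correlation) followed by max-over-time pooling.

    Assumes len(seq) >= len(kernel).
    """
    k = len(kernel)
    outputs = [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]
    return max(outputs)

def multi_field_features(seq, kernels):
    """One pooled feature per kernel; the kernel length is its receptive field."""
    return [conv1d_max(seq, kernel) for kernel in kernels]

# Illustrative kernels of width 2 and width 3 (values are arbitrary).
features = multi_field_features([1, 0, 2, 3], [[1, 1], [1, 0, 1]])
```

A short kernel responds to adjacent calls, while a longer one can respond to patterns spread over more calls, which is why several widths are used side by side.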
A malware detection method based on distributed API (application programming interface) feature analysis, characterized by comprising the following steps:
step (1), collecting software samples: using publicly available, commonly used data crawling techniques, a large number of malware samples with specific type labels are crawled from publicly known software analysis service websites; an automated crawler is then written to obtain benign software from public software download repositories; the collected malicious and benign software samples are mixed in equal proportion to construct a training sample data set;
step (2), submitting software samples: the collected malicious and benign software samples are submitted to a distributed sandbox cluster and issued to suitable sandbox nodes according to a load balancing and optimized resource allocation strategy; the run report of each software sample is stored in the corresponding database as a backup; the running state and health of every sandbox are monitored in real time, and the running status of the software samples and the health of the sandbox nodes are returned to a central server in real time;
step (3), constructing dynamic and static malware features: based on the software run reports from step (2), the function calls, file operations, process execution, network requests, network traffic, and other behaviors of each software sample during execution are recorded and stored in the corresponding database as a backup; at the same time the data is parsed, dynamic and static features are extracted, and the behaviors are represented and mapped as computable numerical vectors;
step (4), training base malware detection models: a fusion detection model based on dynamic and static features is designed; an analysis tool extracts static information from each sample, such as strings, referenced dynamic link libraries, and assembly sequences, and the dynamic API function call sequence is extracted by running the software in a sandbox; a convolutional neural network learns the static malware features and a recurrent neural network learns the temporal behavior patterns of dynamic malicious API calls, yielding base detection models based on the static and dynamic features respectively;
step (5), detecting malware: based on the base detection models from step (4), a stacking algorithm and an attention mechanism are used to learn the weight of each base detection model, and a malware detection model that integrates the strengths of the several detection models is trained to detect unknown software, in particular packed and obfuscated malware.
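The weighting of base models in step (5) can be sketched as a softmax-weighted average of their outputs. This is a simplification: here the attention scores are supplied by hand, whereas in the method they would be learned, and a full stacking setup would train a meta-model on held-out base-model predictions. All names and numbers are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(base_probs, attention_scores):
    """Weighted average of base-model malicious probabilities.

    base_probs: one probability per base model (e.g. static and dynamic models).
    attention_scores: one unnormalized score per base model (assumed given here).
    """
    weights = softmax(attention_scores)
    return sum(w * p for w, p in zip(weights, base_probs))

# Equal scores give a plain average; a higher score shifts trust to that model.
even = fuse([0.2, 0.8], [0.0, 0.0])
dynamic_heavy = fuse([0.2, 0.8], [0.0, 5.0])
```

For a packed sample whose static features look benign, a fusion that weights the dynamic model more heavily can still flag it, which is the motivation the step gives for combining models.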
The invention has the beneficial effects that:
1) The invention goes beyond traditional malware detection based on signature mechanisms and static features by designing and implementing a malware detection method based on distributed API feature analysis. By deploying a distributed, self-scheduling, self-monitoring sandbox cluster, it efficiently obtains the API execution sequences of a large number of malicious and benign software samples, providing data support for learning how malware executes.
2) The method provided by the invention effectively solves the problem that signature mechanisms and static feature detection cannot detect packed, mutated, or obfuscated malware.
3) The invention describes malware using dynamic and static features together, so both the static statistical features and the temporal behavior features of malware can be captured effectively. The real distributed sandbox cluster can capture the behavior of malware that delays its execution, and the detection model designed by the invention can learn the real execution behavior of malware, which improves the accuracy and efficiency of malware detection.
4) Practical use of the prototype system shows that the invention can effectively detect carefully disguised malware, in particular packed, mutated, and obfuscated malware. The scheme is easy to deploy in existing networks, simple to operate, safe, and reliable, and it has significant economic and social benefits and broad prospects for market adoption.
Drawings
FIG. 1 is a block diagram of the overall architecture of the malware detection system and method based on distributed API feature analysis according to the invention.
Detailed Description
The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to enable those skilled in the art to more clearly understand the embodiments of the present invention, but not to limit the scope of the present invention.
At present, malware detection methods are roughly classified into two types, a static analysis method and a dynamic analysis method, depending on whether malware is executed. The static analysis method does not need to actually run a software sample to be tested, but extracts information from the software sample through an analysis tool, such as data of function call names, file structure information, import tables, character strings, control flows and the like, and judges whether the software sample is malicious or not according to the extracted features. The static analysis method is convenient and quick, but is difficult to detect the deformed, polymorphic, shelled and confused converted malicious software. The dynamic analysis method is characterized in that the actual operation flow of a software sample is recorded in a sandbox or virtual machine environment, the operation time sequence characteristics of a sample program are monitored in the process, log information is analyzed, and malicious behaviors in the sample program are found. The invention comprehensively considers the advantages and the defects of a plurality of malicious software detection algorithms and provides a malicious software detection system and a malicious software detection method based on distributed API characteristic analysis.
First, the innovative principles of the technology of the present invention are explained, and the basic ideas are as follows: combining the advantages of the distributed sandbox and the dynamic API characteristic analysis, and building a distributed sandbox system with load balancing and health state management to efficiently obtain an API execution sequence of a plurality of software samples; then extracting dynamic and static characteristics from the extracted API execution sequence; and finally, inputting the extracted dynamic and static characteristics into the sequence characteristics executed by the convolutional neural network learning malicious function of different receptive fields, and learning the time sequence behavior pattern executed by the malicious function by using the cyclic neural network. The invention can dynamically detect the malicious function execution mode and behavior mode, and can effectively detect the malicious program after the shell adding and the confusion.
The malware detection system based on distributed API feature analysis according to the invention is shown in fig. 1. Its workflow is as follows:
1) Software sample collection: a large number of malware samples with specific type labels are crawled from publicly known software analysis service websites using currently public, common data-crawling techniques; an automated crawler is then written to obtain benign software from public software download repositories; the collected malicious and benign software samples are mixed in equal proportion to construct a training sample data set;
2) Software sample submission: the collected malicious and benign software samples are submitted to the distributed sandbox cluster and issued to appropriate sandbox nodes according to load-balancing and optimized resource allocation strategies; the run report of each software sample is stored in the corresponding database for backup; the running state and health of each sandbox are monitored in real time; and the running status of the software samples and the health of the sandbox nodes are returned to a central server in real time;
3) Construction of dynamic and static malware features: based on the run reports of step 2), behaviors involved in the execution of each software sample, such as function calls, file operations, process execution, network requests, and network traffic, are recorded and stored in the corresponding database for backup; at the same time, the data is analyzed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors;
4) Training of base malware detection models: a fusion detection model based on dynamic and static features is designed; static information such as strings, referenced dynamic link libraries, and assembly sequences is extracted from a sample with an analysis tool; the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features and a recurrent neural network learns the temporal behavior patterns of dynamic malicious API calls, yielding base detection models based on the dynamic and static features respectively;
5) Malware detection: based on the base malware detection models of step 4), the weight of each base detection model is learned using a stacking algorithm and a self-attention mechanism, and a malware detection model that integrates the advantages of the several detection models is trained, so as to detect unknown software and, in particular, packed and obfuscated malware.
The invention goes beyond traditional malware detection methods based on signature mechanisms and static features, and designs and implements a malware detection method based on distributed API feature analysis. By deploying a distributed, self-scheduling, and self-monitoring sandbox cluster, it efficiently obtains the API execution sequences of large numbers of malicious and benign software samples, providing data support for learning how malware executes. The proposed method addresses the inability of signature-based and static-feature detection methods to detect packed, variant, and obfuscated malware. The invention describes malware using dynamic and static features together, so that both the static statistical features and the temporal behavior features of malware can be captured; the real distributed sandbox cluster can capture the behavior of malware that delays its execution; and the detection model can learn the real characteristics of malware execution behavior, improving the accuracy and efficiency of malware detection. Application of a prototype system indicates that the invention can effectively detect carefully disguised malware, in particular packed, variant, and obfuscated malware, and that the scheme is easy to deploy in existing networks, simple to operate, safe and reliable, with notable economic and social benefits and broad prospects for market adoption.
The structural principle and the working process of the distributed API feature analysis-based malware detection system and method according to the present invention are described in detail below with reference to the accompanying drawings, which preferably include the following embodiments.
PREFERRED EMBODIMENTS FOR CARRYING OUT THE INVENTION
As shown in fig. 1, in a preferred embodiment, the malware detection system based on distributed API feature analysis according to the invention includes: a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module.
On one hand, the sample downloading module uses currently public, common data-crawling techniques to crawl a large number of malware samples with specific type labels from publicly known software analysis service websites; on the other hand, it uses an automated crawler written for the purpose to obtain benign software from public software download repositories. Equal proportions of malware and benign software are then drawn from this data as training samples and submitted to the distributed sandbox to obtain the API execution sequence of each sample.
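Drawing malicious and benign samples in equal proportion can be sketched as follows. This is a minimal illustration under assumptions not stated in the source: the larger class is randomly downsampled to the size of the smaller one, labels are 1 for malicious and 0 for benign, and the function name `build_balanced_dataset` is hypothetical.

```python
import random


def build_balanced_dataset(malicious, benign, seed=0):
    """Mix malicious and benign samples 1:1 and attach labels (1 = malicious)."""
    rng = random.Random(seed)
    n = min(len(malicious), len(benign))
    # Downsample the larger class so both classes contribute n samples.
    picked_malicious = rng.sample(list(malicious), n)
    picked_benign = rng.sample(list(benign), n)
    dataset = [(s, 1) for s in picked_malicious] + [(s, 0) for s in picked_benign]
    rng.shuffle(dataset)
    return dataset


# Hypothetical sample identifiers.
dataset = build_balanced_dataset(["m1", "m2", "m3"], ["b1", "b2"])
```

A fixed seed keeps the split reproducible between runs, which is convenient when the same training set must be resubmitted to the sandbox.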
The task issuing module issues samples to be detected according to the load-balancing state of the distributed sandbox. It issues collected software samples in batches and also receives software samples uploaded by users. For each sample it adds a record to the task database and selects an optimal sandbox node through the scheduling module. It then tracks the task state at regular intervals and updates the task database, from the recording of API behavior during execution until the behavior data of the sample is finally fed into the detection model and the final detection result and detection report are obtained.
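The task lifecycle described above (add a record, pick a node, poll the state until a result exists) can be sketched with an in-memory stand-in for the task database. All names here (`TaskState`, `TaskDB`, `issue_task`, `poll_once`) and the set of states are assumptions made for illustration; the source does not specify a schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    REPORTED = "reported"  # API behavior record is available
    DETECTED = "detected"  # detection model has produced a result


@dataclass
class TaskDB:
    """In-memory stand-in for the task database."""
    rows: dict = field(default_factory=dict)
    next_id: int = 1

    def add(self, sample_name):
        task_id = self.next_id
        self.next_id += 1
        self.rows[task_id] = {
            "sample": sample_name,
            "node": None,
            "state": TaskState.PENDING,
        }
        return task_id


def issue_task(db, sample_name, pick_node):
    """Record the sample, then assign the node chosen by the scheduler callback."""
    task_id = db.add(sample_name)
    db.rows[task_id]["node"] = pick_node()
    db.rows[task_id]["state"] = TaskState.RUNNING
    return task_id


def poll_once(db, task_id, query_state):
    """One polling tick: copy the state reported for the task into the task DB."""
    db.rows[task_id]["state"] = query_state(task_id)
    return db.rows[task_id]["state"]


db = TaskDB()
task_id = issue_task(db, "sample.exe", pick_node=lambda: "node-1")
```

In a real deployment `poll_once` would be called by a timer until the state reaches `DETECTED`; the callbacks stand in for the scheduling module and the sandbox node's status interface.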
The sandbox system state monitoring module monitors the running health of each sandbox node in real time. It collects data from the database, integrates it, and passes it through an HTTP interface to the front-end interface for display; the same data serves as a reference for the scheduling module's decisions. The module is responsible for task state statistics, node historical load statistics, node current task state statistics, node hardware state statistics (such as disk and memory), sample detection results, and so on, and ensures that each call to the interface returns the latest state of the system. Monitoring covers two dimensions, tasks and nodes. The monitored state can serve as input to the subsequent task scheduling algorithm, which helps quantify the performance differences between nodes at the current moment and adjust node loads in real time. The monitoring module is also configured with an automatic alarm function: when the utilization of some node hardware resource is too high, alarm information is automatically sent to the administrator's mailbox.
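The automatic alarm can be sketched as a threshold check over per-node utilization figures. The threshold values, metric names, and the function name `check_node` are assumptions; the source only says that an alarm is raised when utilization is "too high". Delivery to the administrator's mailbox is left out, so the sketch returns the alarm messages instead of sending them.

```python
def check_node(node_name, usage, thresholds):
    """Return one alarm message for every metric above its threshold."""
    alarms = []
    for metric, value in sorted(usage.items()):
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alarms.append(
                f"{node_name}: {metric} at {value:.0%} exceeds {limit:.0%}"
            )
    return alarms


# Hypothetical utilization fractions and limits.
alarms = check_node(
    "node-1",
    usage={"disk": 0.95, "memory": 0.40},
    thresholds={"disk": 0.90, "memory": 0.90},
)
```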
The sandbox system task scheduling module uses a classic client/server architecture and monitors the working health of the sandbox cluster in real time. Specifically, when the server is in administrator task mode, it issues software samples in batches to the corresponding sandbox nodes according to an optimization strategy that takes into account the capacity and resource utilization of each sandbox node, which improves the utilization of each node and ensures the stability and efficiency of the whole sandbox cluster. When the server is in ordinary user task mode, a task submitted by a user through the client is added to a waiting queue; the server polls the queue, takes the earliest submitted task from its head, and assigns a virtual machine in the idle state to execute it.
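The two working modes can be sketched as follows. The source does not define the optimization strategy, so this sketch substitutes a simple assumed rule (send each sample to the node with the most spare capacity); the node fields `capacity` and `running` and all function names are hypothetical.

```python
from collections import deque


def pick_node(nodes):
    """Pick the node with the most spare capacity, or None if all are full."""
    def spare(node):
        return node["capacity"] - node["running"]

    candidates = [node for node in nodes if spare(node) > 0]
    if not candidates:
        return None
    return max(candidates, key=spare)


def dispatch_batch(samples, nodes):
    """Administrator mode: assign a batch of samples node by node."""
    assignment = {}
    for sample in samples:
        node = pick_node(nodes)
        if node is None:
            break  # cluster is full; remaining samples wait
        node["running"] += 1
        assignment[sample] = node["name"]
    return assignment


def serve_user_queue(queue, idle_vms):
    """User mode: first-submitted task is taken from the head of the queue."""
    started = []
    while queue and idle_vms:
        task = queue.popleft()
        vm = idle_vms.pop()
        started.append((task, vm))
    return started


nodes = [
    {"name": "a", "capacity": 2, "running": 1},
    {"name": "b", "capacity": 3, "running": 0},
]
assignment = dispatch_batch(["s1", "s2", "s3", "s4", "s5"], nodes)
started = serve_user_queue(deque(["t1", "t2", "t3"]), ["vm1", "vm2"])
```

A production scheduler would read the spare capacity from the monitoring module rather than from a local counter.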
The feature extraction module extracts features from the execution sequence of a sample as it runs in the sandbox. Specifically, each software sample issued to the sandbox cluster is run in the sandbox for three minutes. Behaviors involved in its execution, such as function calls, file operations, process execution, and network requests, are recorded and stored in the corresponding database for backup; at the same time, the data is analyzed, dynamic and static features are extracted, and the behaviors are mapped into computable numerical vectors. The dynamic and static numerical vectors are then input into the malware detection model proposed by the invention.
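Mapping recorded behavior into numerical vectors can be sketched for the API call sequence alone. The encoding below is an assumption, not the patent's stated scheme: each API name gets an integer id (0 reserved for padding and unknown names), the ordered id sequence serves as the dynamic feature, and a per-API call count serves as a statistical feature. The API names are ordinary Windows API names used only as example data.

```python
from collections import Counter


def build_vocab(sequences):
    """Assign an integer id to each API name; 0 is reserved for padding/unknown."""
    vocab = {}
    for seq in sequences:
        for api in seq:
            vocab.setdefault(api, len(vocab) + 1)
    return vocab


def encode_sequence(seq, vocab, max_len):
    """Dynamic feature: fixed-length id sequence that preserves call order."""
    ids = [vocab.get(api, 0) for api in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


def count_vector(seq, vocab):
    """Statistical feature: how often each known API was called."""
    counts = Counter(seq)
    return [counts.get(api, 0) for api in sorted(vocab, key=vocab.get)]


trace = ["CreateFileW", "WriteFile", "CreateFileW"]
vocab = build_vocab([trace])
ids = encode_sequence(["WriteFile", "CreateFileW", "RegSetValueExW"], vocab, 5)
counts = count_vector(trace, vocab)
```

The id sequence is the kind of input a recurrent network consumes, while the count vector suits models that ignore ordering.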
The detection report generation module uses a stacking model to combine several efficient classification models for efficient malware detection. Specifically, a fusion detection model based on dynamic and static features is designed: static information such as strings, referenced dynamic link libraries, and assembly sequences is extracted from a sample with an analysis tool; the dynamic API function call sequence is extracted by running the software in the sandbox; a convolutional neural network learns the static malware features; a recurrent neural network learns the temporal behavior patterns of dynamic malicious API calls; finally, an attention mechanism and a stacking algorithm fuse the base models, enabling effective detection of packed, variant, and obfuscated malware. A detailed detection report is also generated automatically, giving the user a friendly visualization of the detection result.
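The fusion of base models can be illustrated with a minimal sketch of attention-style weighting. It simplifies the described scheme considerably: the attention scores are passed in as fixed numbers, whereas in stacking they would be learned by a meta-model from the base models' held-out predictions. The function names `softmax` and `fuse` and the 0.5 decision threshold are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(base_probs, attention_scores, threshold=0.5):
    """Weight each base model's malware probability and threshold the result."""
    weights = softmax(attention_scores)
    fused = sum(w * p for w, p in zip(weights, base_probs))
    return fused, fused >= threshold


# Hypothetical outputs of a static (CNN) and a dynamic (RNN) base model.
fused, is_malicious = fuse(base_probs=[0.9, 0.3], attention_scores=[0.0, 0.0])
```

With equal scores the weights are uniform, so the fused value is the plain average of the two base predictions; unequal scores shift trust toward one base model.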
The invention further provides a malicious software detection method based on distributed API characteristic analysis, which comprises the following steps:
step (1), software sample collection: crawling a large number of malware samples with specific type labels from publicly known software analysis service websites using currently public, common data-crawling techniques, then writing an automated crawler to obtain benign software from public software download repositories, mixing the collected malicious and benign software samples in equal proportion, and constructing a training sample data set;
step (2), submitting software samples, namely submitting collected malicious and benign software samples to a distributed sandbox cluster, issuing the software samples to appropriate sandbox nodes according to a load balancing and optimized resource allocation strategy, storing an operation report of each software sample to a corresponding database for backup, monitoring the operation state and health condition of each sandbox in real time, and returning the operation condition of the software samples and the operation health condition of the sandbox nodes to a central server in real time;
step (3), constructing dynamic and static characteristics of malicious software, recording behaviors such as function calling, file operation, process execution, network request and the like related to each software sample in the execution process based on the software operation report in the step (2), storing the behaviors into a corresponding database for backup, simultaneously performing data analysis and dynamic and static characteristic extraction, and mapping the behaviors into a calculable numerical vector;
step (4), training base malware detection models: designing a fusion detection model based on dynamic and static features, extracting static information such as strings, referenced dynamic link libraries, and assembly sequences from a sample using an analysis tool, extracting the dynamic API function call sequence by running the software in the sandbox, learning static malware features using a convolutional neural network, and learning the temporal behavior patterns of dynamic malicious API calls using a recurrent neural network, thereby obtaining base detection models based on the dynamic and static features respectively;
step (5), malware detection: based on the base malware detection models of step (4), learning the weight of each base detection model using a stacking algorithm and an attention mechanism, and training a malware detection model that integrates the advantages of the several detection models, so as to detect unknown software and, in particular, packed and obfuscated malware.
The above description is only for the preferred embodiment of the present invention, and the technical solution of the present invention is not limited thereto, and any known modifications made by those skilled in the art based on the main technical idea of the present invention belong to the technical scope of the present invention, and the specific protection scope of the present invention is subject to the description of the claims.

Claims (8)

1. A malware detection system based on distributed API (application programming interface) feature analysis, comprising: a sample downloading module, a task issuing module, a sandbox system state monitoring module, a sandbox system task scheduling module, a feature extraction module, and a detection report generation module.
2. The distributed API feature analysis based malware detection system of claim 1, wherein the sample downloading module crawls a large number of malware samples with specific type labels from publicly known software analysis service websites using currently public, common data-crawling techniques, and obtains benign software from public software download repositories.
3. The distributed API feature analysis-based malware detection system of claim 1, wherein the task issuing module allocates a sample to be tested to an optimal sandbox node according to a load balancing state of the distributed sandbox, and regularly monitors an execution state of a current task until the API of the sample is executed.
4. The distributed API feature analysis based malware detection system of claim 1, wherein the sandbox system status monitoring module monitors the operating health status of each sandbox node in real time, and is responsible for task status statistics, node historical load statistics, node current task status statistics, node hardware status statistics (including disk, CPU, memory, etc.), sample detection results, etc., and is configured with an automatic alarm function, and when a certain utilization index of node hardware resources is too high, an automatic alarm is given.
5. The distributed API feature analysis based malware detection system of claim 1, wherein the sandbox system task scheduling module schedules task orchestration of different sandbox nodes by using an individualized load balancing strategy, monitors the working health state of the sandbox cluster in real time by using a client/server architecture, and schedules software samples to corresponding sandbox nodes in batches according to an optimization strategy according to the bearing capacity and resource utilization condition of each sandbox node, so that the utilization rate of each sandbox node is improved, and the stability and the efficiency of the whole sandbox cluster are ensured.
6. The distributed API feature analysis based malware detection system of claim 1, wherein the feature extraction module extracts API execution sequence features of a sample to be tested, in that behaviors involved in the execution of the software sample, including but not limited to function calls, file operations, process execution, and network requests, are extracted and mapped into computable numerical vectors.
7. The distributed API feature analysis based malware detection system of claim 1, wherein the detection report generation module learns static features of malware using convolutional neural networks, learns dynamic temporal behavior patterns of malware using recurrent neural networks based on an attention mechanism, finally fuses a plurality of base models using a stacking algorithm, generates a detection report, and records malicious execution sequences and final detection results.
8. A malicious software detection method based on distributed API (application program interface) feature analysis is characterized by comprising the following steps:
step (1), software sample collection: crawling a large number of malware samples with specific type labels from publicly known software analysis service websites using currently public, common data-crawling techniques, then writing an automated crawler to obtain benign software from public software download repositories, mixing the collected malicious and benign software samples in equal proportion, and constructing a training sample data set;
step (2), submitting software samples, namely submitting collected malicious and benign software samples to a distributed sandbox cluster, issuing the software samples to appropriate sandbox nodes according to a load balancing and optimized resource allocation strategy, storing an operation report of each software sample to a corresponding database for backup, monitoring the operation state and health condition of each sandbox in real time, and returning the operation condition of the software samples and the operation health condition of the sandbox nodes to a central server in real time;
step (3), constructing dynamic and static characteristics of malicious software, recording behaviors such as function calling, file operation, process execution, network request and the like related to each software sample in the execution process based on the software operation report in the step (2), storing the behaviors into a corresponding database for backup, simultaneously performing data analysis and dynamic and static characteristic extraction, and mapping the behaviors into a calculable numerical vector;
step (4), training base malware detection models: designing a fusion detection model based on dynamic and static features, extracting static information such as strings, referenced dynamic link libraries, and assembly sequences from a sample using an analysis tool, extracting the dynamic API function call sequence by running the software in the sandbox, learning static malware features using a convolutional neural network, and learning the temporal behavior patterns of dynamic malicious API calls using a recurrent neural network, thereby obtaining base detection models based on the dynamic and static features respectively;
step (5), malware detection: based on the base malware detection models of step (4), learning the weight of each base detection model using a stacking algorithm and an attention mechanism, and training a malware detection model that integrates the advantages of the several detection models, so as to detect unknown software and, in particular, packed and obfuscated malware.
CN202110951731.4A 2021-08-13 2021-08-13 Malicious software detection system and method based on distributed API (application program interface) feature analysis Withdrawn CN113761531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951731.4A CN113761531A (en) 2021-08-13 2021-08-13 Malicious software detection system and method based on distributed API (application program interface) feature analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110951731.4A CN113761531A (en) 2021-08-13 2021-08-13 Malicious software detection system and method based on distributed API (application program interface) feature analysis

Publications (1)

Publication Number Publication Date
CN113761531A true CN113761531A (en) 2021-12-07

Family

ID=78790425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951731.4A Withdrawn CN113761531A (en) 2021-08-13 2021-08-13 Malicious software detection system and method based on distributed API (application program interface) feature analysis

Country Status (1)

Country Link
CN (1) CN113761531A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563614A (en) * 2022-10-27 2023-01-03 任文欣 Software abnormal behavior file tracing method applied to artificial intelligence
CN115563614B (en) * 2022-10-27 2023-08-04 艾德领客(上海)数字技术有限公司 Software abnormal behavior file tracing method applied to artificial intelligence
CN116028277A (en) * 2023-03-27 2023-04-28 广州智算信息技术有限公司 Database backup method and system based on CDC mode
CN116226854A (en) * 2023-05-06 2023-06-06 江西萤火虫微电子科技有限公司 Malware detection method, system, readable storage medium and computer

Similar Documents

Publication Publication Date Title
KR102403622B1 (en) Systems and methods for behavioral threat detection
CN113761531A (en) Malicious software detection system and method based on distributed API (application program interface) feature analysis
CN107659543B (en) Protection method for APT (android packet) attack of cloud platform
CN105100032B (en) A kind of method and device for preventing resource from stealing
CN109361643B (en) Deep tracing method for malicious sample
CN105187392B (en) Mobile terminal from malicious software detecting method and its system based on Network Access Point
CN111460446B (en) Malicious file detection method and device based on model
CN107003976A (en) Based on active rule can be permitted determine that activity can be permitted
CN112507330B (en) Malicious software detection system based on distributed sandbox
CN111090864B (en) Penetration test frame system, penetration test platform and penetration test method
JP7389806B2 (en) Systems and methods for behavioral threat detection
CN107426148A (en) A kind of anti-reptile method and system based on running environment feature recognition
CN109948335A (en) System and method for detecting the rogue activity in computer system
CN110995652B (en) Big data platform unknown threat detection method based on deep migration learning
Bernardi et al. A fuzzy-based process mining approach for dynamic malware detection
CN110572302B (en) Diskless local area network scene identification method and device and terminal
Kannan et al. A novel cloud intrusion detection system using feature selection and classification
Eldos et al. On the KDD'99 Dataset: Statistical Analysis for Feature Selection
Manthena et al. Analyzing and Explaining Black-Box Models for Online Malware Detection
CN116756738A (en) Malicious code detection system and method based on distributed API call relationship
Sun et al. Advances in Artificial Intelligence and Security: 7th International Conference, ICAIS 2021, Dublin, Ireland, July 19-23, 2021, Proceedings, Part III
JP2018132787A (en) Log analysis support apparatus and log analysis support method
CN113360916A (en) Risk detection method, device, equipment and medium for application programming interface
Liang et al. Semantics-based anomaly detection of processes in linux containers
Kandukuru et al. PNSDroid: a hybrid approach for detection of Android malware

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211207