CN115114627B - Malicious software detection method and device

Info

Publication number: CN115114627B
Application number: CN202211043847.9A
Authority: CN
Inventors: 周公延; 陈杰; 赵林林; 薛锋; 童兆丰
Original assignee: Beijing ThreatBook Technology Co Ltd
Current assignee: Beijing ThreatBook Technology Co Ltd
Priority date: 2022-08-30
Filing date: 2022-08-30
Publication date: 2022-12-16
Anticipated expiration: 2042-08-30
Also published as: CN115114627A

Abstract

The application provides a malware detection method and device. The method comprises: collecting behavior data generated when malware runs in a sandbox, and constructing a training data set based on the behavior data; expanding the training data set through a random walk algorithm to obtain an expanded data set; acquiring a vectorized API call sequence based on the expanded data set; constructing a dynamic malware detection model based on the API call sequence; and detecting the to-be-detected software running in real time based on the dynamic malware detection model to obtain a malware detection result. By implementing this embodiment, the current detection dilemma can be overcome by means of the random walk algorithm, so that malware is detected more effectively and overall detection efficiency is improved.

Description

Malicious software detection method and device

Technical Field

The application relates to the field of network security, in particular to a malicious software detection method and device.

Background

With the rapid development of big data and artificial intelligence technology, network attacks have become numerous, fast and persistent, which poses great challenges to the information security of large enterprises. Among attack methods, malware is the one most easily exploited, and attackers equip it with various techniques, such as multithread obfuscation, to maximize the harm it causes. As a direct result, existing malware detection systems cannot detect malware effectively, and the efficiency of malware detection is greatly reduced.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for detecting malicious software, which can break through the current detection dilemma based on a random walk algorithm, so that the malicious software can be detected more effectively, and the overall detection efficiency is improved.

A first aspect of an embodiment of the present application provides a method for detecting malware, including:

behavior data generated when the malicious software runs in the sandbox is collected, and a training data set is constructed based on the behavior data;

expanding the training data set through a random walk algorithm to obtain an expanded data set;

based on the extended data set, acquiring a vectorized API call sequence;

building a malicious software dynamic detection model based on the API calling sequence;

and detecting the to-be-detected software running in real time based on the dynamic malicious software detection model to obtain a malicious software detection result.

In the implementation process, the method first collects behavior data generated when the malware runs in the sandbox and constructs a training data set based on the behavior data; it then expands the training data set through a random walk algorithm to obtain an expanded data set; next, it acquires a vectorized API call sequence based on the expanded data set; it then constructs a dynamic malware detection model based on the API call sequence; and finally, it detects the to-be-detected software running in real time based on the dynamic malware detection model to obtain a malware detection result. Therefore, implementing the method can overcome the current detection dilemma by means of the random walk algorithm, so that malware is detected more effectively and overall detection efficiency is improved.

Further, the behavior data at least comprises process information, thread information and an API call function name.

Further, the step of expanding the training data set by the random walk algorithm to obtain an expanded data set includes:

extracting the name of an API (application programming interface) calling function of the behavior data in the training data set;

recording a plurality of threads where the behavior data are located based on the API call function name;

calling a random walk algorithm among the multiple threads to obtain walk data;

and expanding the training data set based on the walking data to obtain an expanded data set.

Further, the step of obtaining a vectorized API call sequence based on the augmented data set comprises:

extracting all API call function names in the extended data set;

coding the API calling function name to obtain an API calling function table;

and vectorizing based on the API call function table to obtain a vectorized API call sequence.

Further, the step of detecting the to-be-detected software running in real time based on the dynamic malware detection model to obtain a malware detection result includes:

recording API calling behaviors of the software to be detected running in real time;

and analyzing the API calling behavior based on the malicious software dynamic detection model to obtain a malicious software detection result.

A second aspect of embodiments of the present application provides a malware detection apparatus, where the malware detection apparatus includes:

the building unit is used for collecting behavior data generated when the malicious software runs in the sandbox and building a training data set based on the behavior data;

the expansion unit is used for expanding the training data set through a random walk algorithm to obtain an expansion data set;

an obtaining unit, configured to obtain a vectorized API call sequence based on the extended data set;

the modeling unit is used for constructing a malicious software dynamic detection model based on the API calling sequence;

and the detection unit is used for detecting the to-be-detected software running in real time based on the dynamic detection model of the malicious software to obtain a malicious software detection result.

In the implementation process, the building unit of the malware detection device collects behavior data generated when malware runs in a sandbox and constructs a training data set based on the behavior data; the expansion unit expands the training data set through a random walk algorithm to obtain an expanded data set; the obtaining unit acquires a vectorized API call sequence based on the expanded data set; the modeling unit constructs a dynamic malware detection model based on the API call sequence; and the detection unit detects the to-be-detected software running in real time based on the dynamic malware detection model to obtain a malware detection result. Therefore, by implementing this embodiment, the current detection dilemma can be overcome by means of the random walk algorithm, so that malware is detected more effectively and overall detection efficiency is improved.

Further, the extension unit includes:

the first extraction subunit is used for extracting the API call function name of the behavior data in the training data set;

the recording subunit is used for recording a plurality of threads where the behavior data are located based on the API calling function name;

the calling subunit is used for calling a random walk algorithm among the multiple threads to obtain walk data;

and the expansion subunit is used for expanding the training data set based on the walking data to obtain an expansion data set.

Further, the acquisition unit includes:

the second extraction subunit is used for extracting all the API call function names in the expansion data set;

the processing subunit is used for coding the name of the API calling function to obtain an API calling function table;

the processing subunit is further configured to perform vectorization processing based on the API call function table, so as to obtain a vectorized API call sequence.

Further, the detection unit includes:

the second recording subunit is used for recording the API calling behavior of the to-be-detected software running in real time;

and the analysis subunit is used for analyzing the API calling behavior based on the malicious software dynamic detection model to obtain a malicious software detection result.

A third aspect of embodiments of the present application provides an electronic device, including a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute the malware detection method according to any one of the first aspect of embodiments of the present application.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, which stores computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the malware detection method according to any one of the first aspect of the embodiments of the present application.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a schematic flowchart illustrating a malware detection method according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a malware detection apparatus according to an embodiment of the present application;

Fig. 3 is a diagram of a process of inter-thread random walk according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a vector training and retrieving structure according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not construed as indicating or implying relative importance.

Example 1

Referring to fig. 1, fig. 1 is a flowchart illustrating a malware detection method according to the present embodiment. The malicious software detection method comprises the following steps:

S101, collecting behavior data generated when the malware runs in the sandbox, and constructing a training data set based on the behavior data.

In this embodiment, the behavior data at least includes process information, thread information, and an API call function name.

In this embodiment, the method may run the malware in the sandbox, collect behavior data, and construct a training data set.

In this embodiment, the method may divide a large amount of malware data into a training set, a validation set and a test set, run the malware in the training set and the validation set in a sandbox respectively, and collect behavior information; meanwhile, following the split of the original data, the method generates a new training set and a new test set from the collected behavior data, where each sample (i.e., behavior data) includes at least a process, a thread, and an API call function name.
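The sample structure and data set split described above can be sketched as follows. This is a minimal illustration only: the names `ApiEvent`, `Sample` and `split_samples`, the 80/10/10 ratios, the fixed seed and the 0/1 labels are assumptions made for the example and are not specified in this embodiment.

```python
import random
from typing import List, NamedTuple, Tuple


class ApiEvent(NamedTuple):
    pid: int  # process id
    tid: int  # thread id
    api: str  # API call function name


class Sample(NamedTuple):
    events: List[ApiEvent]
    label: int  # hypothetical labeling: 1 = malicious, 0 = benign


def split_samples(samples: List[Sample], train: float = 0.8, val: float = 0.1,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Shuffle deterministically, then cut into training / validation / test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Ten toy samples, each holding a single recorded API call.
samples = [Sample([ApiEvent(100 + i, 3008, "CreateRectRgn")], 1) for i in range(10)]
train_set, val_set, test_set = split_samples(samples)
```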

S102, extracting the API call function name of the behavior data in the training data set.

S103, recording a plurality of threads where the behavior data are located based on the API call function name.

In this embodiment, the method may mark, for the API call function name of each sample, the thread in which that call is located.

In this embodiment, the method mainly records and preprocesses the data based on the plurality of threads.

In this embodiment, the method may determine the number of threads of the sample data (i.e., behavior data), and apply the random walk method among the threads when the number of threads is greater than 2.

S104, calling a random walk algorithm among the multiple threads to obtain walk data.

In this embodiment, the data after the random walk is added to the data set as a new sample, and its label is kept consistent with the label before the random walk.

S105, expanding the training data set based on the walk data to obtain an expanded data set.

In this embodiment, the method may apply the random walk algorithm across the different threads of the software's runtime data.

In this embodiment, the random walk algorithm used by the method is described as follows:

Assume the number of threads is M. One thread is selected as the starting point with probability 1/M, and m is set to 0; m is then updated to m + 1 at each step, and M steps of sampling without repetition are performed, where the sampling probability at each step is 1/(M - m). The above process is repeated M times, and finally each original sample becomes M! (M factorial) samples.
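A single inter-thread random walk of this kind can be sketched as below. The sketch assumes that a walk yields one ordering of the threads and that the new sample is formed by concatenating each thread's API calls in that order; this interpretation, the helper names `group_by_thread` and `random_walk`, and all API names other than CreateRectRgn are illustrative assumptions rather than details confirmed by the embodiment.

```python
import random


def group_by_thread(events):
    """Group (tid, api) events into per-thread API lists, preserving call order."""
    threads = {}
    for tid, api in events:
        threads.setdefault(tid, []).append(api)
    return threads


def random_walk(threads, rng):
    """Visit every thread exactly once.

    At step m, one of the M - m unvisited threads is drawn uniformly,
    i.e. each with probability 1 / (M - m).
    """
    remaining = list(threads)
    M = len(remaining)
    order = []
    for m in range(M):
        idx = rng.randrange(M - m)  # uniform over the M - m remaining threads
        order.append(remaining.pop(idx))
    # Assumed form of the new sample: per-thread sequences joined in walk order.
    walked = [(tid, api) for tid in order for api in threads[tid]]
    return order, walked


events = [(3008, "CreateRectRgn"), (3012, "NtOpenFile"), (3008, "DeleteObject"),
          (3020, "RegOpenKeyExW"), (3012, "NtReadFile")]
threads = group_by_thread(events)
order, walked = random_walk(threads, random.Random(7))
```

The walked sample keeps the original label, so each walk adds one more training sample for the same piece of software.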

S106, extracting all API call function names in the expanded data set.

S107, encoding the API call function names to obtain an API call function table.

In this embodiment, the method may first apply a unified encoding to all API functions, that is, ignore letter case.

In this embodiment, the method may further use word-level encoding to construct the API function vocabulary. In this encoding mode, each API function is treated as one word, and a padding token and an unknown-word token are added to the vocabulary to generate the API call function table.
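The encoding step can be sketched as follows. The token strings `[pad]` and `[unk]`, the function names, and the fixed output length are assumptions made for illustration; the embodiment only states that padding and unknown-word tokens are added to the vocabulary.

```python
from collections import Counter

PAD, UNK = "[pad]", "[unk]"  # hypothetical token strings


def normalize(api_name):
    """Unified encoding: ignore letter case."""
    return api_name.lower()


def build_vocab(sequences, max_size=20000):
    """Word-level vocabulary: one entry per API function, most frequent first."""
    counts = Counter(normalize(api) for seq in sequences for api in seq)
    vocab = {PAD: 0, UNK: 1}
    for api, _ in counts.most_common(max_size):
        vocab[api] = len(vocab)
    return vocab


def encode(seq, vocab, length):
    """Map API names to ids, truncate to `length`, then pad on the right."""
    ids = [vocab.get(normalize(api), vocab[UNK]) for api in seq][:length]
    return ids + [vocab[PAD]] * (length - len(ids))


sequences = [["CreateRectRgn", "NtOpenFile"], ["createrectrgn"]]
vocab = build_vocab(sequences)
encoded = encode(["CREATERECTRGN", "UnseenApi"], vocab, 4)
```

Here `encoded` holds the id of `createrectrgn`, then the unknown-word id for the unseen name, then two padding ids.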

S108, performing vectorization based on the API call function table to obtain a vectorized API call sequence.

In this embodiment, the method may vectorize the API call functions by pre-training: it generates an API function vectorization model using pre-training approaches such as word2vec and BERT, and stores the embedding vector of each API call function.

In this embodiment, the method can vectorize the API call sequence and construct an offline vector retrieval library; the offline retrieval library can be regarded as equivalent to the dynamic malware detection model and is a module that serves to detect API call behavior.

S109, constructing a dynamic malware detection model based on the API call sequence.

In this embodiment, for a model based on statistical learning (e.g., random forest), the method first uses a sentence vector method (e.g., weighted average, doc2vec) to convert the API function embedding vectors into an API call sequence vector; for deep learning methods such as TextCNN and LSTM, the embedding vectors of the API function sequence are input directly into the model.
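The weighted-average variant of the sentence vector method can be sketched as follows. The two-dimensional embeddings, the integer ids and the function name `sequence_vector` are toy assumptions; in practice the embeddings would come from the pre-trained model and the weights from whatever weighting scheme is chosen.

```python
def sequence_vector(api_ids, embeddings, weights=None):
    """Weighted average of per-API embedding vectors (uniform weights by default)."""
    if not api_ids:
        raise ValueError("empty API call sequence")
    if weights is None:
        weights = [1.0] * len(api_ids)
    dim = len(embeddings[api_ids[0]])
    total = sum(weights)
    vec = [0.0] * dim
    for api_id, w in zip(api_ids, weights):
        for k, value in enumerate(embeddings[api_id]):
            vec[k] += w * value
    return [v / total for v in vec]


# Toy embeddings keyed by API id (illustrative values only).
embeddings = {2: [1.0, 0.0], 3: [0.0, 1.0]}
uniform = sequence_vector([2, 3, 3], embeddings)
weighted = sequence_vector([2, 3, 3], embeddings, weights=[2.0, 1.0, 1.0])
```

The resulting fixed-length vector is what a statistical learner such as a random forest would take as input, whereas a deep model would consume the per-API embeddings directly.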

In this embodiment, the method may minimize the objective function when training the model.

S110, recording the API call behavior of the to-be-detected software running in real time.

In this embodiment, the method can record, in real time, the API call behavior of the running software.

S111, analyzing the API call behavior based on the dynamic malware detection model to obtain a malware detection result.

In this embodiment, the method may look up each API in the vector retrieval library based on the API function embeddings, thereby vectorizing the API, and then detect in real time, based on the vectorized APIs, whether the running software is malware.

Therefore, the method can use a random walk method to divide the collected API call sequence by thread id and perform a random walk among the threads; the method can also pre-train on the API call sequences and store the resulting API call embeddings offline, so that they can be retrieved directly during online detection.

For example, the method can be implemented by the following steps:

(1) Run the malware in the sandbox, collect behavior data, and construct a data set.

Specifically, the method can divide a large amount of malware data into a training set, a validation set and a test set, run the malware in the training set and the validation set in a sandbox respectively, and collect behavior information, where the behavior information of this embodiment includes the process pid, the thread tid, and the name of the called API function;

then, following the split of the original data, the method generates a new training set and a new test set from the collected behavior data, where each sample includes at least a process, a thread, and an API call;

(2) Apply a random walk method across the different threads of the software's runtime data.

Specifically, the method can tag each API call in a sample with the thread in which it occurs. The thread number is denoted tid, and each item is written as tid:API; for example, if tid 3008 corresponds to the API CreateRectRgn, the item is written as 3008:CreateRectRgn.
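The tagging step can be sketched as below, assuming that a tag is the thread id and the API name joined by a colon. The helper names and all API names other than CreateRectRgn are hypothetical and serve only to make the example concrete.

```python
def tag_event(tid, api):
    """Write one API call as a 'tid:API' tag."""
    return f"{tid}:{api}"


def parse_tag(tag):
    """Split a 'tid:API' tag back into (thread id, API name)."""
    tid, api = tag.split(":", 1)
    return int(tid), api


def thread_count(tags):
    """Number of distinct threads that appear in a tagged sample."""
    return len({parse_tag(tag)[0] for tag in tags})


tags = [tag_event(3008, "CreateRectRgn"), tag_event(3012, "NtOpenFile"),
        tag_event(3020, "RegOpenKeyExW"), tag_event(3008, "DeleteObject")]

# The random walk is applied only when the sample has more than 2 threads.
use_random_walk = thread_count(tags) > 2
```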

Then, the number of threads of the sample data is determined, and if it is greater than 2, the random walk method is applied among the threads. The random walk used by the method is described as follows: assume the number of threads is M; select 1 thread as the starting point with probability 1/M and set m = 0; then update m to m + 1 at each step and perform M steps of sampling without repetition, where the sampling probability at each step is p = 1/(M - m); the above process is repeated M times, and finally each original sample becomes M factorial, i.e. M!, samples, as shown in fig. 3.
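The count of M! can be checked with a small worked example. The sketch below assumes that each walk corresponds to one ordering of the threads and that the expanded sample concatenates the per-thread API calls in that order; the thread contents are illustrative. The product of the per-step probabilities 1/(M - m) for m = 0 to M - 1 is 1/M!, so there are M! equally likely, non-repeated orderings.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def ordering_probability(M):
    """Probability of one specific thread ordering: product of 1/(M - m) over all steps."""
    p = Fraction(1)
    for m in range(M):
        p *= Fraction(1, M - m)
    return p


def all_walks(threads):
    """Enumerate every non-repeated walk as a concatenated API sequence."""
    return [[api for tid in order for api in threads[tid]]
            for order in permutations(threads)]


threads = {
    3008: ["CreateRectRgn", "DeleteObject"],
    3012: ["NtOpenFile"],
    3020: ["RegOpenKeyExW"],
}
walks = all_walks(threads)
probability = ordering_probability(len(threads))
```

With M = 3 threads, this gives 3 × 2 × 1 = 6 distinct walks, each arising with probability 1/6.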

(3) The API call sequence is vectorized to construct an offline search library, as shown in fig. 4.

First, the method preprocesses the data and encodes all API functions uniformly, that is, ignoring letter case; for example, InternalSetIpstats and InternalSetTcpEntry are uniformly written as internalsetipstats and internalsettcpentry;

then, an API function vocabulary is built using word-level encoding, that is, encoding at the granularity of individual API functions, and a padding token and an unknown-word token are added to the vocabulary to generate a dictionary. The padding token is used for sequence padding, and the unknown-word token is used for APIs that are not in the vocabulary. Taking the Windows operating system as an example, this embodiment considers only the 20000 most commonly used APIs, and other APIs are represented by "[other]";

next, the API call functions are vectorized by pre-training: an API function vectorization model is generated using pre-training approaches such as word2vec and BERT. This embodiment uses BERT as the pre-trained model and stores the embedding vector of each API call function;

finally, an offline vector retrieval library is constructed based on an ANN (approximate nearest neighbor) method; this embodiment uses faiss as the vector retrieval tool.
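The role of the offline retrieval library can be illustrated with a small stand-in. The sketch below is not faiss and not an approximate method: it is an exact brute-force cosine search over a dictionary, written only to show the store-then-look-up pattern. The class name `VectorStore`, its methods and the two-dimensional toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class VectorStore:
    """Hypothetical offline store of API embeddings with exact nearest-neighbor search."""

    def __init__(self):
        self._items = {}

    def add(self, api_name, vector):
        self._items[api_name.lower()] = list(vector)  # names are case-insensitive

    def get(self, api_name):
        """Direct lookup of a stored embedding, or None if the API is unknown."""
        return self._items.get(api_name.lower())

    def nearest(self, query, k=1):
        """Names of the k stored APIs most similar to the query vector."""
        ranked = sorted(self._items.items(),
                        key=lambda item: cosine(query, item[1]), reverse=True)
        return [name for name, _ in ranked[:k]]


store = VectorStore()
store.add("CreateRectRgn", [1.0, 0.0])
store.add("NtOpenFile", [0.0, 1.0])
closest = store.nearest([0.9, 0.1])
```

A production system would replace the linear scan with an ANN index so that lookups stay fast as the number of stored vectors grows.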

(4) And constructing a dynamic malware detection model based on the API calling sequence.

Specifically, the method uses a vector representation of the API call sequence. For a model based on statistical learning, such as a random forest, the method first uses a sentence vector method, such as weighted average or doc2vec, to convert the API function embedding vectors into an API call sequence vector; for deep learning methods, such as TextCNN and LSTM, the embedding vectors of the API function sequence are input directly into the model. This embodiment uses TextCNN as the detection model.

Meanwhile, in the process of training the model, the method minimizes the objective function using a cross-entropy loss function. The formula itself is not reproduced in this text.
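The training objective can be illustrated with a standard mean binary cross-entropy. This is a minimal sketch under assumptions: treating detection as a two-class problem (1 = malicious, 0 = benign), the function name and the clipping constant are not taken from the source, and the patent's own formula may differ.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 0]
uninformed = cross_entropy([0.5, 0.5], labels)  # model that always predicts 0.5
confident = cross_entropy([0.9, 0.1], labels)   # mostly correct predictions
mistaken = cross_entropy([0.1, 0.9], labels)    # mostly wrong predictions
```

Training adjusts the model parameters so that this value decreases, which pushes the predicted probability toward each sample's label.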

(5) The online service judges whether the software running in real time is malicious.

First, the API call behavior of the running software is recorded in real time;

then, each API is looked up in the vector retrieval library based on the API function embeddings and is thereby vectorized;

finally, whether the running software is malicious is detected in real time.

In this embodiment, the execution subject of the method may be a computing device such as a computer or a server, which is not limited in this embodiment.

In this embodiment, the execution subject of the method may also be an intelligent device such as a smartphone or a tablet computer, which is not limited in this embodiment.

It can be seen that, by implementing the malware detection method described in this embodiment, malware can be detected dynamically in real time; the proposed random walk method alleviates some obfuscation problems and also has a data augmentation effect. In addition, storing the pre-trained API embedding vectors offline, as the method proposes, minimizes the latency of real-time detection, so that the detection effect is better.

Example 2

Referring to fig. 2, fig. 2 is a schematic structural diagram of a malware detection apparatus provided in this embodiment. As shown in fig. 2, the malware detection apparatus includes:

the building unit 210 is configured to collect behavior data generated when the malware runs in the sandbox, and build a training data set based on the behavior data;

an expansion unit 220, configured to expand the training data set through a random walk algorithm to obtain an expanded data set;

an obtaining unit 230, configured to obtain a vectorized API call sequence based on the extended data set;

the modeling unit 240 is configured to construct a dynamic malware detection model based on the API call sequence;

the detecting unit 250 is configured to detect the to-be-detected software running in real time based on the malware dynamic detection model, so as to obtain a malware detection result.

As an alternative embodiment, the expansion unit 220 includes:

a first extraction subunit 221, configured to extract an API call function name of behavior data in the training data set;

a recording subunit 222, configured to record, based on the API call function name, multiple threads where the behavior data is located;

a calling subunit 223, configured to call a random walk algorithm among the multiple threads, to obtain walk data;

and the expansion subunit 224 is configured to expand the training data set based on the walking data to obtain an expanded data set.

As an alternative embodiment, the obtaining unit 230 includes:

a second extraction subunit 231, configured to extract all API call function names in the extended data set;

a processing subunit 232, configured to encode the API call function name to obtain an API call function table;

the processing subunit 232 is further configured to perform vectorization processing based on the API call function table, so as to obtain a vectorized API call sequence.

As an alternative embodiment, the detection unit 250 includes:

the second recording subunit 251 is configured to record an API call behavior of the to-be-detected software running in real time;

and the analysis subunit 252 is configured to analyze the API call behavior based on the malware dynamic detection model to obtain a malware detection result.

In this embodiment, for the explanation of the malware detection apparatus, reference may be made to the description in embodiment 1, and details are not repeated in this embodiment.

It can be seen that, by implementing the malware detection device described in this embodiment, malware can be detected dynamically in real time; the proposed random walk method alleviates some obfuscation problems and also has a data augmentation effect. In addition, storing the pre-trained API embedding vectors offline, as the method proposes, minimizes the latency of real-time detection, so that the detection effect is better.

An embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute the malware detection method in embodiment 1 of the present application.

An embodiment of the present application provides a computer-readable storage medium, which stores computer program instructions, and when the computer program instructions are read and executed by a processor, the method for detecting malicious software in embodiment 1 of the present application is executed.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It should be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A malware detection method, the method comprising:

based on the extended data set, acquiring a vectorized API call sequence;

constructing a malicious software dynamic detection model based on the API calling sequence;

detecting the to-be-detected software running in real time based on the dynamic detection model of the malicious software to obtain a detection result of the malicious software;

wherein the step of expanding the training data set by the random walk algorithm to obtain an expanded data set comprises:

calling a random walk algorithm among the multiple threads to obtain walk data; the random walk algorithm is applied to different threads of software running data;

expanding the training data set based on the walking data to obtain an expanded data set;

the step of calling a random walk algorithm among the multiple threads to obtain walk data comprises the following steps:

selecting 1 thread as a starting point with a probability of 1/(M-m), wherein m is 0;

updating m to m+1 and selecting a new thread with a probability of 1/(M-m), until M sampling steps have been carried out, to obtain walk data;

acquiring all non-repeated walk data; wherein the number of the walk data is M × (M-1) × (M-2) × ... × 2 × 1, that is, the number of the walk data is M!.

2. The malware detection method of claim 1, wherein the behavioral data includes at least process information, thread information, and API call function name.

3. The malware detection method of claim 1, wherein obtaining a sequence of vectorized API calls based on the augmented dataset comprises:

extracting all API call function names in the extended data set;

coding the API calling function name to obtain an API calling function table;

4. The malware detection method according to claim 1, wherein the step of detecting the to-be-detected software running in real time based on the malware dynamic detection model to obtain a malware detection result comprises:

recording API calling behaviors of the to-be-detected software running in real time;

5. A malware detection apparatus, comprising:

the detection unit is used for detecting the to-be-detected software running in real time based on the malicious software dynamic detection model to obtain a malicious software detection result;

wherein the extension unit includes:

the calling subunit is used for calling a random walk algorithm among the multiple threads to obtain walk data; the random walk algorithm is applied to different threads of software running data;

the expansion subunit is used for expanding the training data set based on the walk data to obtain an expanded data set;

the number of threads of the multiple threads is M, and M is greater than 2; the calling subunit is specifically configured to select 1 thread as a starting point with a probability of 1/(M-m), where m is 0; update m to m+1 and select a new thread with a probability of 1/(M-m), until M sampling steps have been carried out, to obtain walk data; and acquire all non-repeated walk data; wherein the number of the walk data is M × (M-1) × (M-2) × ... × 2 × 1, that is, the number of the walk data is M!.

6. The malware detection apparatus of claim 5, wherein the behavioral data includes at least process information, thread information, and API call function names.

7. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the malware detection method of any one of claims 1-4.

8. A readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the malware detection method of any one of claims 1 to 4.