CN117113349A

CN117113349A - Malicious software detection method based on malicious behavior enhancement pre-training model

Info

Publication number: CN117113349A
Application number: CN202311076846.9A
Authority: CN
Inventors: 吴志杰; 寇亮
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2023-08-25
Filing date: 2023-08-25
Publication date: 2023-11-24

Abstract

The invention discloses a malware detection method based on a malicious-behavior-enhanced pre-training model. First, API call sequences are extracted from an API call sequence dataset. Second, all API call sequences are assembled into a corpus, and a set of API tuples reflecting malicious software behavior is generated according to a statistical rule. Third, the malware detection model is pre-trained on a public dataset, and unlabeled API call sequences are then fed into the model for masked language model (MLM) training. Finally, the model is fine-tuned with labeled API call sequences, and binary classification is achieved through linear dimension reduction to complete malware detection. The method enables the malware detection model to fully learn the malicious behavior characteristics of software and to accurately identify malicious behavior and malware.

Description

Malicious software detection method based on malicious behavior enhancement pre-training model

Technical Field

The invention relates to the technical field of software security protection, and in particular to a malware detection method based on a pre-training model.

Background

With the rapid growth of the Internet, the volume and variety of malware have also grown explosively. According to statistics from the German security company AV-TEST, the number of malware samples exceeded 900 million by July 2023, of which malware targeting the Windows platform accounted for more than 75 percent. Because Windows operating systems are widely used across many fields, malware can cause serious damage to user devices, data, and resources. Although Microsoft and software security companies have invested heavily in protecting user operating systems, the United States security company McAfee reports that the growth of malware has not been curbed, mainly because malware authors employ increasingly advanced anti-detection techniques that raise the probability of malware evading detection. Advanced and efficient malware detection methods are therefore needed.

Malware necessarily exhibits some form of malicious behavior, so one of the most common and effective approaches is to analyze the API call sequences generated while the malware runs, mine rules or patterns that reflect its behavioral characteristics, and use them to decide whether an unknown sample is malicious. In recent years, deep neural networks such as convolutional neural networks (CNNs) and bidirectional long short-term memory networks (BiLSTMs) have shown strong feature extraction and pattern discovery capabilities and have achieved good accuracy on malware detection. Meanwhile, as self-supervised pre-training models have obtained the best results on natural language processing tasks such as text classification, researchers have begun applying pre-training models to malware detection. The masked language model (MLM) pre-training task used by models such as Bidirectional Encoder Representations from Transformers (BERT) lets a model learn the contextual semantics of API call sequences, and studies indicate that incremental pre-training helps a model adapt better to downstream tasks. An attention-based pre-training model can capture not only the semantics of an API call sequence but also the relations between API calls. In malware detection, however, a single API call can rarely indicate on its own whether some form of malicious behavior is present, and existing pre-training models do not take this into account, so their detection results may be suboptimal. How to make a pre-training model learn the malicious behavior characteristics of software more accurately and effectively is therefore a problem to be solved.

Disclosure of Invention

To overcome the above shortcomings of the prior art, the invention provides a pre-training model that incorporates the behavioral characteristics of malware and can accurately detect malware in real samples.

The invention is realized by the following technical scheme:

The invention relates to a malware detection method based on a malicious-behavior-enhanced pre-training model, comprising the following steps:

Step 1: extract API call sequences from an API call sequence dataset.

Step 2: form a corpus from all API call sequences extracted from the dataset, and generate, according to a statistical rule, an API tuple set that reflects malicious software behavior.

Step 3: pre-train a malware detection model on a public dataset, then continue pre-training with unlabeled API call sequences by feeding them into the model and training it with the masked language model (MLM) task.

Step 4: fine-tune the malware detection model with labeled API call sequences, and achieve binary classification through linear dimension reduction, thereby completing malware detection.

The API call sequence dataset is the set of API call sequences extracted by running real software samples in a sandbox. It comprises an unlabeled API call sequence set used for pre-training the malware detection model and a labeled API call sequence set used for fine-tuning.

The API tuple set that reflects malicious software behavior is generated by a statistical rule as follows. The invention proposes TF-TW (Term Frequency-Term Weight) to evaluate which API tuples are helpful for detecting malicious behavior. All API call sequences in the API call sequence dataset are first split into 2-tuples of adjacent calls; for example, the sequence A1 A2 A3 A4 A5 … is split into the tuples A1A2, A2A3, A3A4, A4A5, …. TF denotes the frequency with which a tuple occurs in malware:

$$\mathrm{TF} = \frac{TN_m}{MTN}$$

where $TN_m$ is the number of times the tuple occurs in malware and $MTN$ is the total number of tuples in malware. The design idea of TW is that a tuple occurring only in malware is more representative of malicious behavior and should be given a larger weight, whereas a tuple occurring frequently in both malicious and non-malicious software does not reliably indicate malicious behavior and should be given a smaller weight. The invention therefore defines

$$\mathrm{TW} = \frac{TN_m}{TN_{all}} \times K$$

where $TN_{all}$ is the total number of times the tuple occurs. To account for behaviors that are infrequent but very likely to be malicious, the invention introduces the weight-amplifying parameter $K$: when $TN_m / TN_{all}$ is greater than 0.9, that is, when the tuple is very likely to indicate malicious behavior, $K$ is set to 100 to amplify the weight; otherwise $K$ is set to 1. The larger a tuple's TF-TW value $\mathrm{TF} \times \mathrm{TW}$, the more likely the tuple is to occur in malicious behavior and the less likely it is to occur in normal behavior. The tuples are sorted by TF-TW value in descending order, and the top 5% are extracted as the API tuple set that reflects malicious software behavior.

The unlabeled API call sequences are API call sequences extracted in the sandbox from software in the API call sequence dataset whose behavior is unknown.

The labeled API call sequences are API call sequences extracted in the sandbox from software in the API call sequence dataset whose behavior is known.

The malicious software detection model comprises:

(1) Embedding layer: the embedding layer includes a word embedding layer, a position embedding layer, and an API tuple information layer. The word embedding layer and the position embedding layer are each formed by fully connected layers. The word embedding layer converts each API in the input API call sequence into a fixed-dimension vector; the position embedding layer produces a vector representing the position of each API call in the input sequence, so that the model learns the order of the input API call sequence.

The API tuple information layer treats the API tuple set as a dictionary. For any API call sequence, in addition to the API tuples that match entries in the API tuple set directly, when both Tuple(n-1, n) and Tuple(n, n+1) are present in the API tuple set, the span from call n-1 to call n+1 is also treated as an API tuple, where Tuple(n-1, n) denotes a matched API tuple and n-1 and n denote the API calls in that tuple. An API tuple matching matrix is then built from the API call sequence and the matched API tuples: positions in the matrix that correspond to matched API tuples are filled with 1, and unmatched positions are filled with 0.

The output of the embedding layer is the sum of the outputs of the word embedding layer, the position embedding layer, and the API tuple information layer.

(2) Feature extraction layer: the feature extraction layer consists of 12 consecutive Transformer encoders, each followed by a GELU activation function. GELU has a smoother derivative, which mitigates vanishing gradients during training, and its nonlinear transformation resembles a sigmoid function, which lets its output fall in a wider range and speeds up model convergence. In addition to the residual connections and layer normalization common in deep neural networks, each encoder contains a multi-head attention mechanism and a feed-forward layer. The attention mechanism essentially computes correlations. The embedding-layer output $X_{embedding}$ is first linearly mapped to Q, K, and V, where Q denotes the query, K the key, and V the value:

$$Q = X_{embedding} W_Q, \quad K = X_{embedding} W_K, \quad V = X_{embedding} W_V$$

where $X_{embedding}$ is the matrix output by the embedding layer and $W_Q$, $W_K$, $W_V$ are weight matrices. The attention mechanism is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the feature dimension of the key matrix. The multi-head attention mechanism generates multiple sets of Q, K, and V from $X_{embedding}$ and computes one attention output per set. The output of the multi-head attention mechanism is the concatenation (Concat) of these attention outputs multiplied by a transformation matrix, so that the input and output matrices keep the same shape. The feed-forward layer applies two linear mappings followed by a ReLU activation function.

(3) Linear layer: in a neural network, the linear layer is a common network layer that implements a linear transformation between two layers of neurons. Its purpose is to linearly transform the input data, changing its distribution or dimension to facilitate subsequent nonlinear operations or other network layers. The linear layer can also be regarded as a feature transformation or dimension reduction technique. Here the linear module is mainly used for dimension reduction, which finally identifies whether the software is malware.

The technical effect of the invention is as follows. To address the problem that existing pre-training models cannot effectively learn the malicious behavior characteristics of software, the invention proposes TF-TW based on those characteristics, extracts the API tuples that help the malware detection model identify malicious behavior, makes full use of the malicious behavior information of software, and strengthens the model's learning of malicious behavior characteristics in both the pre-training and fine-tuning stages. Compared with the prior art, the invention enables the malware detection model to fully learn the malicious behavior characteristics of software, accurately identify malicious behavior, and detect malware.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a diagram of a model of the present invention;

FIG. 3 is a matrix diagram of malicious behavior information.

Detailed Description

As shown in fig. 1, the present embodiment relates to a method for detecting malicious software based on a pre-training model, including the following steps:

Step 1: extract the API call sequences.

Real software samples are run in a sandbox, and the extracted API call sequences form a dataset that comprises an unlabeled API call sequence set used for pre-training the malware detection model and a labeled API call sequence set used for fine-tuning.

Besides the API call sequence, the run-time analysis result of an executable file contains information that is not used in this embodiment, such as the call name (Call name) and the call process ID (Call PID). This embodiment therefore first extracts the API call sequence from the analysis result and then deletes the trailing W or A from each API call. The trailing W or A is deleted because it only indicates a different character encoding and has no effect on semantic information.
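The suffix removal described above can be sketched in a few lines of Python. This is a minimal illustration, not the embodiment's actual preprocessing code: the function names are hypothetical, and the rule of stripping only when the suffix follows a lowercase letter is an added assumption meant to avoid truncating names written entirely in capitals.

```python
import re

# Assumption: strip a trailing A or W only when it follows a lowercase letter.
# The source text only states that the trailing W or A is deleted.
_ENCODING_SUFFIX = re.compile(r"(?<=[a-z])[AW]$")


def normalize_api_name(name: str) -> str:
    """Remove the trailing encoding suffix (A or W) from one API name."""
    return _ENCODING_SUFFIX.sub("", name)


def normalize_sequence(calls):
    """Apply the suffix removal to every call in an API call sequence."""
    return [normalize_api_name(call) for call in calls]
```

For instance, `normalize_sequence(["CreateFileW", "RegOpenKeyExA", "NtClose"])` maps the first two names to `CreateFile` and `RegOpenKeyEx` and leaves `NtClose` unchanged.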

Step 2: generate the API tuple set.

In natural language processing (NLP), TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for evaluating how important a word is to a document in a corpus. TF measures how frequently a word occurs, and IDF indicates how much information the word provides. In malware detection, TF-IDF has also been studied as a way to weight APIs or API run-time parameters so that malicious software behavior is identified more reliably. Drawing on the idea of TF-IDF, this embodiment proposes TF-TW (Term Frequency-Term Weight) to evaluate which API tuples are helpful for detecting malicious behavior. All API call sequences are first split into 2-tuples of adjacent calls; for example, the API call sequence A1 A2 A3 A4 A5 … is split into the tuples A1A2, A2A3, A3A4, A4A5, …. TF denotes the frequency with which a tuple occurs in malware:

$$\mathrm{TF} = \frac{TN_m}{MTN}$$

where $TN_m$ is the number of times the tuple occurs in malware and $MTN$ is the total number of tuples in malware. The design idea of TW is that a tuple occurring only in malware is more representative of malicious behavior and should be given a larger weight, whereas a tuple occurring frequently in both malicious and non-malicious software does not reliably indicate malicious behavior and should be given a smaller weight. The invention therefore defines

$$\mathrm{TW} = \frac{TN_m}{TN_{all}} \times K$$

where $TN_{all}$ is the total number of times the tuple occurs. To account for behaviors that are infrequent but very likely to be malicious, the invention introduces the weight-amplifying parameter $K$: when $TN_m / TN_{all}$ is greater than 0.9, that is, when the tuple is very likely to indicate malicious behavior, $K$ is set to 100 to amplify the weight; otherwise $K$ is set to 1. The larger a tuple's TF-TW value $\mathrm{TF} \times \mathrm{TW}$, the more likely the tuple is to occur in malicious behavior and the less likely it is to occur in normal behavior. The tuples are sorted by TF-TW value in descending order, and the top 5% are extracted as the API tuple set that reflects malicious software behavior.
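The TF-TW scoring can be illustrated with a short Python sketch, assuming TF = TN_m / MTN and TW = (TN_m / TN_all) × K with K = 100 when the ratio exceeds 0.9 and K = 1 otherwise, as stated in the text. The function names, the toy sequences, and the choice to rank only tuples that occur at least once in malware (others score zero) are illustrative assumptions.

```python
import math
from collections import Counter


def bigrams(seq):
    """Split one API call sequence into 2-tuples of adjacent calls."""
    return list(zip(seq, seq[1:]))


def tf_tw_scores(malicious, benign, ratio_threshold=0.9, boost=100.0):
    """Return {tuple: TF * TW} for every tuple that occurs in malware."""
    mal_counts = Counter(t for seq in malicious for t in bigrams(seq))
    all_counts = mal_counts + Counter(t for seq in benign for t in bigrams(seq))
    mtn = sum(mal_counts.values())           # MTN: total tuples in malware
    scores = {}
    for tup, tn_m in mal_counts.items():
        ratio = tn_m / all_counts[tup]       # TN_m / TN_all
        k = boost if ratio > ratio_threshold else 1.0
        scores[tup] = (tn_m / mtn) * ratio * k
    return scores


def top_tuples(scores, fraction=0.05):
    """Keep the highest-scoring fraction of tuples (at least one)."""
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])


# Toy data, purely hypothetical.
malicious = [["A", "B", "C"], ["A", "B", "D"]]
benign = [["B", "C", "E"]]
scores = tf_tw_scores(malicious, benign)
api_tuple_set = top_tuples(scores)
```

In this toy run the tuple `("A", "B")` occurs only in malware, so its ratio is 1 and it receives the amplified weight, whereas `("B", "C")` is shared with the benign sequence and keeps K = 1.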

Step 3: the malware detection model is first pre-trained on a public text dataset from Wikipedia and is then further pre-trained on the unlabeled API call sequences: the unlabeled API call sequences are fed into the malware detection model, which is trained with the masked language model (MLM) task.

A language model task designs a network structure and then uses a large amount of unlabeled natural language text to extract linguistic knowledge and encode it into that network. When labeled data for a specific task is limited, these prior linguistic features can substantially supplement the features of the task and thereby strengthen the model's generalization ability. Training a neural network model to convergence requires a large number of samples, but obtaining malware samples and extracting the API call sequences they produce at run time is not easy. The greatest benefit of a pre-training model is the reusability of its low-level features, which helps address the shortage of samples.

As shown in fig. 2, the malware detection model includes an embedding layer, a feature extraction layer, and a linear layer.

The embedding layer includes a word embedding layer, a position embedding layer, and an API tuple information layer.

The word embedding layer and the position embedding layer are each formed by fully connected layers. The word embedding layer converts each API in the input API call sequence into a fixed-dimension vector; the position embedding layer produces a vector representing the position of each API call in the input sequence, so that the model learns the order of the input API call sequence.

(1) The word embedding layer converts each API in the input API call sequence into a fixed-dimension vector, 768 dimensions in this embodiment.

(2) The position embedding layer lets the model learn the order of the input sequence.

(3) The API tuple information layer adds API tuple information to the malware detection model so that the model learns the relations among API tuples and identifies malware more reliably. Specifically, the previously extracted API tuple set is treated as a dictionary. For any API call sequence, in addition to the API tuples that match entries in the API tuple set directly, when both Tuple(n-1, n) and Tuple(n, n+1) are present in the API tuple set, the span from call n-1 to call n+1 is also treated as an API tuple, where Tuple(n-1, n) denotes a matched API tuple and n-1 and n denote the API calls in that tuple. An API tuple matching matrix is then built from the API call sequence and the matched API tuples: positions in the matrix that correspond to matched API tuples are filled with 1, and unmatched positions are filled with 0.

The output of the embedding layer is the sum of the outputs of the word embedding layer, the position embedding layer, and the API tuple information layer. Fig. 3 is a flowchart of generating API tuple information.
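The tuple matching step can be sketched as follows. The text does not specify the exact layout of the API tuple matching matrix, so this sketch makes two loud assumptions: overlapping matched 2-tuples are merged into one contiguous span, and the matrix is a square sequence-length by sequence-length grid whose entry (i, j) is 1 when calls i and j fall inside the same matched span. All function names are hypothetical.

```python
def match_spans(seq, tuple_set):
    """Return (start, end) index spans covered by matched 2-tuples.

    Consecutive matches that share a call, such as Tuple(n-1, n) and
    Tuple(n, n+1), are merged into a single span.
    """
    spans = []
    for i in range(len(seq) - 1):
        if (seq[i], seq[i + 1]) in tuple_set:
            if spans and spans[-1][1] == i:   # overlaps the previous match
                spans[-1][1] = i + 1
            else:
                spans.append([i, i + 1])
    return [tuple(span) for span in spans]


def matching_matrix(seq, tuple_set):
    """Build a 0/1 matrix marking positions that share a matched span."""
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]
    for start, end in match_spans(seq, tuple_set):
        for i in range(start, end + 1):
            for j in range(start, end + 1):
                matrix[i][j] = 1
    return matrix


# Toy example with placeholder API names.
sequence = ["A", "B", "C", "D", "E"]
api_tuple_set = {("A", "B"), ("B", "C"), ("D", "E")}
spans = match_spans(sequence, api_tuple_set)
matrix = matching_matrix(sequence, api_tuple_set)
```

Here the matches at positions (0, 1) and (1, 2) merge into the span 0 to 2, while (3, 4) stays a separate span. How the resulting matrix is projected and summed with the word and position embeddings is not described in the text and is left out of the sketch.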

The feature extraction layer consists of 12 consecutive Transformer encoders, each followed by a GELU activation function. GELU has a smoother derivative, which mitigates vanishing gradients during training, and its nonlinear transformation resembles a sigmoid function, which lets its output fall in a wider range and speeds up model convergence. In addition to the residual connections and layer normalization common in deep neural networks, each encoder contains a multi-head attention mechanism and a feed-forward layer. The attention mechanism essentially computes correlations. The embedding-layer output $X_{embedding}$ is first linearly mapped to Q, K, and V, where Q denotes the query, K the key, and V the value:

$$Q = X_{embedding} W_Q, \quad K = X_{embedding} W_K, \quad V = X_{embedding} W_V$$

where $X_{embedding}$ is the matrix output by the embedding layer and $W_Q$, $W_K$, $W_V$ are weight matrices. The attention mechanism is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the feature dimension of the key matrix. The multi-head attention mechanism generates multiple sets of Q, K, and V from $X_{embedding}$ and computes one attention output per set. The output of the multi-head attention mechanism is the concatenation (Concat) of these attention outputs multiplied by a transformation matrix, so that the input and output matrices keep the same shape. The feed-forward layer applies two linear mappings followed by a ReLU activation function.
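A single attention head can be illustrated with a dependency-free Python sketch. It assumes the standard scaled dot-product form, softmax(Q Kᵀ / √d_k) V, which is consistent with the d_k term mentioned in the text; the helper names are hypothetical, and a real model would use a tensor library rather than nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])                                   # key feature dimension
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)                  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)                         # weighted sum of values


# With all-zero queries and keys every position gets equal weight,
# so the output is the mean of the value rows.
out = attention([[0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Multi-head attention repeats this computation with several independent Q, K, V projections and concatenates the results before the final transformation matrix.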

The linear layer implements a linear transformation between the two layers of neurons. The purpose of the linear layer is to linearly transform the input data, thereby changing its distribution or dimension, facilitating subsequent nonlinear operations or other network layers. The linear layer is considered a feature transformation or dimension reduction technique. The present embodiment uses linear layer dimension reduction to identify whether the software is malware.
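As a rough illustration of how a linear layer reduces a feature vector to a binary decision, the sketch below maps a small vector to two logits and picks the larger one. The weights, the three-dimensional input, the label order, and the function names are all hypothetical; in the embodiment the input would be the 768-dimensional encoder output and the weights would be learned during fine-tuning.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def classify(x, weight, bias, labels=("benign", "malware")):
    """Reduce the feature vector to two logits and return the larger class."""
    logits = linear(x, weight, bias)
    best = max(range(len(logits)), key=logits.__getitem__)
    return labels[best]


# Hypothetical 3-dimensional feature vector and 2x3 weight matrix.
features = [1.0, 0.0, 2.0]
weight = [[0.5, 0.0, 0.0],
          [0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
prediction = classify(features, weight, bias)
```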

This embodiment adopts the masked language model (MLM) as the pre-training task: some tokens in the sequence are randomly masked, and the malware detection model is trained to predict the masked tokens. Through this training, the model learns the contextual relations of individual API functions in the API call sequence and, through the API tuple information, the contextual relations of API combinations. Research shows that domain-adaptive pre-training and task-adaptive pre-training can further improve the performance of a pre-training model. This embodiment therefore initializes the model with the parameters of a pre-trained model from Hugging Face that was trained on general-knowledge and computer-domain corpora, and then continues pre-training the malware detection model on the unlabeled API call sequences.

Step 4: fine-tune the model with the labeled API call sequences so that the malware detection model fits the binary classification task, finally achieving malware detection.
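The random masking behind MLM training can be sketched as below. The masking probability of 0.15 and the `[MASK]` token follow common BERT practice and are assumptions, since the text does not state the values used; the function name is hypothetical, and the refinements BERT applies to selected tokens (random replacement, keeping the original) are omitted.

```python
import random


def mask_sequence(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask tokens; return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss is computed only on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


calls = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose"]
masked_calls, targets = mask_sequence(calls, seed=0)
```

The model receives `masked_calls` as input and is trained to recover the entries of `targets` that are not `None`.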

The method was implemented and tested in the following environment: CPU Intel Core i513@2.20GHz, GPU NVIDIA RTX 3060 12GB, operating system Windows 10, simulation environment Python.

This embodiment evaluates the efficiency and effectiveness of the algorithm through experiments on a real malware API call sequence dataset consisting of two parts: the first part contains 40,000 unlabeled API call sequences for pre-training the malware detection model; the second part contains 40,000 labeled API call sequences for fine-tuning the pre-trained model. In the second part, malware and benign software API call sequences each account for half. The API call sequence dataset was published on GitHub by a third party; all data come from software samples run in a sandbox and were extracted from the run reports. In the experiments, this embodiment is referred to as MalTBERT, and the results are given in Table 1 in percent (%).

TABLE 1

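The experiments use Accuracy, Precision, Recall, and F1-Score as model evaluation metrics, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$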

where TP (True Positive) represents classifying malware as malware, TN (True Negative) represents classifying benign software as benign software, FN (False Negative) is classifying malware as benign software, and FP (False Positive) is classifying benign software as malware.
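The four metrics follow directly from the confusion-matrix counts defined above, using their standard definitions. The sketch below is illustrative only: the function name is hypothetical, and the counts in the example are made up and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 90 malware samples caught, 10 missed,
# 85 benign samples passed, 15 benign samples flagged.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```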

Experimental results show that MalTBERT achieves very high scores on the test set. As shown in Table 1, the pre-trained models BERT and RoBERTa (a pre-trained language model based on the Transformer architecture) both exceed 95% on the evaluation metrics on the test set. This shows that pre-trained models can achieve good malware detection capability when trained on API call sequence datasets. MalTBERT shows better detection capability than BERT and RoBERTa: its Accuracy and F1-Score reach 98.9%, and its Recall reaches 99.8%. This indicates that the API tuple information is highly effective in helping the model identify malware.

This embodiment compares MalTBERT on the test dataset, in terms of Accuracy, Precision, Recall, and F1-Score, with four commonly used machine learning methods, namely KNN (k-nearest neighbors), LR (logistic regression), Bayes (Bayesian algorithm), and SVM (support vector machine), and with two deep learning methods, namely the text convolutional neural network (Text-CNN) and the bidirectional long short-term memory network (BiLSTM). As shown in Table 1, MalTBERT outperforms the other deep learning and machine learning models. For example, MalTBERT reaches an F1-Score of 98.92%, whereas the best of the other models, Text-CNN, reaches 96.18%. The method thus outperforms the existing methods on all four metrics.

Those skilled in the art may modify the foregoing embodiments in various ways without departing from the principle and spirit of the invention. The scope of the invention is defined by the claims and not by the foregoing embodiments, and all implementations within that scope are covered by the invention.

Claims

1. A malware detection method based on a malicious-behavior-enhanced pre-training model, characterized by comprising the following steps:

step 1, extracting API call sequences from an API call sequence dataset;

step 2, forming a corpus from all the API call sequences, and generating, according to a statistical rule, an API tuple set reflecting malicious software behavior;

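step 3, pre-training a malware detection model using a public dataset, and then continuing pre-training using unlabeled API call sequences;

step 4, fine-tuning the malware detection model using labeled API call sequences, and achieving binary classification through linear dimension reduction, thereby completing malware detection.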

2. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 1, wherein the API call sequence dataset in step 1 is the set of API call sequences extracted by running real software samples in a sandbox, comprising an unlabeled API call sequence set used for pre-training the malware detection model and a labeled API call sequence set used for fine-tuning.

3. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 2, wherein the specific process of generating, according to the statistical rule, the API tuple set reflecting malicious software behavior in step 2 is as follows:

2.1, splitting all API call sequences into 2-tuples;

2.2, defining TF as the frequency with which a tuple occurs in malware: $\mathrm{TF} = TN_m / MTN$, where $TN_m$ is the number of times the tuple occurs in malware and $MTN$ is the total number of tuples in malware;

defining $\mathrm{TW} = (TN_m / TN_{all}) \times K$, where $TN_{all}$ is the total number of times the tuple occurs and $K$ is a parameter;

2.3, defining the TF-TW value as $\mathrm{TF} \times \mathrm{TW}$, sorting the TF-TW values in descending order, and extracting the tuples with the top 5% of TF-TW values as the API tuple set reflecting malicious software behavior.

4. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 3, wherein, in step 3 and step 4, the unlabeled API call sequence refers to an API call sequence extracted in a sandbox from software in the API call sequence dataset whose behavior is unknown;

5. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 4, wherein, in step 3, the malware detection model comprises an embedding layer, a feature extraction layer, and a linear layer connected in sequence;

the embedding layer comprises a word embedding layer, a position embedding layer, and an API tuple information layer;

the feature extraction layer consists of 12 consecutive Transformer encoders, each followed by a GELU activation function; each encoder further comprises a multi-head attention mechanism and a feed-forward layer, wherein the feed-forward layer applies two linear mappings followed by a ReLU activation function.

6. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 5, wherein the word embedding layer and the position embedding layer are each formed by fully connected layers; the word embedding layer converts each API in the input API call sequence into a fixed-dimension vector; the position embedding layer produces a vector representing the position of each API call in the input sequence, so as to learn the order of the input API call sequence;

the API tuple information layer treats the API tuple set as a dictionary, according to which, for any API call sequence, in addition to the API tuples that match entries in the API tuple set directly, when both Tuple(n-1, n) and Tuple(n, n+1) are present in the API tuple set, the span from call n-1 to call n+1 is also treated as an API tuple, where Tuple(n-1, n) denotes a matched API tuple and n-1 and n denote the API calls in that tuple; an API tuple matching matrix is then built from the API call sequence and the matched API tuples, wherein positions in the matrix that correspond to matched API tuples are filled with 1 and unmatched positions are filled with 0;

7. The malware detection method based on a malicious-behavior-enhanced pre-training model according to claim 6, wherein, in step 3, continuing pre-training using the unlabeled API call sequences specifically comprises: feeding the unlabeled API call sequences into the malware detection model and training the malware detection model with the masked language model (MLM) task.