CN115545091A

CN115545091A - Ensemble learner-based malicious program API (application programming interface) call sequence detection method

Info

Publication number: CN115545091A
Application number: CN202211015410.4A
Authority: CN
Inventors: 杨强; 汪金明; 杨涛; 阮伟; 王文海
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-08-24
Filing date: 2022-08-24
Publication date: 2022-12-30

Abstract

The invention provides a malicious program API call sequence detection method based on an ensemble learner. The method uses a plurality of base classifiers as the first layer of the ensemble learner, and the training set samples are labeled for supervised learning. After model training is finished, a base classifier group is obtained; each base classifier in the group labels the validation set samples, finally yielding a base label vector set in which each vector consists of N base labels. The base label vector set is input as a training set into a meta-classification model to train a meta classifier. In practical application, after the base classifier group and the meta classifier have been trained, the preprocessed API call sequence is input to the base classifier group to obtain a base label vector, the base label vector is input to the meta classifier, and the meta classifier gives the final label of the API sequence.

Description

Ensemble learner-based malicious program API (application programming interface) call sequence detection method

Technical Field

The invention relates to a malicious program API (application programming interface) call sequence detection method, in particular to a malicious program API call sequence detection method based on an ensemble learner.

Background

In network security protection, defending against malware attacks is one of the key problems that urgently needs to be solved. At present, common antivirus software mostly adopts static detection methods, such as virus library signature-based detection and heuristic detection. These traditional methods mostly extract static features from the malware, including malicious features such as byte sequences, matched character strings and code sequences, and therefore detect known viruses well. As malware production techniques develop, however, attackers continually improve their malware, for example by obfuscating the features of the malicious code or changing the structure of its runtime behavior, and traditional detection strategies have little success against such modified malware or new viruses. Dynamic behavior detection emerged to address these limitations of static detection. Dynamic analysis places the malware in a secure virtual environment called a sandbox and then collects its behavioral data, such as registry activity, logs, network traffic and API call sequences. Although dynamic analysis consumes more resources than static analysis, it does not require reverse engineering methods such as decompilation, and it analyzes the dynamic features obtained after the program runs, so it is more effective and more robust against modified viruses and even new viruses.

Malware detectors based on dynamic analysis focus primarily on the API behavior information obtained from the software's interaction with the operating system at runtime. Intelligent malware detection methods based on API behavior features can identify unknown malicious files whose behavior resembles that of known malicious files, and are not affected by techniques such as polymorphism, code obfuscation, encryption and packing. Monitoring an executable file through the sequence of API calls between its process and the operating system is one of the best strategies, since API calls are considered the most important behavioral difference between malicious and non-malicious processes.

Many studies today take API calls as the main feature and deep learning algorithms as the main classification tool for detecting malicious samples. However, a detection system based on a single classifier is relatively unstable, because a single algorithm is prone to overfitting and is easily disturbed by adversarial attacks. How to use the multi-classifier nature of ensemble learning to improve the effectiveness of the API sequence detector and the robustness of the whole detection system has therefore become a key problem to be solved in the field.

Disclosure of Invention

The invention aims to improve the effectiveness and robustness of malware detectors based on API behavior detection in the field of network security protection. To address the poor effectiveness and poor robustness of existing detectors of this kind, a further optimized method is provided: a malicious sample API call sequence detection method based on an ensemble learner, which offers guidance for improving the effectiveness and robustness of malware detectors based on API behavior detection.

The purpose of the invention can be realized by the following technical scheme:

A malicious program API call sequence detection method based on an ensemble learner comprises the following steps:

(1) Obtaining malicious API sequence samples and benign API sequence samples.

(2) Extracting the feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, dividing the data set into a training set and a validation set according to a proportion, and training N base classification models with the training set to obtain the trained base classifiers.

(3) Classifying the validation set data with the trained base classifiers, outputting a base label vector for the feature vector of each sample to form a base label vector set, and training the meta-classification model with the base label vector set to obtain the trained meta classifier.

(4) Running the executable program to be detected in a sandbox to obtain its API call sequence, extracting the feature vector of that API call sequence, and detecting it with the ensemble learner.

As a preferred scheme of the invention, the number N of the base classification models is more than or equal to 5.

As a preferred scheme of the invention, the proportion of the training set to the validation set is 6.

As a preferred embodiment of the present invention, the malicious samples and the benign samples are both executable program samples with known labels, a label being the assessment of whether a sample is malicious or benign.

Further, step (1) is specifically: running the malicious samples and the benign samples in a Cuckoo sandbox to obtain the malicious API sequence samples and the benign API sequence samples, respectively.
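As an illustration of step (1), the following minimal sketch reads an API call sequence out of a parsed sandbox report. It assumes a Cuckoo-style JSON layout in which calls are listed per process under `behavior` → `processes` → `calls`, each with an `api` field; that layout, the function name `api_sequence_from_report` and the sample fragment are assumptions for illustration and should be checked against the sandbox version actually used.

```python
import json


def api_sequence_from_report(report):
    """Return the API names found in a parsed sandbox report, process by process."""
    sequence = []
    # Assumed layout: report["behavior"]["processes"][k]["calls"][j]["api"]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:
                sequence.append(name)
    return sequence


# Hand-written report fragment for illustration (not real sandbox output).
sample_report = json.loads("""
{"behavior": {"processes": [
  {"pid": 1, "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
  {"pid": 2, "calls": [{"api": "RegOpenKeyExW"}]}
]}}
""")

print(api_sequence_from_report(sample_report))
```

The sketch simply concatenates the calls of each process in report order; merging calls from several processes by timestamp would be a separate design choice that the text does not specify.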

Further, step (2) specifically comprises:

(2.1) Extracting the features of the malicious API sequence samples and the benign API sequence samples, wherein each API sequence sample corresponds to a 1 × q dimensional feature vector, to form a feature vector data set that is divided into a training set and a validation set, according to the formulas:

$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$

$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$

$$\mathrm{API\_sequence} = \{\mathrm{Mal}'_1, \dots, \mathrm{Mal}'_m, \mathrm{Benign}'_1, \dots, \mathrm{Benign}'_n\}$$

where Feature_extraction() is the feature extraction algorithm; $\mathrm{Mal}_i$ is the i-th malicious API sequence sample; $\mathrm{Mal}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Mal}_i$ after feature extraction; $\mathrm{Benign}_i$ is the i-th benign API sequence sample; $\mathrm{Benign}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Benign}_i$ after feature extraction; API_sequence is the set of all feature vectors of the malicious and benign API sequences; m is the number of malicious samples; and n is the number of benign samples.
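The text does not specify the feature extraction algorithm. As a minimal sketch only, the following assumes a bag-of-API-calls representation: each sequence is mapped to a count vector over a fixed vocabulary of q API names. The vocabulary, the sample sequences and the choice of counting are all hypothetical stand-ins for whatever Feature_extraction() actually is.

```python
from collections import Counter


def feature_extraction(api_sequence, vocabulary):
    """Map an API call sequence to a 1 x q count vector, with q = len(vocabulary)."""
    counts = Counter(api_sequence)
    # A Counter returns 0 for API names that never occur in the sequence.
    return [counts[name] for name in vocabulary]


# Hypothetical vocabulary (q = 4) and hypothetical samples.
vocabulary = ["NtCreateFile", "NtWriteFile", "RegOpenKeyExW", "connect"]
mal_1 = ["NtCreateFile", "NtWriteFile", "NtWriteFile", "connect"]
benign_1 = ["NtCreateFile", "RegOpenKeyExW"]

mal_1_features = feature_extraction(mal_1, vocabulary)
benign_1_features = feature_extraction(benign_1, vocabulary)

# The feature vector data set collects the vectors of all malicious and benign samples.
api_sequence_set = [mal_1_features, benign_1_features]
print(api_sequence_set)
```

A count vector discards call order; an order-preserving encoding would be needed for sequence models such as recurrent networks.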

(2.2) Training the N base classification models with the training set, each base classification model being set to output either a Probability or a Predict value as its label. Probability is the likelihood that the sample is malicious and ranges from 0 to 1: the closer the value is to 1, the more the base classifier tends to judge the sample as malicious, and the closer it is to 0, the more it tends to judge the sample as benign. The Predict value is the prediction of the base classifier and is either 0 or 1: when it is 1, the classifier judges the sample to be malicious, and when it is 0, benign.
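The relation between the two kinds of base label can be sketched as follows. The cut-off of 0.5 is an assumption made for illustration; the text states only that values near 1 indicate a malicious sample and values near 0 a benign one.

```python
def predict_label(probability, threshold=0.5):
    """Turn a probability in [0, 1] into a 0/1 prediction (1 = malicious, 0 = benign)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # The 0.5 default threshold is an illustrative assumption.
    return 1 if probability >= threshold else 0


probabilities = [0.93, 0.08, 0.51]
print([predict_label(p) for p in probabilities])
```

The probability form carries more information to the meta classifier, while the 0/1 form is a hard decision; the text treats the choice between them as an option.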

Further, step (3) is specifically:

(3.1) After the training of the N base classification models in (2.2) is finished, N corresponding base classifiers are obtained; the validation set is then used as input, and the base classifiers produce the base label vector set of the validation set, according to the formulas:

$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, \dots, P_{i,N}, L_i]$$

$$V = \{V_1, V_2, \dots, V_l\}$$

where $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, P_{i,3}, \dots, P_{i,N}$ are the output values of the N base classifiers for the i-th feature vector in the validation set; $L_i$ is the true label of the i-th feature vector sample; V is the base label vector set obtained from the validation set; and l is the number of samples in the validation set.

(3.2) Taking the base label vector set as the training set of the meta-classification model and training the meta classifier; after training, the base classifiers and the meta classifier are combined to obtain the complete ensemble learner.
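The two-layer training in steps (2.2), (3.1) and (3.2) can be sketched in plain Python. The toy models below (`ThresholdStump`, `NearestNeighbourMeta`) and all data values are hypothetical stand-ins chosen so that the example runs without external libraries; they are not the classifiers used in the invention. The point illustrated is only the data flow: base classifiers are fitted on the training set, and the meta classifier is fitted on their outputs for the held-out validation set.

```python
from statistics import mean


class ThresholdStump:
    """Toy base classifier: thresholds one feature at the midpoint of the class means."""

    def __init__(self, feature_index):
        self.feature_index = feature_index

    def fit(self, X, y):
        pos = [x[self.feature_index] for x, label in zip(X, y) if label == 1]
        neg = [x[self.feature_index] for x, label in zip(X, y) if label == 0]
        self.threshold = (mean(pos) + mean(neg)) / 2
        self.malicious_is_high = mean(pos) >= mean(neg)
        return self

    def predict(self, x):
        high = x[self.feature_index] >= self.threshold
        return int(high == self.malicious_is_high)


class NearestNeighbourMeta:
    """Toy 1-nearest-neighbour meta classifier using Hamming distance."""

    def fit(self, vectors, labels):
        self.vectors, self.labels = vectors, labels
        return self

    def predict(self, vector):
        distances = [sum(a != b for a, b in zip(vector, v)) for v in self.vectors]
        return self.labels[distances.index(min(distances))]


# Hypothetical 1 x 5 feature vectors; label 1 = malicious, 0 = benign.
X_train = [[9, 1, 7, 8, 2], [8, 2, 6, 9, 1], [1, 8, 2, 1, 9], [2, 9, 1, 2, 8]]
y_train = [1, 1, 0, 0]
X_val = [[7, 2, 8, 3, 1], [9, 3, 5, 7, 2], [2, 7, 1, 1, 8], [1, 9, 3, 6, 9]]
y_val = [1, 1, 0, 0]

# Step (2.2): train N = 5 base classifiers on the training set only.
base_classifiers = [ThresholdStump(j).fit(X_train, y_train) for j in range(5)]

# Step (3.1): V_i = [P_i1, ..., P_iN, L_i] for every validation sample.
V = [[clf.predict(x) for clf in base_classifiers] + [label]
     for x, label in zip(X_val, y_val)]

# Step (3.2): train the meta classifier on the base label vector set.
meta = NearestNeighbourMeta().fit([v[:-1] for v in V], [v[-1] for v in V])

print(V)
print(meta.predict([1, 1, 0, 1, 1]))
```

Because the meta classifier never sees base-classifier outputs on the data those classifiers were fitted to, its training inputs reflect how the base classifiers behave on unseen samples.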

The beneficial effects of the invention are:

(1) The invention establishes a novel ensemble learning classification model comprising a plurality of base classifiers and a meta classifier. The base classifiers are trained with the training set; the validation set data is then used as input to obtain the outputs of the base classifiers, namely the base label vectors, which serve as the training data of the meta classifier, finally yielding a complete ensemble learning classification model. If only a single classifier were adopted, the classification effect would not be ideal and the resistance to adversarial attack would be poor. The invention therefore adopts N (N ≥ 5) different base classifiers and one meta classifier as the key nodes of the classification model, which improves both the performance of the classifier and the robustness and adversarial resistance of the model.

(2) Compared with existing common ensemble learning methods, the invention differs in the data used to train the meta classifier. Common ensemble learning methods use the labels output during the training of the base classifiers, whereas in the invention the training data of the meta classifier comes from the labels that the trained base classifiers output on the validation set samples. Common ensemble learning methods are prone to overfitting, and the method adopted by the invention can keep the model from falling into overfitting.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings.

The invention discloses a malicious program API (application programming interface) call sequence detection method based on an ensemble learner, which comprises the following steps:

(1) Obtaining malicious API sequence samples and benign API sequence samples.

(2) Extracting the feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, dividing the data set into a training set and a validation set according to a proportion, and training N base classification models with the training set.

(3) Classifying the validation set data with the trained base classifiers, outputting a base label vector for the feature vector of each sample to form a base label vector set, and training the meta-classification model with the base label vector set to obtain the trained meta classifier.

(4) Running the executable program to be detected in a sandbox to obtain its API call sequence, extracting the feature vector of that API call sequence, and detecting it with the ensemble learner.
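The detection path of step (4), once the feature vector is available, can be sketched as below. The five rule-based base classifiers and the vote-counting meta classifier are hypothetical placeholders for trained models; in particular, a fixed vote count is not a trained meta classifier and is used here only to keep the sketch self-contained.

```python
def detect(feature_vector, base_classifiers, meta_classifier):
    """Return (final_label, base_labels) for one feature vector; 1 = malicious."""
    # Each trained base classifier contributes one entry of the base label vector.
    base_labels = [clf(feature_vector) for clf in base_classifiers]
    # The meta classifier maps the base label vector to the final label.
    return meta_classifier(base_labels), base_labels


# Hypothetical stand-ins for five trained base classifiers.
base_classifiers = [
    lambda x: int(x[0] > 5),
    lambda x: int(x[1] > 0),
    lambda x: int(x[2] > 3),
    lambda x: int(sum(x) > 10),
    lambda x: int(max(x) > 6),
]


def meta_classifier(labels):
    """Placeholder for a trained meta classifier: here a simple vote count."""
    return int(sum(labels) >= 3)


final_label, base_labels = detect([7, 0, 4], base_classifiers, meta_classifier)
print(base_labels, final_label)
```

At detection time no true label is available, so the base label vector holds only the N base classifier outputs.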

In one embodiment of the present invention, the various steps of the present invention are described in more detail.

A large number of executable program samples are collected from various network security vendors and virus websites, comprising 14800 malicious samples and 14800 benign samples. The samples are run in a sandbox, and the run logs are extracted and analyzed to obtain an API call information table. The malicious and benign samples are each divided randomly according to the proportion of 6.

TABLE 1

After the API samples are obtained, feature extraction is performed on the API sequence samples, each sample corresponding to a 1 × q dimensional feature vector, to form a feature vector data set, according to the formulas:

$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$

$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$

The N base classification models are trained with the training set; N = 11 is chosen here. The base classification models are decision tree (DT), logistic regression (LR), K-nearest neighbor (KNN), random forest (RF), gradient boosting (GB), support vector machine (SVM), multi-layer perceptron (MLP), recurrent neural network (RNN), bidirectional recurrent neural network (BiRNN), long short-term memory network (LSTM) and bidirectional long short-term memory network (BiLSTM). Each classification model is set to output either a Probability or a Predict value as its label. Probability is the likelihood that the sample is malicious and ranges from 0 to 1: the closer the value is to 1, the more the base classifier tends to judge the sample as malicious, and the closer it is to 0, the more it tends to judge the sample as benign. The Predict value is the prediction of the base classifier and is either 0 or 1: when it is 1, the base classifier judges the sample to be malicious, and when it is 0, benign.

Subsequently, the validation set is used to evaluate the 11 trained base classifiers and obtain their classification performance (such as Accuracy and Recall); the performance of each base classifier is shown in Table 2. The base label vector of each sample is also obtained, according to the formula:

$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, P_{i,4}, P_{i,5}, P_{i,6}, P_{i,7}, P_{i,8}, P_{i,9}, P_{i,10}, P_{i,11}, L_i]$$

where $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, \dots, P_{i,11}$ are the P values output for the i-th feature vector in the validation set by the 11 base classifiers DT, GB, LR, KNN, RF, SVM, MLP, RNN, BiRNN, LSTM and BiLSTM, respectively; $L_i$ is the true label of the i-th feature vector sample; V is the base label vector set obtained from the validation set; and l is the number of samples in the validation set.

TABLE 2 Base classifier performance

Base classifier	Accuracy (%)	Precision (%)	Recall (%)	F1 score (%)
LR	87.06	85.23	89.20	87.17
KNN	86.91	87.18	86.41	86.79
DT	81.96	80.59	83.61	82.07
GBDT	88.75	89.25	88.05	88.64
RF	87.50	88.44	86.25	87.33
SVM	83.40	80.90	86.69	83.69
MLP	88.90	88.41	89.17	88.79
RNN	92.29	91.57	93.17	92.36
BiRNN	92.21	91.77	92.62	92.19
LSTM	91.27	91.35	91.24	91.30
BiLSTM	93.54	93.42	93.81	93.62
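For reference, the four metrics reported in Table 2 follow the standard definitions from confusion matrix counts, and the F1 score is the harmonic mean of precision and recall. The sketch below shows these definitions; the function names and the confusion counts are illustrative, while the precision and recall inputs to `f1_from` are taken from the LR and KNN rows of Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion counts (malicious = positive)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)


# Illustrative confusion counts, not taken from the experiments.
print(classification_metrics(tp=8, fp=2, tn=7, fn=3))

# Precision and recall of the LR and KNN rows of Table 2.
print(round(f1_from(85.23, 89.20), 2))
print(round(f1_from(87.18, 86.41), 2))
```

For the LR and KNN rows this reproduces the tabulated F1 scores of 87.17 and 86.79. Because the tabulated precision and recall are themselves rounded, recomputing F1 from them can differ from a tabulated value in the last digit.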

The base label vector set is then used as the training set of the meta-classification model to train the meta classifier; after training, the base classifiers and the meta classifier are combined to obtain the complete ensemble learner for malicious sample API call sequence detection.

Finally, the whole ensemble learning classifier is tested with a test set; the results are shown in Table 3. Three base classifier combinations are tested: (1) all 11 base classifiers mentioned above; (2) the 9 base classifiers remaining after excluding the two worst performers, DT and SVM; (3) only the 5 deep learning classifiers, namely MLP, RNN, BiRNN, LSTM and BiLSTM. There are two kinds of base label, Probability and Predict value, and two kinds of meta classifier, KNN and GB.
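The options described above can be enumerated mechanically. The sketch below assumes that every base classifier combination is crossed with every base label type and every meta classifier, which gives 3 × 2 × 2 = 12 configurations; whether the experiments covered exactly this full grid is an assumption, and the dictionary keys are illustrative names.

```python
from itertools import product

ALL_BASE = ["DT", "LR", "KNN", "RF", "GB", "SVM", "MLP", "RNN", "BiRNN", "LSTM", "BiLSTM"]

# The three base classifier combinations described in the text.
base_classifier_sets = {
    "all 11": ALL_BASE,
    "9 (without DT and SVM)": [name for name in ALL_BASE if name not in ("DT", "SVM")],
    "5 deep learning": ["MLP", "RNN", "BiRNN", "LSTM", "BiLSTM"],
}
base_label_types = ["Probability", "Predict"]
meta_classifiers = ["KNN", "GB"]

# Full grid of options (assumed; the text does not state that every cell was run).
configurations = list(product(base_classifier_sets, base_label_types, meta_classifiers))

for set_name, label_type, meta_name in configurations:
    print(f"{set_name:24s} {label_type:12s} {meta_name}")
```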

TABLE 3 ensemble learning classifier Performance under different options

The experimental environment used is shown in the following table:

TABLE 4

Item	Version
Operating system	Windows 10
CPU	Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
Deep learning framework	TensorFlow
Memory	16 GB

The results show that the ensemble learning classifier, under any of the selected compositions, outperforms a single classifier and is more robust. When the base label is the Predict value, the base classifiers are the 5 deep learning algorithms and the meta classifier is GB, Accuracy reaches its highest value of 97.54% and Recall its highest value of 97.85%, higher than the 93.54% and 93.42% of BiLSTM and far higher than the 81.96% and 80.59% of DT. The ensemble learning classifier thus markedly improves the detection of malicious API sequences, and its resistance to perturbation and overfitting is greatly enhanced.

The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims

1. A malicious program API calling sequence detection method based on an ensemble learner is characterized by comprising the following steps:

(1) Acquiring a malicious API sequence sample and a benign API sequence sample;

(2) Extracting feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, dividing the data set into a training set and a validation set according to a proportion, and training N base classification models with the training set to obtain the trained base classifiers;

(3) Classifying the validation set data with the trained base classifiers, outputting a base label vector for the feature vector of each sample to form a base label vector set, and training a meta-classification model with the base label vector set to obtain the trained meta classifier; combining the base classifiers and the meta classifier to obtain the complete ensemble learner;

2. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (1) is specifically:

running the malicious samples and the benign samples in a Cuckoo sandbox to obtain the malicious API sequence samples and the benign API sequence samples, respectively.

3. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (2) is specifically:

(2.1) extracting the features of the malicious API sequence samples and the benign API sequence samples, wherein each API sequence sample corresponds to a 1 × q dimensional feature vector, to form a feature vector data set, and dividing the feature vector data set into a training set and a validation set according to a certain proportion, according to the formulas:

$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$

$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$

$$\mathrm{API\_sequence} = \{\mathrm{Mal}'_1, \dots, \mathrm{Mal}'_m, \mathrm{Benign}'_1, \dots, \mathrm{Benign}'_n\}$$

where Feature_extraction() is the feature extraction algorithm; $\mathrm{Mal}_i$ is the i-th malicious API sequence sample; $\mathrm{Mal}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Mal}_i$ after feature extraction; $\mathrm{Benign}_i$ is the i-th benign API sequence sample; $\mathrm{Benign}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Benign}_i$ after feature extraction; API_sequence is the set of all feature vectors of the malicious and benign API sequences; m is the number of malicious samples; and n is the number of benign samples;

(2.2) training the N base classification models with the training set to obtain the trained base classifiers; each base classifier is set to output either a Probability or a Predict value as its label, wherein Probability is the likelihood that the sample is malicious and ranges from 0 to 1, the closer the value is to 1 the more the base classifier tends to judge the sample as malicious, and the closer it is to 0 the more it tends to judge the sample as benign; the Predict value is the prediction of the base classifier and is either 0 or 1, a value of 1 meaning that the base classifier judges the sample to be malicious and a value of 0 meaning benign.

4. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (3) is specifically:

(3.1) after the training of the N base classification models in (2.2) is finished, N corresponding base classifiers are obtained; the validation set is then used as input, and the base classifiers produce the base label vector set of the validation set, according to the formulas:

$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, \dots, P_{i,N}, L_i]$$

$$V = \{V_1, V_2, \dots, V_l\}$$

where $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, P_{i,3}, \dots, P_{i,N}$ are the output values of the N base classifiers for the i-th feature vector in the validation set; $L_i$ is the true label of the i-th feature vector sample; V is the base label vector set obtained from the validation set; and l is the number of samples in the validation set;

(3.2) taking the base label vector set as the training set of the meta-classification model and training the meta classifier; after training, the base classifiers and the meta classifier are combined to obtain the complete ensemble learner.

5. The ensemble learner-based malware API call sequence detection method of claim 3, wherein the ratio of training set to validation set in step (2.1) is 6.

6. The ensemble learner-based malware API call sequence detection method of claim 1, wherein the number N of base classification models in step (2) is greater than or equal to 5.

7. The ensemble learner-based malicious program API call sequence detection method according to claim 2, wherein the malicious samples and the benign samples are both executable program samples with known labels.