CN115545091A - Integrated learner-based malicious program API (application program interface) calling sequence detection method - Google Patents

Publication number
CN115545091A
CN115545091A CN202211015410.4A CN202211015410A CN115545091A CN 115545091 A CN115545091 A CN 115545091A CN 202211015410 A CN202211015410 A CN 202211015410A CN 115545091 A CN115545091 A CN 115545091A
Authority
CN
China
Prior art keywords
base
api
classifier
sample
benign
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211015410.4A
Other languages
Chinese (zh)
Inventor
杨强
汪金明
杨涛
阮伟
王文海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-08-24
Filing date
2022-08-24
Publication date
2022-12-30
Application filed by Zhejiang University ZJU
Priority to CN202211015410.4A
Publication of CN115545091A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Abstract

The invention provides a malicious program API call sequence detection method based on an ensemble learner. The method uses a plurality of base classifiers as the first layer of the ensemble learner and labels the training set samples for supervised learning. After model training is finished, a base classifier group is obtained, and each base classifier in the group labels the verification set samples, yielding a base label vector set in which each vector consists of N base labels. The base label vector set is input as a training set into a meta-classification model to train a meta classifier. In practical application, after the base classifier group and the meta classifier have been trained, the processed API call sequence is input to the base classifier group to obtain a base label vector, the base label vector is input to the meta classifier, and the meta classifier gives the final label of the API sequence.

Description

Integrated learner-based malicious program API (application programming interface) calling sequence detection method
Technical Field
The invention relates to a method for detecting malicious program API (application programming interface) call sequences, and in particular to such a method based on an ensemble learner.
Background
In network security protection, defending against malicious software attacks is one of the key problems that urgently needs to be solved. At present, common antivirus software mostly adopts static detection methods, such as detection based on virus-library signatures and heuristic detection. These traditional methods mostly extract static features of the malware, including malicious byte sequences, matched character strings and code sequences, so they detect known viruses well. As malware development techniques advance, attackers continually improve their malware, for example by obfuscating the features of the malicious code or changing the structure of its runtime behavior, and traditional detection strategies have little success against such modified malware or new viruses. Dynamic behavior detection emerged to address these limitations of static detection. Dynamic analysis places the malware in a secure virtual environment called a sandbox and then collects its behavioral data, such as registry activity, logs, network traffic and API call sequences. Although dynamic analysis consumes more resources than static analysis, it does not require reverse engineering such as decompilation, and it analyzes the dynamic features obtained when the program actually runs, so it is more effective and more robust against modified viruses and even new viruses.
Malware detectors based on dynamic analysis focus primarily on the API behavior information produced when software interacts with the operating system at runtime. Intelligent malware detection based on API behavior features can identify unknown malicious files whose behavior resembles that of known malicious files, and it is not affected by techniques such as polymorphism, code obfuscation, encryption and packing. Monitoring an executable file through the sequence of API calls between its process and the operating system is one of the best strategies, since API calls are considered the most important behavioral difference between malicious and non-malicious processes.
Many current studies take API calls as the main feature and deep learning algorithms as the main classification tool for detecting malicious samples. However, a detection system based on a single classifier is relatively unstable, because a single algorithm is prone to overfitting and is easily disturbed by adversarial attacks. How to use the multi-classifier nature of ensemble learning to improve the performance of the API sequence detector and the robustness of the whole detection system has therefore become a key problem in the field.
Disclosure of Invention
The invention aims to improve the detection performance and robustness of malware detectors based on API behavior detection in the field of network security protection. To address the poor performance and poor robustness of existing detectors of this kind, an optimized method is provided: a malicious sample API call sequence detection method based on an ensemble learner, which offers guidance for improving the performance and robustness of API-behavior-based malware detectors.
The purpose of the invention can be achieved by the following technical scheme:
A malicious program API call sequence detection method based on an ensemble learner comprises the following steps:
(1) Obtain malicious API sequence samples and benign API sequence samples.
(2) Extract the feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, divide the data set into a training set and a verification set according to a proportion, and train N base classification models using the training set to obtain the trained base classifiers.
(3) Classify the data of the verification set using the trained base classifiers, output a base label vector for the feature vector of each sample to form a base label vector set, and train the meta-classification model using the base label vector set to obtain the trained meta classifier.
(4) Put the executable program to be detected into a sandbox to run so as to obtain its API call sequence, extract the feature vector of that API call sequence, and input the feature vector into the ensemble learner for detection.
As a preferred scheme of the invention, the number N of the base classification models is more than or equal to 5.
As a preferred scheme of the invention, the proportion of the training set to the validation set is 6.
As a preferred embodiment of the present invention, the malicious and benign samples are both executable program samples with known labels, where a label is the assessment of whether a sample is malicious or benign.
Further, step (1) is specifically as follows: the malicious samples and the benign samples are put into a Cuckoo sandbox to run, so as to obtain the malicious API sequence samples and the benign API sequence samples respectively.
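Turning a sandbox report into an API sequence sample could be sketched as follows. The sketch assumes a Cuckoo-style JSON report in which each monitored process lists its calls under behavior, processes, calls, each call carrying an "api" field. That layout, the helper name extract_api_sequence and the sample report are assumptions for illustration and are not stated in the patent.

```python
import json


def extract_api_sequence(report):
    """Return the API names called by all processes, in report order.

    Assumed layout: report["behavior"]["processes"][k]["calls"][j]["api"].
    """
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:
                sequence.append(name)
    return sequence


# Tiny hand-written report standing in for a real sandbox log (hypothetical data).
sample_report = json.loads("""
{"behavior": {"processes": [
  {"process_name": "sample.exe",
   "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}, {"api": "RegSetValueExA"}]},
  {"process_name": "child.exe",
   "calls": [{"api": "CreateProcessInternalW"}]}
]}}
""")

api_sequence = extract_api_sequence(sample_report)
print(api_sequence)
```

A report with no behavior section simply yields an empty sequence, which lets a caller skip samples that failed to run.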
Further, step (2) specifically comprises:
(2.1) Features are extracted from the malicious API sequence samples and the benign API sequence samples. Each API sequence sample corresponds to a 1 × q dimensional feature vector, and together these form a feature vector data set that is divided into a training set and a verification set. The formula is as follows:
$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$
$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$
$$\mathrm{API\_sequence} = \{\mathrm{Mal}'_1, \ldots, \mathrm{Mal}'_m, \mathrm{Benign}'_1, \ldots, \mathrm{Benign}'_n\}$$
In the formula, Feature_extraction() is a feature extraction algorithm; $\mathrm{Mal}_i$ is the i-th malicious API sequence sample; $\mathrm{Mal}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Mal}_i$ after feature extraction; $\mathrm{Benign}_i$ is the i-th benign API sequence sample; $\mathrm{Benign}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Benign}_i$ after feature extraction; API_sequence is the set of all feature vectors of the malicious and benign API sequences; m is the number of malicious samples and n is the number of benign samples.
(2.2) The N base classification models are trained using the training set. Each base classification model is configured to output one of two kinds of label, a Probability value or a Predict value. Probability is the likelihood that the sample is malicious and ranges from 0 to 1: the closer the value is to 1, the more the base classifier tends to judge the sample as malicious, and the closer it is to 0, the more it tends to judge the sample as benign. Predict is the predicted class of the base classifier and is either 0 or 1: a value of 1 means the classifier judges the sample to be malicious, and a value of 0 means it judges the sample to be benign.
Further, step (3) is specifically as follows:
(3.1) After the N base classification models in (2.2) have been trained, the N corresponding base classifiers are obtained. The verification set is then used as input, and the base classifiers produce the base label vector set of the verification set. The formula is as follows:
$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, \ldots, P_{i,N}, L_i]$$
$$V = \{V_1, V_2, \ldots, V_l\}$$
In the formula, $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, P_{i,3}, \ldots, P_{i,N}$ are the output values of the N base classifiers for the i-th feature vector in the verification set; $L_i$ is the real label of the i-th feature vector sample; V is the base label vector set obtained from the verification set, and l is the number of samples in the verification set.
(3.2) The base label vector set is used as the training set of the meta-classification model to train the meta classifier. After training is finished, the base classifiers and the meta classifier are combined to obtain the complete ensemble learner.
The beneficial effects of the invention are as follows:
(1) The invention establishes a novel ensemble learning classification model comprising a plurality of base classifiers and a meta classifier. The base classifiers are trained using the training set; the verification set data is then used as input, and the outputs of the base classifiers, namely the base label vectors, serve as the training data of the meta classifier, finally yielding a complete ensemble learning classification model. If only a single classifier were adopted, the classification performance would not be ideal and resistance to adversarial interference would be poor. The invention therefore adopts N (N ≥ 5) different base classifiers and one meta classifier as the key nodes of the classification model, which improves both the performance of the classifier and the robustness and adversarial resistance of the model.
(2) Compared with common ensemble learning methods, the method differs in the data used to train the meta classifier. Common ensemble learning methods use the labels output by the base classifiers during their own training, whereas in this method the training data of the meta classifier comes from the labels that the trained base classifiers output on the verification set samples. Common ensemble learning methods are prone to overfitting, and the approach adopted by the invention helps keep the model from overfitting.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings.
The invention discloses a malicious program API (application programming interface) call sequence detection method based on an ensemble learner, which comprises the following steps:
(1) Obtain malicious API sequence samples and benign API sequence samples.
(2) Extract the feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, divide the data set into a training set and a verification set according to a proportion, and train N base classification models using the training set.
(3) Classify the data of the verification set using the trained base classifiers, output a base label vector for the feature vector of each sample to form a base label vector set, and train the meta-classification model using the base label vector set to obtain the trained meta classifier.
(4) Put the executable program to be detected into a sandbox to run so as to obtain its API call sequence, extract the feature vector of that API call sequence, and input the feature vector into the ensemble learner for detection.
In one embodiment of the present invention, the various steps of the present invention are described in more detail.
A large number of executable program samples are collected from various network security vendors and virus websites: 14800 malicious samples and 14800 benign samples. The samples are put into a sandbox to run, and the run logs are extracted and parsed to obtain an API call information table. After dividing both malicious and benign samples randomly according to the proportion of 6.
TABLE 1
Figure BDA0003812344570000051
Figure BDA0003812344570000061
After the API samples are obtained, feature extraction is performed on the API sequence samples. Each sample corresponds to a 1 × q dimensional feature vector, and together these form a feature vector data set. The formula is as follows:
$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$
$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$
$$\mathrm{API\_sequence} = \{\mathrm{Mal}'_1, \ldots, \mathrm{Mal}'_m, \mathrm{Benign}'_1, \ldots, \mathrm{Benign}'_n\}$$
In the formula, Feature_extraction() is a feature extraction algorithm; $\mathrm{Mal}_i$ is the i-th malicious API sequence sample; $\mathrm{Mal}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Mal}_i$ after feature extraction; $\mathrm{Benign}_i$ is the i-th benign API sequence sample; $\mathrm{Benign}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Benign}_i$ after feature extraction; API_sequence is the set of all feature vectors of the malicious and benign API sequences; m is the number of malicious samples and n is the number of benign samples.
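The patent does not say which algorithm Feature_extraction() uses. As a minimal sketch, assume it counts how often each API in a fixed vocabulary of q names appears in the sequence. The vocabulary, the sample sequences and the 1/0 labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def feature_extraction(api_sequence, vocabulary):
    """Map an API call sequence to a 1 x q vector of call counts (q = len(vocabulary))."""
    counts = Counter(api_sequence)
    return [counts[name] for name in vocabulary]


# Hypothetical vocabulary (q = 4) and API sequence samples.
vocabulary = ["NtCreateFile", "NtWriteFile", "RegSetValueExA", "CreateRemoteThread"]
mal_samples = [["NtCreateFile", "CreateRemoteThread", "CreateRemoteThread"]]  # m = 1
benign_samples = [["NtCreateFile", "NtWriteFile"], ["RegSetValueExA"]]        # n = 2

mal_features = [feature_extraction(s, vocabulary) for s in mal_samples]        # Mal'_i
benign_features = [feature_extraction(s, vocabulary) for s in benign_samples]  # Benign'_i

# API_sequence: all m + n feature vectors, paired here with a label
# (1 = malicious, 0 = benign) so the set can later be split and used for training.
api_sequence_set = [(v, 1) for v in mal_features] + [(v, 0) for v in benign_features]
print(api_sequence_set)
```

Count vectors discard call order, so an order-aware encoding would be needed for sequence models such as the RNN and LSTM variants.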
N base classification models are trained using the training set, with N = 11 chosen in this embodiment. The base classification models are decision tree (DT), logistic regression (LR), K-nearest neighbor (KNN), random forest (RF), gradient boosting (GB), support vector machine (SVM), multi-layer perceptron (MLP), recurrent neural network (RNN), bidirectional recurrent neural network (BiRNN), long short-term memory network (LSTM) and bidirectional long short-term memory network (BiLSTM). Each classification model is configured to output one of two kinds of label, a Probability value or a Predict value. Probability is the likelihood that a sample is malicious and ranges from 0 to 1: the closer the value is to 1, the more the base classifier tends to judge the sample as malicious, and the closer it is to 0, the more it tends to judge the sample as benign. Predict is the predicted class of the base classifier and is either 0 or 1: a value of 1 means the base classifier judges the sample to be malicious, and a value of 0 means it judges the sample to be benign.
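The relation between the two label types can be illustrated with a short sketch. The 0.5 decision threshold is an assumption, since the patent does not state how a Predict value is derived from a Probability, and the helper name to_predict is hypothetical.

```python
def to_predict(probability, threshold=0.5):
    """Convert a Probability output (0..1) into a Predict output (1 = malicious, 0 = benign)."""
    return 1 if probability >= threshold else 0


probabilities = [0.93, 0.08, 0.51, 0.49]  # illustrative base classifier outputs
predicts = [to_predict(p) for p in probabilities]
print(predicts)  # [1, 0, 1, 0]
```

A Probability label keeps how confident the base classifier was, while a Predict label keeps only its decision.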
Subsequently, the verification set is used to evaluate the 11 trained base classifiers and obtain their classification performance (Accuracy, Recall, etc.), which is shown in Table 2, and to obtain the base label vector set. The formula is as follows:
$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, P_{i,4}, P_{i,5}, P_{i,6}, P_{i,7}, P_{i,8}, P_{i,9}, P_{i,10}, P_{i,11}, L_i]$$
$$V = \{V_1, V_2, \ldots, V_l\}$$
In the formula, $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, \ldots, P_{i,11}$ are the P values output by the 11 base classifiers DT, GB, LR, KNN, RF, SVM, MLP, RNN, BiRNN, LSTM and BiLSTM, respectively, for the i-th feature vector in the verification set; $L_i$ is the real label of the i-th feature vector sample; V is the base label vector set obtained from the verification set, and l is the number of samples in the verification set.
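Building the base label vector set can be sketched as follows. The three lambda functions are toy stand-ins for trained base classifiers (N = 3 only for brevity, whereas the embodiment uses 11), and the two validation samples are hypothetical.

```python
def build_base_label_vectors(base_classifiers, validation_set):
    """Return V = [V_1, ..., V_l], where V_i = [P_i1, ..., P_iN, L_i]."""
    vectors = []
    for features, label in validation_set:
        outputs = [clf(features) for clf in base_classifiers]  # P_i1 ... P_iN
        vectors.append(outputs + [label])                      # append real label L_i
    return vectors


# Toy stand-ins for trained base classifiers; each maps a feature vector
# to a Probability-style output between 0 and 1.
base_classifiers = [
    lambda x: 1.0 if x[0] > 2 else 0.0,
    lambda x: min(1.0, x[1] / 4),
    lambda x: 0.5,
]

# (feature vector, real label) pairs held out from base classifier training.
validation_set = [([3, 4], 1), ([1, 1], 0)]

V = build_base_label_vectors(base_classifiers, validation_set)
print(V)  # [[1.0, 1.0, 0.5, 1], [0.0, 0.25, 0.5, 0]]
```

Because the vectors come from samples the base classifiers never saw during training, the meta classifier learns from their held-out behavior rather than from memorized training labels.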
TABLE 2 Base classifier performance
Classifier  Accuracy (%)  Precision (%)  Recall (%)  F1 score (%)
LR          87.06         85.23          89.20       87.17
KNN         86.91         87.18          86.41       86.79
DT          81.96         80.59          83.61       82.07
GBDT        88.75         89.25          88.05       88.64
RF          87.50         88.44          86.25       87.33
SVM         83.40         80.90          86.69       83.69
MLP         88.90         88.41          89.17       88.79
RNN         92.29         91.57          93.17       92.36
BiRNN       92.21         91.77          92.62       92.19
LSTM        91.27         91.35          91.24       91.30
BiLSTM      93.54         93.42          93.81       93.62
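As a consistency check on Table 2, the F1 score can be recomputed from each row's precision and recall with the standard formula F1 = 2PR / (P + R). The sketch below copies the precision, recall and reported F1 values from the table. Small differences are expected because the table entries are already rounded to two decimals.

```python
# (precision %, recall %, reported F1 %) copied from Table 2.
table2 = {
    "LR": (85.23, 89.20, 87.17),
    "KNN": (87.18, 86.41, 86.79),
    "DT": (80.59, 83.61, 82.07),
    "GBDT": (89.25, 88.05, 88.64),
    "RF": (88.44, 86.25, 87.33),
    "SVM": (80.90, 86.69, 83.69),
    "MLP": (88.41, 89.17, 88.79),
    "RNN": (91.57, 93.17, 92.36),
    "BiRNN": (91.77, 92.62, 92.19),
    "LSTM": (91.35, 91.24, 91.30),
    "BiLSTM": (93.42, 93.81, 93.62),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, reported) in table2.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```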
The base label vector set is used as the training set of the meta-classification model to train the meta classifier. After training is finished, the base classifiers and the meta classifier are combined to obtain the complete ensemble learner for detecting the API call sequences of malicious samples.
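The second layer and the combined detector can be sketched with a tiny k-nearest-neighbor meta classifier, KNN being one of the two meta classifier types named in this embodiment. The value k = 3, the Euclidean distance, the two stand-in base classifiers and the six base label vectors are all assumptions made for illustration.

```python
import math


def knn_meta_predict(base_label_vectors, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    Each training vector is [P_1, ..., P_N, L]; `query` is [P_1, ..., P_N].
    """
    by_distance = sorted(base_label_vectors, key=lambda v: math.dist(v[:-1], query))
    votes = [v[-1] for v in by_distance[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0


def ensemble_detect(base_classifiers, base_label_vectors, features, k=3):
    """Layer 1: base classifiers build the base label vector. Layer 2: meta classifier decides."""
    query = [clf(features) for clf in base_classifiers]
    return knn_meta_predict(base_label_vectors, query, k)


# Hypothetical base label vector set from a verification set (N = 2 for brevity).
V = [
    [0.9, 0.8, 1], [0.8, 0.7, 1], [0.7, 0.9, 1],
    [0.1, 0.2, 0], [0.2, 0.1, 0], [0.3, 0.2, 0],
]

# Stand-ins for trained base classifiers: here each one just reads a
# precomputed score out of the feature vector.
base_classifiers = [lambda x: x[0], lambda x: x[1]]

print(ensemble_detect(base_classifiers, V, [0.85, 0.75]))  # 1 (malicious)
print(ensemble_detect(base_classifiers, V, [0.15, 0.15]))  # 0 (benign)
```

At detection time the meta classifier never sees the raw feature vector, only the outputs of the base classifiers.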
Finally, the whole ensemble learning classifier is tested using a test set, and the test results are shown in Table 3. Three base classifier combinations are considered: (1) all 11 base classifiers mentioned above; (2) the 9 base classifiers that remain after excluding the two with the worst performance, DT and SVM; (3) only the 5 deep learning classifiers, namely MLP, RNN, BiRNN, LSTM and BiLSTM. There are two kinds of base label, Probability and Predict value, and two kinds of meta classifier, KNN and GB.
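The options listed above span a small configuration grid, which can be enumerated as follows. The sketch assumes every combination of base classifier group, base label type and meta classifier was evaluated. The group names are hypothetical labels, and the code only enumerates configurations without reproducing any results from Table 3.

```python
from itertools import product

all_11 = ["DT", "LR", "KNN", "RF", "GB", "SVM", "MLP", "RNN", "BiRNN", "LSTM", "BiLSTM"]

base_groups = {
    "all_11": all_11,
    "without_DT_SVM": [name for name in all_11 if name not in ("DT", "SVM")],
    "deep_learning_5": ["MLP", "RNN", "BiRNN", "LSTM", "BiLSTM"],
}
label_types = ["Probability", "Predict"]
meta_classifiers = ["KNN", "GB"]

# Every (base group, label type, meta classifier) combination: 3 x 2 x 2.
configurations = list(product(base_groups, label_types, meta_classifiers))

for group, label_type, meta in configurations:
    print(f"{group}: {len(base_groups[group])} base classifiers, {label_type} labels, {meta} meta classifier")
```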
TABLE 3 ensemble learning classifier Performance under different options
Figure BDA0003812344570000081
The experimental environment used is shown in the following table:
TABLE 4
Item                     Version
Operating system         Windows 10
CPU                      Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
Deep learning framework  TensorFlow
Memory                   16 GB
The results show that the ensemble learning classifier outperforms any single classifier and is more robust, whichever composition is selected. When the base label is the Predict value, the base classifiers are the 5 deep learning algorithms and the meta classifier is GB, Accuracy reaches its highest value of 97.54% and Recall reaches its highest value of 97.85%, exceeding BiLSTM (accuracy 93.54%, recall 93.81% in Table 2) and far exceeding DT (accuracy 81.96%, recall 83.61% in Table 2). The detection capability of the ensemble learning classifier on malicious API sequences is clearly improved, and its resistance to perturbation and to overfitting is greatly enhanced.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (7)

1. A malicious program API call sequence detection method based on an ensemble learner, characterized by comprising the following steps:
(1) acquiring malicious API sequence samples and benign API sequence samples;
(2) extracting feature vectors of the malicious API sequence samples and the benign API sequence samples to form a data set, dividing the data set into a training set and a verification set according to a proportion, and training N base classification models using the training set to obtain trained base classifiers;
(3) classifying the verification set data using the trained base classifiers, outputting a base label vector for the feature vector of each sample to form a base label vector set, and training a meta-classification model using the base label vector set to obtain a trained meta classifier; combining the base classifiers and the meta classifier to obtain a complete ensemble learner;
(4) putting the executable program to be detected into a sandbox to run so as to obtain its API call sequence, extracting the feature vector of that API call sequence, and inputting the feature vector into the ensemble learner for detection.
2. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (1) is specifically:
putting the malicious samples and the benign samples into a Cuckoo sandbox to run so as to obtain the malicious API sequence samples and the benign API sequence samples respectively.
3. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (2) is specifically:
(2.1) extracting features from the malicious API sequence samples and the benign API sequence samples, each API sequence sample corresponding to a 1 × q dimensional feature vector, to form a feature vector data set that is divided into a training set and a verification set according to a certain proportion, the formula being as follows:
$$\mathrm{Mal}'_i = \mathrm{Feature\_extraction}(\mathrm{Mal}_i)$$
$$\mathrm{Benign}'_i = \mathrm{Feature\_extraction}(\mathrm{Benign}_i)$$
$$\mathrm{API\_sequence} = \{\mathrm{Mal}'_1, \ldots, \mathrm{Mal}'_m, \mathrm{Benign}'_1, \ldots, \mathrm{Benign}'_n\}$$
in the formula, Feature_extraction() is a feature extraction algorithm; $\mathrm{Mal}_i$ is the i-th malicious API sequence sample; $\mathrm{Mal}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Mal}_i$ after feature extraction; $\mathrm{Benign}_i$ is the i-th benign API sequence sample; $\mathrm{Benign}'_i$ is the 1 × q dimensional feature vector obtained from $\mathrm{Benign}_i$ after feature extraction; API_sequence is the set of all feature vectors of the malicious and benign API sequences; m is the number of malicious samples and n is the number of benign samples;
(2.2) training the N base classification models using the training set to obtain the trained base classifiers; each base classifier is configured to output one of two kinds of label, a Probability value or a Predict value, wherein Probability is the likelihood that the sample is malicious and ranges from 0 to 1, the closer the value is to 1 the more the base classifier tends to judge the sample as malicious, and the closer it is to 0 the more it tends to judge the sample as benign; Predict is the predicted class of the base classifier and is either 0 or 1, a value of 1 meaning the base classifier judges the sample to be malicious and a value of 0 meaning it judges the sample to be benign.
4. The ensemble learner-based malicious program API call sequence detection method according to claim 1, wherein step (3) is specifically:
(3.1) after the N base classification models in (2.2) have been trained, obtaining the N corresponding base classifiers; then using the verification set as input so that the base classifiers produce the base label vector set of the verification set, the formula being as follows:
$$V_i = [P_{i,1}, P_{i,2}, P_{i,3}, \ldots, P_{i,N}, L_i]$$
$$V = \{V_1, V_2, \ldots, V_l\}$$
in the formula, $V_i$ is the base label vector obtained for the i-th feature vector; $P_{i,1}, P_{i,2}, P_{i,3}, \ldots, P_{i,N}$ are the output values of the N base classifiers for the i-th feature vector in the verification set; $L_i$ is the real label of the i-th feature vector sample; V is the base label vector set obtained from the verification set, and l is the number of samples in the verification set;
(3.2) using the base label vector set as the training set of the meta-classification model, training the meta classifier, and combining the base classifiers and the meta classifier after training to obtain the complete ensemble learner.
5. The ensemble learner-based malware API call sequence detection method of claim 3, wherein the ratio of training set to validation set in step (2.1) is 6.
6. The ensemble learner-based malware API call sequence detection method of claim 1, wherein the number N of base classification models in step (2) is greater than or equal to 5.
7. The ensemble learner based malicious program API call sequence detection method according to claim 2, wherein the malicious samples and the benign samples are both executable program samples of known tags.
CN202211015410.4A 2022-08-24 2022-08-24 Integrated learner-based malicious program API (application program interface) calling sequence detection method Pending CN115545091A (en)

Publications (1)

Publication Number Publication Date
CN115545091A true CN115545091A (en) 2022-12-30

Family

ID=84726357

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination