CN113709134B - Malicious software detection method and system based on N-gram and machine learning - Google Patents

Malicious software detection method and system based on N-gram and machine learning

Publication number
CN113709134B
Authority
CN
China
Prior art keywords
gram
machine learning
sample
malicious software
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110972755.8A
Other languages
Chinese (zh)
Other versions
CN113709134A (en)
Inventor
产院东
郭乔进
胡杰
梁中岩
刘蔚棣
吴其华
杨冲昊
汪义飞
高沙沙
杨航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN202110972755.8A
Publication of CN113709134A
Application granted
Publication of CN113709134B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a malicious software detection method and system based on N-gram and machine learning. The sample file is subjected to dynamic behavior analysis using the artificial-intelligence-based sandbox SNDBOX to obtain key information of the sample file, including API calls, parameter information and the like, and this information is then converted into a feature set using an N-gram algorithm. TF-IDF is then used to reduce the N-gram feature set; the reduced feature set contains only important features, which improves the efficiency of subsequently training the machine learning classifiers. Finally, the feature set is converted into binary feature vectors, which are passed to a plurality of machine learning classifiers, including naive Bayes, decision tree, random forest, logistic regression and the like, for training and testing. The trained classifiers can assist security analysts in detecting malicious software.

Description

Malicious software detection method and system based on N-gram and machine learning
Technical Field
The invention belongs to the field of network security, and particularly relates to a malicious software detection method and system based on N-gram and machine learning.
Background
Malware is any software that deliberately disrupts the normal functioning of a computer or network. Malware may cause damage after being implanted in, or otherwise entering, a target computer, and may take the form of executable code, scripts, active content, and other software. The behavior of malware includes stealing sensitive information, gaining unauthorized access to private systems, spying, and so on. Existing malware has a wide attack range, from individuals to the IT systems of large organizations and national infrastructure such as nuclear power plants and water supply systems. As malware developers continue to improve detection evasion techniques, existing malware variants continue to evolve. A recent SonicWall cyber threat report indicates that SonicWall services discovered approximately 440,000 malware variants in 2019, an average of more than 1,200 per day. Furthermore, the latest security report from PandaLabs indicates that there were more than 2 million new malware binaries in 2019.
Malware detection techniques fall into two categories: signature-based detection and anomaly-based detection. Conventional antivirus software determines whether software is malicious by comparing features of the given software against a database of known malware signatures. Signature-based detection relies on a library of known signatures and is therefore limited in its ability to detect unknown malware. Anomaly-based detection algorithms can detect unknown malware. Related studies have proposed applying data mining to malware detection: normal programs are learned to build rule sets, and the rule sets are then applied to detect malware. In addition, there are anomaly detection algorithms based on machine learning. One example is the PAYL tool, which calculates the expected payload for each service (port) of the system; during the learning phase, PAYL learns normal program behavior to build a centroid model for each port. In the detection phase, the payload of the program under test is compared with the centroid model by computing the Mahalanobis distance between them, and if the payload is far from the centroid model it is considered malicious. Many anomaly-based detection algorithms exist, but most of them struggle to meet the precision and performance requirements of practical applications.
Disclosure of Invention
Purpose of the invention: in view of the deficiencies of the prior art, the invention provides a malicious software detection method and system based on N-gram and machine learning.
In order to solve the technical problems, in a first aspect, a method for detecting malicious software based on N-gram and machine learning is disclosed, comprising the following steps:
step 1, collecting a malicious software sample and an application software sample, and dynamically analyzing the sample to obtain a dynamic analysis file;
step 2, obtaining sample key information based on a dynamic analysis file, and generating a first N-gram feature set;
step 3, carrying out feature reduction on the first N-gram feature set to obtain a second N-gram feature set;
step 4, converting the second N-gram feature set into a binary feature vector set, inputting a machine learning classification model for training and testing, and obtaining a malicious software classifier;
step 5, using the malicious software classifier to detect malicious software, assisting network security analysts in finding malicious software.
With reference to the first aspect, in one implementation manner, the malware sample in the step 1 includes worms, trojans and viruses, and the application software sample includes more than one application software; the malware sample and the application software sample are in PE (Portable executable) file format.
With reference to the first aspect, in one implementation manner, the dynamic analysis file in step 1 is obtained through the artificial intelligence sandbox SNDBOX; SNDBOX first scans the sample using a static detection algorithm, then uses a dynamic engine to simulate manual operation to open the sample and analyzes the behavior of the sample during execution, and finally combines the detection results of the static detection algorithm and the dynamic engine to generate the dynamic analysis file; the static detection algorithm comprises static scanning and an antivirus software detection engine.
Compared with code fragments generated by decompilation, the dynamic analysis file generated by the artificial intelligence sandbox SNDBOX is more practical and can effectively improve the detection precision of malicious software. Code obfuscation techniques currently evolve rapidly, and many traditional malware families produce new variants by applying code obfuscation to the code of the original malware. Machine learning classifiers trained on code fragments obtained by decompilation cannot cope with such new variants. In contrast, because different malware of the same family differs only at the static code level, its runtime behavior does not differ significantly, and this behavior can be detected by the artificial intelligence sandbox SNDBOX. Therefore, choosing the dynamic analysis file generated by SNDBOX as the starting point for the classifier plays a decisive role in the accuracy of the classifier.
With reference to the first aspect, in one implementation manner, obtaining the sample key information in step 2 includes: based on the dynamic analysis file obtained in step 1, analyzing the format of the dynamic analysis file to summarize the characteristics of key information such as API calls, function parameters and parameter positions, and then writing a parsing program based on these characteristics to extract from the dynamic analysis file the sample key information used in malware analysis. Compared with extracting N-grams from the entire dynamic analysis file, extracting only the key information greatly reduces the number of N-grams and helps improve the overall performance and accuracy of the system.
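The parsing step described above can be illustrated with a minimal sketch. The source does not give the layout of the dynamic analysis file, so the JSON structure assumed here (`behavior`, `processes`, `calls`, `api`, `arguments`) and the function name `extract_api_calls` are hypothetical and would need to be adapted to the real file format.

```python
import json

def extract_api_calls(report):
    """Collect (API name, argument names) pairs from a parsed report dict.

    The layout behavior -> processes -> calls is an assumption made for
    illustration; adjust the keys to the actual dynamic analysis file.
    """
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            args = call.get("arguments", {})
            # Keep the API name and the sorted argument names as key information.
            calls.append((call.get("api", ""), tuple(sorted(args))))
    return calls

# Hypothetical report fragment used only to exercise the parser.
raw = """{"behavior": {"processes": [{"calls": [
  {"api": "CreateFileW", "arguments": {"filepath": "a.txt"}},
  {"api": "WriteFile", "arguments": {"length": 3, "buffer": "abc"}}
]}]}}"""
key_info = extract_api_calls(json.loads(raw))
```

Only the fields of interest are kept, which is what keeps the later N-gram count small compared with processing the whole file.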
With reference to the first aspect, in one implementation manner, the first N-gram feature set in step 2 is generated from the sample key information using an N-gram algorithm.
In step 2, the N-gram algorithm is adopted in the feature extraction stage because creating a machine learning model requires feature vectors of fixed length, which the N-gram algorithm provides; it is an algorithm based on a statistical language model. Each byte fragment is called a gram. By counting the occurrence frequencies of all grams and filtering according to a reasonable threshold, a list of key grams is formed, which is the vector feature space of the text; each gram in the list is one dimension of the feature vector. Creating feature vectors with the N-gram algorithm is an essential step for the subsequent machine learning training.
In combination with the first aspect, in one implementation, step 3 uses the TF-IDF (term frequency-inverse document frequency) algorithm to evaluate the importance of each N-gram for identifying malware: if an N-gram appears with high frequency in the N-gram set of malware and rarely in the N-gram set of normal application software, it is a key N-gram feature to be selected. The N-gram feature set screened by the TF-IDF algorithm is denoted the second N-gram feature set.
Because the feature set generated by the N-gram algorithm is too large, feature reduction, that is, feature dimensionality reduction, is needed to reduce the number of features. Because the basic unit of the N-gram feature set is a gram, namely a character string, feature reduction adopts the TF-IDF algorithm used in information retrieval and text mining. The algorithm filters out common N-gram features and retains important ones, which helps improve the precision and performance of training and testing the machine learning classification models in the next stage.
With reference to the first aspect, in one implementation manner, the machine learning classification models in step 4 include naive Bayes, decision tree classification, random forest, and logistic regression models; the binary feature vector set converted from the second N-gram feature set is divided into a training sample set and a test sample set in a ratio of 80% to 20%, which are assigned to the training stage and the test stage respectively; to verify the accuracy of the different machine learning classification models, the test is repeated two or more times, and the training sample set and test sample set selected each time are disjoint.
With reference to the first aspect, in one implementation manner, when the malware classifier is used to detect malware in step 5, two detection modes are supported:
mode 1: each machine learning classification model can independently examine the software under test and judge whether it is malicious software; any one machine learning classification model may be selected during detection;
mode 2: the four machine learning classification models each examine the software under test, and the evaluation indexes of the four models are calculated separately; a comprehensive evaluation index is obtained by averaging all the evaluation indexes, and whether the software under test is malicious software is judged accordingly.
With reference to the first aspect, in one implementation manner, the evaluation indexes of the four machine learning classification models in step 5 include model recognition degree, precision, recall and F1 score.
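One possible reading of mode 2 is sketched below. The source does not specify exactly which per-model values are averaged or how the final judgment is made, so the per-model scores in [0, 1], the 0.5 threshold and the function name `ensemble_decision` are all assumptions made for illustration.

```python
def ensemble_decision(model_scores, threshold=0.5):
    """Average per-model scores and compare the mean against a threshold.

    Both the score semantics and the threshold are assumptions; the source
    only states that the per-model indexes are averaged.
    """
    combined = sum(model_scores.values()) / len(model_scores)
    return combined, combined >= threshold

# Hypothetical per-model scores for one piece of software under test.
model_scores = {
    "naive_bayes": 0.9,
    "decision_tree": 0.4,
    "random_forest": 0.7,
    "logistic_regression": 0.8,
}
combined, is_malware = ensemble_decision(model_scores)
```

Averaging lets a single dissenting model (here the decision tree) be outvoted by the others, which is the practical difference from mode 1.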
In a second aspect, a malware detection system based on N-gram and machine learning is disclosed, comprising a sandbox detection module, a feature generation module, a feature reduction module, and a machine learning module.
The sandbox detection module is used for carrying out dynamic behavior analysis on the malicious software samples and the application software samples using the artificial intelligence sandbox SNDBOX to obtain a dynamic analysis file;
the feature generation module is used for obtaining sample key information based on the dynamic analysis file and generating a first N-gram feature set using an N-gram algorithm; the sample key information is a sequence of segments comprising API calls, function parameters and parameter position information;
the feature reduction module is used for carrying out feature reduction on the first N-gram feature set using the TF-IDF algorithm to obtain a second N-gram feature set;
the machine learning module is used for receiving the second N-gram feature set, converting it into a binary feature vector set, and inputting it into the machine learning classification models for training and testing to obtain a malicious software classifier; malware detection is performed using the malicious software classifier.
Beneficial effects: in view of the strong destructiveness, fast iteration and strong detection-evasion capability of malicious software, the invention provides a malicious software detection method and system based on N-gram and machine learning. The method and system combine sandbox dynamic analysis, the N-gram feature algorithm, the TF-IDF feature reduction algorithm and machine learning classification models to detect malicious software. On the one hand, the method provides an effective feature extraction and representation algorithm: by selecting a group of key features from the analysis file generated during dynamic analysis and performing feature reduction, it ensures the accuracy and training performance of the classifier. On the other hand, the method builds on the capabilities of the machine learning field, provides a highly adaptive solution, and supports multiple machine learning classification models. The trained malware classifier can classify malware automatically, supports training on larger sample sets, provides effective help for malware analysis work, and greatly reduces the cost of malware analysis.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
Fig. 1 is a schematic flow chart of a detection method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of the pseudo-code for creating binary feature vectors in the detection method provided in an embodiment of the present application.
Fig. 3 is a schematic diagram of a Bayesian classification process of a detection method according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings.
The first embodiment of the invention discloses a malicious software detection method based on N-gram and machine learning, which is shown in figure 1 and comprises the following steps:
step 1, collecting a malicious software sample and an application software sample, and dynamically analyzing the sample to obtain a dynamic analysis file;
step 2, obtaining sample key information based on a dynamic analysis file, and generating a first N-gram feature set;
step 3, carrying out feature reduction on the first N-gram feature set to obtain a second N-gram feature set;
step 4, converting the second N-gram feature set into a binary feature vector set, inputting a machine learning classification model for training and testing, and obtaining a malicious software classifier;
step 5, detecting malicious software by using the malicious software classifier.
In the first embodiment, the data corpus consists of 60 malware samples and 60 application software samples. The 60 malware samples belong to different types of malware, including Trojans, viruses, worms and the like, and the 60 normal application software samples likewise cover multiple types of application software; the malware samples and the application software samples are in PE file format.
From the corpus containing both malware and application software samples, a software sample is selected and manually submitted to the artificial intelligence sandbox SNDBOX. SNDBOX scans the file using static detection algorithms such as static scanning and an antivirus software detection engine. The dynamic engine then simulates manual operation to open the file, and the behavior of the file during execution is analyzed. Finally, the detection results of the static detection algorithm and the dynamic engine are combined to generate a dynamic analysis file, which provides the basis for subsequent analysis. In this embodiment, the artificial intelligence sandbox SNDBOX is the open-source sandbox Cuckoo Sandbox, which spans four operating system platforms (Windows, Android, Linux and Darwin) and supports binary PE files (exe, dll, com), PDF files, Office files, URLs (Uniform Resource Locators), HTML (HyperText Markup Language) files, scripts such as PHP (Hypertext Preprocessor), VB (Visual Basic) and Python, jar packages, zip files, and so on; it can analyze the static binary data of malicious files as well as the behavior of processes, networks, files and so on during dynamic execution.
In a first embodiment, the process of converting an original, ambiguous, broad input into a different feature set is referred to as feature extraction. The main goal of this process is to select features that can help build an efficient malware detection system. At present, in the direction of malware detection, common feature extraction algorithms comprise binary feature extraction, frequency weight feature extraction, hidden Markov models, N-gram algorithms and the like. Considering that a machine learning algorithm needs a fixed-length feature vector to create a learning model, the method adopts an N-gram algorithm in a feature extraction stage.
First, key information is extracted from the dynamic analysis file obtained by sandbox detection: the format of the dynamic analysis file is analyzed, and the characteristics of the key information (API calls, function parameters and parameter positions) are summarized; a parsing program is written based on these characteristics to extract from the dynamic analysis file the sample key information used in malware analysis. Feature generation is then performed on the sample key information using the N-gram algorithm. N-gram is an algorithm based on a statistical language model, in which an N-gram model corresponds to a Markov chain of order N-1. The basic idea is to slide a window of size N over the content of the text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequencies of all grams are counted and filtered according to a preset threshold to form a list of key grams, that is, the vector feature space of the text, which is denoted the first N-gram feature set. Each gram in the first N-gram feature set is one dimension of the feature vector. In this embodiment, N ranges from 1 to 6, the threshold is set dynamically for different samples, and in general the top 20% of N-grams by occurrence frequency are retained.
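The sliding-window and frequency-filtering steps can be sketched as follows. This is a minimal illustration, not the patented implementation: it slides the window over a sequence of API-call tokens rather than raw bytes, and the function names `extract_ngrams` and `top_fraction` and the sample trace are hypothetical.

```python
from collections import Counter

def extract_ngrams(tokens, n_min=1, n_max=6):
    """Slide a window of every size n in [n_min, n_max] over the token sequence."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

def top_fraction(grams, keep=0.20):
    """Keep the most frequent `keep` fraction of distinct grams (at least one)."""
    counts = Counter(grams)
    k = max(1, int(len(counts) * keep))
    return [gram for gram, _ in counts.most_common(k)]

# Hypothetical API-call trace; a real one would come from the sandbox report.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
grams = extract_ngrams(trace, n_min=1, n_max=2)
first_feature_set = top_fraction(grams, keep=0.20)
```

With N limited to 1 and 2 the trace yields nine grams, six of them distinct, so a 20% cut keeps a single gram; in practice the cut would be tuned per sample set, as the text notes.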
In the first embodiment, since the feature set generated by the N-gram algorithm is too large, feature reduction is required. Since the basic unit of the N-gram feature set is a gram, i.e., a string, feature reduction employs the TF-IDF algorithm used in information retrieval and text mining. The algorithm evaluates how important a word is to a set of files. The core idea of the TF-IDF algorithm is: if a word appears in one file with a high term frequency (TF) and rarely appears in other files, the word or phrase is considered to have good discriminating power and is suitable for classification.
TF is the term frequency, which indicates how often a term appears in a text. This number is typically normalized, usually by dividing the term count by the total number of terms in the document, to prevent bias toward long documents. The formula for the term frequency TF is as follows:
$$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}$$
where $n_{ij}$ is the number of occurrences of word $t_i$ in file $d_j$, and the denominator is the total number of occurrences of all words in file $d_j$. Here $j$ is the index of a file in the group of files ($j \geq 0$), and $i$ is the index of a word $t$ in a file ($i \geq 0$). $tf_{ij}$ denotes the frequency of word $t_i$ in file $d_j$.
IDF is the inverse document frequency. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. The fewer the documents that contain the word t, the larger the IDF, and the better the term distinguishes between classes. The IDF formula is as follows:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of files containing the word $t_i$.
The formula of TF-IDF is as follows:
$$\text{tf-idf}_{ij} = tf_{ij} \times idf_i$$
TF-IDF is the product of term frequency and inverse document frequency. A high term frequency within a particular document, combined with a low document frequency of the term across the whole document set, yields a high TF-IDF weight. Thus, TF-IDF tends to filter out common words and preserve important words.
For the N-gram feature set, the TF-IDF algorithm is used to perform feature reduction by evaluating the importance of each N-gram for identifying malicious software: if an N-gram appears with high frequency in the N-gram set of malicious software and rarely appears in the N-gram set of normal application software, it is a key N-gram feature to be selected. The N-gram feature set screened by the TF-IDF algorithm is denoted the second N-gram feature set and is used for training and testing the machine learning classifiers in the next stage.
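The TF-IDF weighting and the subsequent selection can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each sample's N-gram list is treated as one document, and features are ranked by their best TF-IDF score in any document, a ranking rule the source does not specify. The names `tf_idf` and `select_features` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf} dict per document.

    tf_ij = n_ij / sum_k n_kj, idf_i = log(|D| / |{j : t_i in d_j}|).
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    total_docs = len(documents)
    scores = []
    for doc in documents:
        counts = Counter(doc)
        length = sum(counts.values())
        scores.append({
            term: (count / length) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def select_features(documents, keep):
    """Rank terms by their highest tf-idf in any document and keep the top `keep`."""
    best = {}
    for doc_scores in tf_idf(documents):
        for term, score in doc_scores.items():
            best[term] = max(best.get(term, 0.0), score)
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:keep]

# Hypothetical documents: each is the N-gram list of one sample.
malware_doc = ["inject", "inject", "open"]
benign_doc = ["open", "read"]
scores = tf_idf([malware_doc, benign_doc])
second_feature_set = select_features([malware_doc, benign_doc], keep=2)
```

In this toy corpus the gram `open` occurs in both documents, so its IDF is log(2/2) = 0 and it is filtered out, while `inject` is retained.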
In the first embodiment, the second N-gram feature set is converted into binary feature vectors by encoding; the pseudo-code for converting it into a binary feature vector set is shown in FIG. 2. The converted binary feature vectors are passed to the machine learning classification models for training and testing, the machine learning classification models comprising naive Bayes, decision tree classification, random forest and logistic regression.
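Since the pseudo-code of FIG. 2 is not reproduced in the text, the following is only a plausible sketch of the encoding, assuming each vector position records whether the corresponding selected N-gram occurs in the sample. The function name `to_binary_vector` and the example grams are hypothetical.

```python
def to_binary_vector(sample_grams, feature_set):
    """Return 1 for each selected feature present in the sample, else 0."""
    present = set(sample_grams)
    return [1 if feature in present else 0 for feature in feature_set]

# Hypothetical selected features (second N-gram feature set) and one sample.
feature_set = [("CreateFileW",), ("WriteFile", "CloseHandle"), ("RegSetValueExW",)]
sample = [("CreateFileW",), ("WriteFile",), ("WriteFile", "CloseHandle")]
vector = to_binary_vector(sample, feature_set)
```

Every sample maps to a vector of the same length as the feature set, which gives the classifiers the fixed-length input they require.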
Naive Bayes is based on Bayes' theorem, which expresses the probability of an event occurring given prior knowledge of conditions related to the event. Probabilistic inference using the corresponding prior knowledge is Bayesian inference. In practical machine learning classification problems, training a naive Bayes classifier consists of estimating the class prior probabilities from the training set and estimating the conditional probability of each attribute.
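The two estimation steps (class priors and per-attribute conditional probabilities) can be sketched for binary feature vectors as below. This is a minimal Bernoulli naive Bayes written for illustration; the Laplace smoothing, the function names and the toy data are assumptions that do not come from the source.

```python
import math

def train_naive_bayes(vectors, labels):
    """Estimate class priors and per-feature P(feature = 1 | class)."""
    model = {}
    n_features = len(vectors[0])
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        prior = len(rows) / len(vectors)
        # Laplace smoothing (an assumption) keeps probabilities away from 0 and 1.
        likelihood = [
            (sum(row[f] for row in rows) + 1) / (len(rows) + 2)
            for f in range(n_features)
        ]
        model[cls] = (prior, likelihood)
    return model

def predict_naive_bayes(model, vector):
    """Return the class with the highest log posterior."""
    def log_posterior(cls):
        prior, likelihood = model[cls]
        total = math.log(prior)
        for value, p in zip(vector, likelihood):
            total += math.log(p if value else 1.0 - p)
        return total
    return max(model, key=log_posterior)

# Hypothetical toy data: label 1 = malware, label 0 = benign.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
model = train_naive_bayes(X, y)
```

Working in log space avoids numerical underflow when the feature vectors have many dimensions.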
Decision tree classification comprises three steps: feature selection, decision tree generation and pruning. Feature selection means selecting one feature from the features in the training data as the splitting criterion of the current node; different quantitative criteria for selecting the feature give rise to different decision tree algorithms. Decision tree generation recursively creates child nodes from top to bottom according to the selected feature evaluation criterion, and the tree stops growing when the data set can no longer be split. Finally, pruning reduces the size of the tree and alleviates overfitting.
A random forest is a classification model that contains multiple decision trees. For one sample, each tree may produce a different classification result. The random forest takes a vote among the trees and selects the class with the most votes as the final result. Random forests can be applied to large-scale data sets and can process input samples with high-dimensional features.
When logistic regression is used for classification, a cost function is established and the optimal model parameters are then solved by an optimization method. The essence of logistic regression is to assume that the data follow a particular distribution and then estimate the parameters by maximum likelihood estimation.
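For reference, the standard textbook form of the logistic regression model and its cost function is given below. It is not stated in the source; the symbols $w$, $b$, $x^{(k)}$, $y^{(k)}$ and $m$ (weights, bias, the $k$-th feature vector, its 0/1 label, and the number of training samples) are introduced here for illustration only.

```latex
h_w(x) = \frac{1}{1 + e^{-(w^\top x + b)}}, \qquad
J(w, b) = -\frac{1}{m} \sum_{k=1}^{m}
  \left[ y^{(k)} \log h_w\!\left(x^{(k)}\right)
       + \left(1 - y^{(k)}\right) \log\!\left(1 - h_w\!\left(x^{(k)}\right)\right) \right]
```

Minimizing $J(w, b)$ is equivalent to maximizing the likelihood of the labels under a Bernoulli model with success probability $h_w(x)$, which is the maximum likelihood estimation referred to above.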
In the classification process, supervised machine learning algorithms are used: naive Bayes, decision tree, random forest and logistic regression. After the binary feature vector set converted from the second N-gram feature set is randomly shuffled, it is divided into a training sample set and a test sample set in a ratio of 80% to 20%, which are assigned to the training stage and the test stage respectively. To verify the accuracy of the different machine learning algorithms, this embodiment employs 10-fold cross-validation: the training sample set and the test sample set are each randomly divided into 10 disjoint sets, so that the training and testing stages are performed 10 times. Fig. 3 shows the confusion matrices of the four classifiers in the experiment, which reflect the accuracy of the classification system. Each column of a confusion matrix represents a predicted class, and the column total is the number of samples predicted to be that class; each row represents the true class of the data, and the row total is the number of samples of that class. The values in each column give the number of samples of each true class predicted as that class.
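The shuffling, the 80/20 split, the disjoint folds and the confusion matrix layout can be sketched with the standard library alone. This is an illustrative sketch: the fixed seed, the function names and the choice to partition the whole sample list into folds are assumptions, and the labels passed to `confusion_matrix` are made up.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle, then split into disjoint training and test sets (80/20 by default)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]

def k_folds(samples, k=10, seed=0):
    """Partition the samples into k disjoint folds of near-equal size."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def confusion_matrix(actual, predicted, labels=(1, 0)):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# 120 hypothetical sample IDs, matching the 60 + 60 corpus size in the text.
sample_ids = list(range(120))
train, test = train_test_split(sample_ids)
folds = k_folds(sample_ids, k=10)
cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With 120 samples the split gives 96 training and 24 test samples, and each of the 10 folds holds 12 samples; with labels ordered (1, 0), `cm[0][0]` holds the true positives and `cm[1][1]` the true negatives.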
In the first embodiment, to compare the classification performance of the four machine learning algorithms, their precision is evaluated; the evaluation indexes include model recognition degree, precision (Precision), recall (Recall) and F1 score (F1-score). First, the terms and definitions related to the evaluation are introduced:
true positive TP: predicted positive, actually positive;
false positive FP: predicted positive, actually negative;
false negative FN: predicted negative, actually positive;
true negative TN: predicted negative, actually negative;
model recognition degree: (TP + TN) / total number of samples;
precision: P = TP / (TP + FP);
recall: R = TP / (TP + FN);
F1 score: F1 = 2 × (P × R) / (P + R).
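The definitions above translate directly into code. The sketch below counts the four confusion-matrix cells from true and predicted labels and derives the four evaluation indexes from them; the function names and the example labels are hypothetical, and the guards against division by zero are an added convention rather than part of the definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count TP, FP, FN, TN for a binary classification task.
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluate(tp, fp, fn, tn):
    # Model recognition degree, precision, recall, and F1 score.
    total = tp + fp + fn + tn
    recognition = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recognition": recognition, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = malware, 0 = benign application software.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = evaluate(*confusion_counts(y_true, y_pred))
```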
For the four machine learning algorithms, the evaluation results, combined with the confusion matrix statistics in Fig. 3, are shown in the following table. The experimental results show that the logistic regression model is superior to the other algorithms in overall model recognition degree, and its score is also the highest in the F1-score comparison. The naive Bayes algorithm is superior to the other algorithms in precision, indicating that when it flags software as malware, the software is malware with 98% likelihood. In addition, the logistic regression algorithm has the highest recall.
Classifier | Model recognition degree | Precision P | Recall R | F1-score
Naive Bayes | 82.8% | 0.98 | 0.73 | 0.84
Decision tree | 78.6% | 0.80 | 0.87 | 0.83
Random forest | 79.3% | 0.80 | 0.89 | 0.84
Logistic regression | 84.5% | 0.81 | 0.97 | 0.89
In addition, the experimental results of the invention are compared with the results of other similar studies; the model recognition degree of each study is shown in the following table. Compared with the related research, the feature selection algorithm of this method selects high-quality features, which effectively improves the classification accuracy of the machine learning classification models.
Classifier This project Study [1] Study [2-4] Study [5] Study [6]
Naive Bayes 82.8% 81.2% 80.6% 79.2%
Decision tree 78.6% 67.3%
Random forest 79.3% 77.2% 74.3% 75.1% 70.7%
Logistic regression 84.5% 82.3%
The references cited in the above studies are as follows:
[1] Sayfullina, L.; Eirola, E.; Komashinsky, D.; Palumbo, P.; Miche, Y.; Lendasse, A.; Karhunen, J. Efficient detection of zero-day android malware using normalized bernoulli naive bayes. In Proceedings of the IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20-22 August 2015; Volume 1, pp. 198-205.
[2] Garg, V.; Yadav, R.K. Malware Detection based on API Calls Frequency. In Proceedings of the 4th International Conference on Information Systems and Computer Networks, Mathura, India, 21-22 November 2019; pp. 400-404.
[3] Salehi, Z.; Sami, A.; Ghiasi, M. Using feature generation from API calls for malware detection. Comput. Fraud Secur. 2014, 9, 9-18.
[4] Kumar, B.J.; Naveen, H.; Kumar, B.P.; Sharma, S.S.; Villegas, J. Logistic regression for polymorphic malware detection using ANOVA F-test. In Proceedings of the International Conference on Innovations in Information, Embedded and Communication Systems, Coimbatore, India, 17-18 March 2017; pp. 1-5.
[5] Devesa, J.; Santos, I.; Cantero, X.; Penya, Y.K.; Bringas, P.G. Automatic Behaviour-based Analysis and Classification System for Malware Detection. ICEIS 2010, 2, 395-399.
[6] Salehi, Z.; Ghiasi, M.; Sami, A. A Miner for Malware Detection Based on API Function Calls and Their Arguments. In Proceedings of The 16th CSI International Symposium on Artificial Intelligence and Signal Processing, Shiraz, Fars, Iran, 2-3 May 2012; pp. 563-568.
In the first embodiment, when the malware classifier is used to detect malware in step 5, two detection modes are supported:
mode 1: each machine learning classification model can independently examine the software under test and judge whether it is malware; any one of the machine learning classification models may be selected during detection;
mode 2: the four machine learning classification models each examine the software under test, and the evaluation indexes of the four models are calculated respectively; the evaluation indexes include model recognition degree, precision, recall, and F1 score. All the evaluation indexes are averaged to obtain a comprehensive evaluation index, which is used to judge whether the software under test is malware. The comprehensive evaluation index is given as a probability and compared with a trusted threshold; if it is greater than the trusted threshold, the software is judged to be malware. The trusted threshold can be adjusted according to the training stage, and adjusting it can yield higher classification accuracy.
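The thresholding step of mode 2 can be illustrated with a short sketch. The text does not specify how each model's contribution to the comprehensive index is produced, so this example assumes, purely for illustration, that every model reports a probability that the software under test is malicious; the model names, the probability values, and the threshold are all hypothetical.

```python
def detect_single(model_probability, threshold=0.5):
    # Mode 1: one selected model decides on its own.
    return model_probability > threshold

def detect_combined(model_probabilities, threshold=0.5):
    # Mode 2 (assumed reading): average the per-model values into one
    # composite index and compare it with the trusted threshold.
    composite = sum(model_probabilities.values()) / len(model_probabilities)
    return composite, composite > threshold

# Hypothetical per-model outputs for one piece of software under test.
scores = {
    "naive_bayes": 0.62,
    "decision_tree": 0.40,
    "random_forest": 0.71,
    "logistic_regression": 0.83,
}
composite, is_malware = detect_combined(scores, threshold=0.6)
```

Raising the threshold makes the combined detector more conservative: with the same scores, a threshold of 0.7 would no longer flag the sample.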
The second embodiment of the invention discloses a malicious software detection system based on N-gram and machine learning, which comprises a sandbox detection module, a feature generation module, a feature specification module, and a machine learning module.
The sandbox detection module is used to perform dynamic behavior analysis on the malicious software samples and the application software samples with the artificial intelligence sandbox SNDBOX to obtain dynamic analysis files;
the feature generation module is used to obtain sample key information from the dynamic analysis file and to generate a first N-gram feature set with an N-gram algorithm; the sample key information is a segment sequence comprising API calls, function parameters, and parameter position information;
the feature specification module is used to perform feature reduction on the first N-gram feature set with the TF-IDF algorithm to obtain a second N-gram feature set;
the machine learning module is used to receive the second N-gram feature set, convert it into a binary feature vector set, and input it into the machine learning classification models for training and testing to obtain a malicious software classifier; malware detection is then performed with the malicious software classifier.
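The data flow through the feature generation, feature reduction, and vectorization stages can be sketched as below. This is a simplified illustration, not the patented implementation: it uses API names only (the described key information also carries function parameters and parameter positions), the scoring formula is one assumed TF-IDF variant that favors N-grams frequent in malware and rare in benign samples, and the call traces are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_features(malware_docs, benign_docs, top_k=2):
    # Term frequency over all malware samples, weighted by an inverse
    # document frequency computed over the benign samples (assumed variant).
    tf = Counter(g for doc in malware_docs for g in doc)
    total = sum(tf.values())
    df_benign = Counter(g for doc in benign_docs for g in set(doc))
    n_benign = len(benign_docs)
    scores = {
        g: (count / total) * math.log((1 + n_benign) / (1 + df_benign[g]))
        for g, count in tf.items()
    }
    # Highest score first; ties are broken by the N-gram itself.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]

def to_binary_vector(doc, features):
    # 1 if the selected N-gram occurs in the sample, else 0.
    present = set(doc)
    return [1 if f in present else 0 for f in features]

# Invented API call traces; the API names are real Windows functions.
malware = [
    ["CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"],
    ["CreateFileW", "WriteFile", "RegSetValueExW", "CreateFileW"],
]
benign = [
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "CloseHandle"],
]
malware_docs = [ngrams(t) for t in malware]
benign_docs = [ngrams(t) for t in benign]
features = select_features(malware_docs, benign_docs)
vectors = [to_binary_vector(d, features) for d in malware_docs + benign_docs]
```

The resulting binary vectors are what a classifier would consume; the bigram shared by malware and benign traces is scored lower than the ones seen only in malware.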
The invention provides a method and a system for detecting malicious software based on N-gram and machine learning. There are many methods and ways to implement this technical scheme, and the above description is only a specific embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (5)

1. A malicious software detection method based on N-gram and machine learning is characterized by comprising the following steps:
step 1, collecting a malicious software sample and an application software sample, and dynamically analyzing the sample to obtain a dynamic analysis file;
step 2, obtaining sample key information based on a dynamic analysis file, and generating a first N-gram feature set;
step 3, carrying out feature reduction on the first N-gram feature set to obtain a second N-gram feature set;
step 4, converting the second N-gram feature set into a binary feature vector set, inputting a machine learning classification model for training and testing, and obtaining a malicious software classifier;
step 5, detecting malicious software by using a malicious software classifier;
the malicious software samples in the step 1 comprise worms, trojans and viruses, and the application software samples comprise more than one type of application software; the malicious software sample and the application software sample are in PE file format;
in step 1, the sample is dynamically analyzed through the artificial intelligence sandbox SNDBOX to obtain the dynamic analysis file; the SNDBOX sandbox first scans the sample with a static detection algorithm, then opens the sample with a dynamic engine that simulates manual operation, analyzes the behavior of the sample during execution, and generates the dynamic analysis file by jointly considering the detection results of the static detection algorithm and the dynamic engine; the static detection algorithm comprises static scanning and an antivirus software detection engine;
obtaining the sample key information in step 2 comprises: analyzing the format of the dynamic analysis file and summarizing the characteristics of the key information, namely API calls, function parameters, and parameter positions; writing a parsing program based on these characteristics, and extracting the sample key information of the malware analysis process from the dynamic analysis file;
the first N-gram feature set in step 2 is generated with an N-gram algorithm based on the sample key information;
step 3 adopts the TF-IDF algorithm to evaluate how important each N-gram is for identifying malware; if an N-gram appears frequently in the N-gram set of the malware but rarely in the N-gram set of the application software, it is selected as a key N-gram feature, and the N-gram feature set screened by the TF-IDF algorithm is recorded as the second N-gram feature set.
2. The method for detecting malware based on N-gram and machine learning according to claim 1, wherein the machine learning classification models in step 4 comprise naive Bayes, decision tree classification, random forest, and logistic regression models; the binary feature vector set converted from the second N-gram feature set is divided into a training sample set and a test sample set in a proportion of 80% to 20%, which are assigned to the training stage and the test stage, respectively; with 10-fold cross-validation, the training sample set and the test sample set are each randomly divided into 10 disjoint sets, so that the training and testing stages are performed 10 times.
3. The method for detecting malicious software based on N-gram and machine learning according to claim 2, wherein two detection modes are supported when the malicious software classifier is used for detecting malicious software in the step 5:
mode 1: each machine learning classification model can independently detect detected software and judge whether the detected software is malicious software or not; selecting any machine learning classification model during detection;
mode 2: the four machine learning classification models each detect the detected software, and the evaluation indexes of the four machine learning classification models are calculated respectively; all the evaluation indexes are averaged to obtain a comprehensive evaluation index, which is used to judge whether the detected software is malicious software.
4. The method for detecting malware according to claim 3, wherein the evaluation indexes of the four machine learning classification models in step 5 include model recognition degree, precision, recall, and F1 score.
5. A malicious software detection system based on N-gram and machine learning, which is applied to the malicious software detection method based on N-gram and machine learning according to any one of claims 1-4, and is characterized by comprising a sandbox detection module, a feature generation module, a feature specification module and a machine learning module,
the sandbox detection module is used to perform dynamic behavior analysis on the malicious software samples and the application software samples with the artificial intelligence sandbox SNDBOX to obtain dynamic analysis files;
the feature generation module is used to obtain sample key information from the dynamic analysis file and to generate a first N-gram feature set with an N-gram algorithm; the sample key information is a segment sequence comprising API calls, function parameters, and parameter position information;
the feature specification module is used to perform feature reduction on the first N-gram feature set with the TF-IDF algorithm to obtain a second N-gram feature set;
the machine learning module is used to receive the second N-gram feature set, convert it into a binary feature vector set, and input it into the machine learning classification models for training and testing to obtain a malicious software classifier; malware detection is then performed with the malicious software classifier.
CN202110972755.8A 2021-08-24 2021-08-24 Malicious software detection method and system based on N-gram and machine learning Active CN113709134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110972755.8A CN113709134B (en) 2021-08-24 2021-08-24 Malicious software detection method and system based on N-gram and machine learning

Publications (2)

Publication Number Publication Date
CN113709134A CN113709134A (en) 2021-11-26
CN113709134B true CN113709134B (en) 2023-06-20

Family

ID=78654247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110972755.8A Active CN113709134B (en) 2021-08-24 2021-08-24 Malicious software detection method and system based on N-gram and machine learning

Country Status (1)

Country Link
CN (1) CN113709134B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392174B (en) * 2014-10-23 2016-04-06 腾讯科技(深圳)有限公司 The generation method of the proper vector of application program dynamic behaviour and device
CN105138913A (en) * 2015-07-24 2015-12-09 四川大学 Malware detection method based on multi-view ensemble learning
CN106778268A (en) * 2016-11-28 2017-05-31 广东省信息安全测评中心 Malicious code detecting method and system
US11481492B2 (en) * 2017-07-25 2022-10-25 Trend Micro Incorporated Method and system for static behavior-predictive malware detection
US10733294B2 (en) * 2017-09-11 2020-08-04 Intel Corporation Adversarial attack prevention and malware detection system
WO2019075338A1 (en) * 2017-10-12 2019-04-18 Charles River Analytics, Inc. Cyber vaccine and predictive-malware-defense methods and systems
US11176589B2 (en) * 2018-04-10 2021-11-16 Ebay Inc. Dynamically generated machine learning models and visualization thereof
CN109753801B (en) * 2019-01-29 2022-04-22 重庆邮电大学 Intelligent terminal malicious software dynamic detection method based on system call

Also Published As

Publication number Publication date
CN113709134A (en) 2021-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant