CN110704840A - Convolutional neural network CNN-based malicious software detection method - Google Patents

Publication number
CN110704840A
Authority
CN
China
Prior art keywords
software
layer
cnn
neural network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910854560.6A
Other languages
Chinese (zh)
Inventor
芦天亮
杜彦辉
李国友
傅依娴
吴警
张翼翔
暴雨轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Original Assignee
CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Application filed by CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Priority to CN201910854560.6A
Publication of CN110704840A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides a convolutional neural network (CNN)-based malicious software detection method, which comprises the following steps. Step 1: collecting and analyzing a training set, and generating report files in json format through a Cuckoo sandbox. Step 2: vectorizing the reports in json format to obtain feature vectors. Step 3: feeding the feature vectors obtained in step 2 as input into an untrained CNN for training and learning to obtain a trained CNN. Step 4: processing the software to be tested in the same way as in steps 1 and 2 to obtain its feature vector, feeding the feature vector into the CNN trained in step 3, and judging through CNN model detection whether the software to be tested is malicious software or normal software. Compared with other machine learning algorithms and antivirus software, the method provided by the application achieves better technical effects in terms of detection rate and accuracy.

Description

Convolutional neural network CNN-based malicious software detection method
Technical Field
The application relates to the field of network security, and in particular to a convolutional neural network (CNN)-based malicious software detection method.
Background
The explosive growth of malicious software and the serious threat it poses to user machines and the network environment have gradually become a central problem in the field of cyberspace security.
Malware detection methods are currently divided mainly into two stages: a feature extraction stage and a detection stage. In the feature extraction stage, the extracted features are mainly static features and dynamic features; the corresponding extraction means are static feature extraction and dynamic feature extraction, and the corresponding detection methods are static detection and dynamic detection.
Static detection is mainly represented by static signatures. Static signature-based malware detection exploits the static properties of programs to distinguish benign software from malware. It requires examining malware and creating a different signature for each newly discovered sample. The signature may be based on bytecode, binary assembly instructions, imported dynamic link libraries (DLLs), or function and system calls. Schultz W et al. combine features of malware resource segments, DLLs, program string constants and bytecode information with data mining algorithms for static analysis.
In dynamic analysis, benign software and malware are distinguished by their behavior, which involves two key problems: behavior characterization and software behavior analysis techniques. In research on behavior characterization, software behavior is generally described using API call sequences, system feature counts or software code. In research on software behavior analysis techniques, the content of the behavior description is mainly extracted by dynamic analysis. Abroad, Bayer U et al. implemented a sandbox named TTAnalyze to run samples and analyze sample execution flow information; Willem obtained malicious code behavior analysis reports using CWSandbox, designed in the distributed systems laboratory of the University of Mannheim, Germany, and proposed a learning method based on the vector space model (VSM).
In China, early work by HanXiao et al. identified the malicious behavior type of a program based on user-layer function calls combined with the SVM machine learning algorithm. XiaoLinGui et al. extracted traffic information of a program during dynamic operation to identify its malicious behavior. Bye et al. proposed a fuzzy inference method that calculates the probability that a program is malicious based on Bayes' theorem. In recent years, researchers have increasingly tried to apply deep learning algorithms to malware detection, which has become an active research direction. Saxe et al. used a feedforward neural network to classify static analysis results but did not consider dynamic analysis results; under binary obfuscation, static analysis may not provide satisfactory classification output. Huang et al. used feedforward neural networks of up to four layers, but focused on evaluating multi-task learning.
At present, traditional malware analysis and detection technology cannot effectively protect the security of computer systems and the Internet, mainly for the following reasons:
(1) Malware authors use various obfuscation techniques to generate new variants of the same malware in order to circumvent signature-based identification. Taking antivirus software as an example of traditional protection technology, current antivirus software mainly uses signature-based detection, whose defect is that it can only detect known malware and cannot detect malware modified by code obfuscation and polymorphic techniques.
(2) With the rapid increase in the amount of malicious software, conventional signature-based static scanning and software behavior-based malware detection are prone to false positives and false negatives, and increasingly fail to meet new requirements in the field of information security.
(3) In malware detection based on traditional machine learning algorithms, most machine learning structures are shallow and cannot train high-level abstract models of malicious software, so the detection effect needs to be improved.
Disclosure of Invention
Aiming at the defects existing in the traditional malicious software detection, the invention provides a novel malicious software detection method based on a Convolutional Neural Network (CNN).
A convolutional neural network CNN-based malware detection method, the method comprising the steps of:
step 1: collecting and analyzing a training set, wherein the training set is composed of normal software samples and malicious software samples, and the analyzing comprises generating report files in json format for the training set through a Cuckoo sandbox;
step 2: extracting dynamic API sequences from the reports in json format, and vectorizing the extracted software features to obtain feature vectors;
step 3: constructing a convolutional neural network (CNN) model, feeding the feature vectors obtained in step 2 as input into the untrained CNN model for training and learning, and training the CNN to an optimal state by adjusting parameters to obtain a trained CNN;
step 4: processing the software to be tested in the same way as in steps 1 and 2 to obtain a feature vector of the software to be tested, feeding the feature vector into the convolutional neural network model CNN trained in step 3, and judging through CNN model detection whether the software to be tested is malicious software or normal software.
Step 2 specifically comprises: writing a Python script to extract API behaviors from the json report and generate a txt document, and converting the txt document into feature vectors through One-Hot encoding.
In the method, the convolutional neural network CNN model in step 3 includes an input layer, a convolutional layer, a pooling layer and an output layer, where the convolutional layer converts text information into picture-like features through the following operations:
let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer, respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$; let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer, respectively; the input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$; denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; the $i$-th input feature matrix of the $l$-th layer is then also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$; the output calculation formula of the convolutional layer is as follows:
$$d_j^{(l)} = f\left( \sum_{i} d_i^{(l-1)} * w_{i,j}^{(l)} + b_j^{(l)} \right)$$
wherein $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, namely the sigmoid function;
$w_{i,j}^{(l)}$ is the weight matrix in the convolution kernel connecting the $i$-th input feature matrix to the $j$-th output feature matrix;
$b_j^{(l)}$ is the bias term of the $j$-th output feature matrix of the $l$-th layer.
The method draws on the application of convolutional neural networks to image and natural language processing to detect malicious software with a convolutional neural network, and uses One-Hot encoding in place of the commonly used Word2Vec Skip-gram model, thereby achieving a better software detection effect.
Drawings
FIG. 1 is a CNN-based malware detection flow;
FIG. 2 is a Cuckoo Sandbox structure;
FIG. 3 is an overview of a convolutional neural network architecture;
FIG. 4 is a convolution and pooling layer;
FIG. 5 is a comparison of Accuracy under different filters.
Detailed Description
In order to solve the problems in the prior art, the application provides a method for detecting and analyzing malicious software based on a convolutional neural network (CNN). Specifically, the application draws on the application of convolutional neural networks to image and natural language processing to detect malicious software with a convolutional neural network, and uses One-Hot encoding in place of the commonly used Word2Vec Skip-gram model, so that a better software detection effect is achieved.
The methods for detecting and analyzing malware proposed by the present application are described in detail below.
1 System framework
The present application detects and analyzes malware by using Convolutional Neural Networks (CNN). Fig. 1 is a CNN-based malware detection flow.
As shown in FIG. 1, in step 1 after the start, a training set is collected and analyzed; this work corresponds to the "collect reports" stage of FIG. 1. The training set consists of normal PE (Portable Executable) file samples and malicious PE file samples. A normal PE file sample is non-malicious, relatively safe software; a malicious PE file sample is malware that is harmful to the computer. The collected PE file samples, i.e. the training PE file set, are submitted through the Host (host machine) of the Cuckoo sandbox to the Guest (guest machine) for analysis; the Guest returns its monitoring records, and the Host generates a report file in json format through its analysis components. Both normal and malicious PE file samples are collected so that the deep learning CNN model can perform supervised learning in step 3 and fully learn the features of normal samples and of malicious samples, so that PE samples to be tested can be effectively classified as normal or malicious.
In step 2, dynamic API sequences are extracted from the collected json reports; this work corresponds to the "data preprocessing" stage of FIG. 1. A Python script is written to extract API behaviors and export them to a txt document, and the API sequence is vectorized through a word vector model, so that the function information of each line in the txt document is expressed by 0s and 1s.
In step 3, a convolutional neural network (CNN) model is constructed, the feature vectors preprocessed in step 2 are fed as input into the untrained (i.e. initialized) CNN model for training and learning, and the model is trained to an optimal state through parameter adjustment to obtain the trained CNN. After the learning rate, convolution kernel stride, dropout rate, activation function and bias term of the model are set, experiments are compared across several filter heights, and the parameter values that perform best in these experiments are taken to represent the optimally trained convolutional neural network, as shown in the technical effects section below.
The CNN model mainly comprises an input layer, a convolutional layer, a pooling layer and an output layer. The input layer receives the incoming feature vectors; the convolutional layer mainly converts the result processed by the word vector model into a two-dimensional matrix similar to an image; the pooling layer extracts the more salient features from the two-dimensional matrix, which are finally input into a softmax function for classification. This work corresponds to the "convolutional neural network training" stage of FIG. 1.
In step 4, the malware detection stage, the PE files to be tested, i.e. the test set, undergo the same processing as the report collection of step 1 and the data preprocessing of step 2; the resulting feature vectors are fed into the convolutional neural network model CNN trained in step 3, and CNN model detection judges whether each PE file to be tested is malicious software or normal software. This work corresponds to the "malware detection" stage of FIG. 1.
The CNN-based malware detection process of the present application has been briefly described above; steps 1 to 4 shown in FIG. 1 are described in detail below.
2 Collecting reports
Collecting reports is the process of collecting and analyzing the training set. The purpose of this step is to enable the deep learning CNN model to perform supervised learning in step 3 and fully learn the features of normal samples and of malicious samples, so that PE samples to be tested can finally be effectively classified as normal or malicious.
The collected training set consists of normal PE (Portable Executable) file samples and malicious PE file samples. A normal PE file sample is non-malicious, relatively safe software; a malicious PE file sample is malware that is harmful to the computer. The collected PE file samples, i.e. the training PE file set, are submitted through the Host (host machine) of the Cuckoo sandbox to the Guest (guest machine) for analysis; the Guest returns its monitoring records, and the Host generates a report file in json format through its analysis components.
A Cuckoo Sandbox environment needs to be set up before the sample reports are collected. FIG. 2 shows the structure of Cuckoo Sandbox. As shown in FIG. 2, Cuckoo Sandbox mainly consists of two parts, the Host Machine and the Guest Machine, which communicate with each other over a virtual network. The Host Machine comprises the Cuckoo Sandbox software, the virtual machine software VirtualBox and various analysis components, and is mainly responsible for starting the analysis of a sample, monitoring behavior, generating report files and so on. The Guest Machine is mainly responsible for executing malware and reporting the analysis results to the Host Machine.
The work flow of Cuckoo Sandbox is as follows:
(1) Cuckoo Sandbox executes its main program script and at the same time starts the Guest Machine virtual machine.
(2) The PE file sample to be analyzed is sent into the Guest Machine for execution by the upload script, together with the monitoring script.
(3) The PE file sample to be analyzed is executed in the Guest Machine while the monitoring script records various information about the sample.
(4) After the sample has been executed, the monitoring script sends the records to the Host Machine, where the virtual machine software resides, through the virtual network and the sharing function of the virtual machine software. The results are recorded and a report file in json format is generated through the analysis components. The report file contains the behavior information of the malicious software; at this point, the behavior of the malicious software has been successfully extracted by the behavior extraction engine.
(5) The virtual machine software uses a snapshot to restore the Guest Machine virtual machine to its initial state.
3 Data preprocessing
Because the convolutional neural network model does not accept raw text as input and can only process numerical data, the purpose of this step is to convert the extracted description of software behavior into data that the convolutional neural network model can process.
Data preprocessing extracts dynamic API sequences from the collected json reports: a Python script is written to extract API behaviors, and the API sequences are vectorized through a word vector model so that they are converted into a form the convolutional neural network model can process.
The API behavior is extracted as follows. The sequence of API function calls extracted by Python is stored in a txt document, with each line of the document representing one API call. Each line is divided into two parts separated by a space: the first part is the call category, corresponding to the category field in the original json report, and the second part is the name of the called API.
For example, a fragment of a json analysis report is shown below:
[Figure: json analysis report fragment]
Referring to the json analysis report fragment above, the main information extracted is the "category" and "api" fields of the json fragment; other parameter information is omitted. Since json files are composed primarily of dictionaries and lists, the extraction process first locates the "category" and "api" fields in the json fragment and then extracts the specific function information. The following is the extraction code fragment.
[Figure: extraction code fragment]
Referring to the json analysis report fragment, the content of the "category" field is "system" and the content of the "api" field is "LdrGetProcedureAddress"; combined, they form the line "system LdrGetProcedureAddress" of the txt document. The whole json report records a complex behavior process, and all function call sequences are finally assembled into txt format. A truncated txt fragment is as follows:
.......
system LdrGetProcedureAddress
process NtAllocateVirtualMemory
process NtAllocateVirtualMemory
exception SetUnhandledExceptionFilter
registry RegOpenKeyExW
registry RegQueryValueExW
registry RegCloseKey
resource FindResourceExW
resource FindResourceExA
resource FindResourceExW
resource FindResourceExW
.......
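The extraction step described above can be sketched in Python as follows. This is a minimal illustration, not the script used in the application, whose code fragment is not reproduced in this text. It assumes that the parsed report keeps its call records under behavior, processes, calls, and that each record carries "category" and "api" keys. This layout, the function names extract_api_lines and report_to_txt, and the miniature sample_report are all assumptions introduced here for illustration.

```python
import json


def extract_api_lines(report):
    """Return one 'category api' line per call found in a parsed report."""
    lines = []
    # Assumed layout: report["behavior"]["processes"][n]["calls"][k]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            category = call.get("category")
            api = call.get("api")
            if category and api:
                # Other parameter information (arguments etc.) is ignored.
                lines.append(f"{category} {api}")
    return lines


def report_to_txt(json_path, txt_path):
    """Read a json report from disk and write its API call lines to a txt file."""
    with open(json_path, encoding="utf-8") as fh:
        report = json.load(fh)
    with open(txt_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(extract_api_lines(report)) + "\n")


# Hypothetical miniature report used only to exercise the function.
sample_report = {
    "behavior": {
        "processes": [
            {
                "calls": [
                    {"category": "system", "api": "LdrGetProcedureAddress", "arguments": {}},
                    {"category": "process", "api": "NtAllocateVirtualMemory", "arguments": {}},
                ]
            }
        ]
    }
}
api_lines = extract_api_lines(sample_report)
```

Each resulting line has the same two-part form as the txt fragment above, with the category first and the API name second.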
Next, the text must be converted to numerical values. Text vectorization is the process of converting text into a numerical tensor, and One-Hot encoding is one method of text vectorization. One-Hot encoding uses an N-bit status register to encode N states, with only one bit valid at a time. It associates each word with a unique integer index i and then converts this index into a binary vector of length N (N being the dictionary size, corresponding to the N-bit status register above), in which only the i-th element is 1 and all other elements are 0.
Assume that the sample contains the following three API call sequences:
API1 API2 API7
API3 API2 API5
API4 API2 API6
(1) First, the three API call sequences are segmented into words to obtain a dictionary, and each word is given an index number:
API1:1;API2:2;API3:3;API4:4;API5:5;API6:6;API7:7
(2) Then, the vector of each feature word is obtained as follows:
API1->(1,0,0,0,0,0,0)
API2->(0,1,0,0,0,0,0)
API3->(0,0,1,0,0,0,0)
API4->(0,0,0,1,0,0,0)
API5->(0,0,0,0,1,0,0)
API6->(0,0,0,0,0,1,0)
API7->(0,0,0,0,0,0,1)
(3) Finally, the feature vectors of the three API call sequences are obtained:
API1 API2 API7->(1,1,0,0,0,0,1)
API3 API2 API5->(0,1,1,0,1,0,0)
API4 API2 API6->(0,1,0,1,0,1,0)
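The three steps above can be reproduced with a short Python sketch. The function names build_dictionary, one_hot and sequence_vector are hypothetical and chosen here for illustration. The sketch assumes that the dictionary is indexed in sorted word order, which matches the numbering API1:1 to API7:7 in this example, and that a sequence vector sets a 1 at the position of every word the sequence contains.

```python
def build_dictionary(sequences):
    """Step (1): collect the distinct words and give each a 1-based index."""
    words = sorted({word for seq in sequences for word in seq})
    return {word: index + 1 for index, word in enumerate(words)}


def one_hot(word, dictionary):
    """Step (2): vector of length N with a single 1 at the word's index."""
    vector = [0] * len(dictionary)
    vector[dictionary[word] - 1] = 1
    return tuple(vector)


def sequence_vector(sequence, dictionary):
    """Step (3): set a 1 for every word that occurs in the sequence."""
    vector = [0] * len(dictionary)
    for word in sequence:
        vector[dictionary[word] - 1] = 1
    return tuple(vector)


sequences = [
    ["API1", "API2", "API7"],
    ["API3", "API2", "API5"],
    ["API4", "API2", "API6"],
]
dictionary = build_dictionary(sequences)
vectors = [sequence_vector(seq, dictionary) for seq in sequences]
```

Sorting is lexicographic, so this indexing only coincides with numeric order for single-digit suffixes such as those in the example; a real dictionary of API names would simply receive whatever stable order is chosen.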
With One-Hot encoding, the values of discrete features are extended into Euclidean space; in classification, distances or similarities between features are generally computed in Euclidean space. In the present invention, an API function call sequence is taken as input and converted into a numerical vector using One-Hot encoding. For discrete features used with a distance-based model, One-Hot encoding handles sparse features well.
4 Convolutional neural network training
The convolutional neural network is the model used in the "convolutional neural network training" stage of FIG. 1. Like the common multi-layer perceptron, the convolutional neural network is a feedforward network. The convolutional layer uses convolution filters to extract features from the data samples. In image processing, convolution filters are used primarily to identify features in images. Similarly, in text processing (e.g. sentence classification, search, recommendation), convolution filters can be used for information extraction and high-level feature detection on short texts. Because a log of malicious executable program instructions consists of word sequences drawn from a predefined dictionary, it clearly resembles a text document for modeling purposes; the application therefore draws on the use of convolutional neural networks in image and natural language processing to detect malicious software with a convolutional neural network.
FIG. 3 shows an overview of the convolutional neural network architecture. As shown in FIG. 3, the convolutional neural network CNN model mainly includes an input layer, a convolutional layer, a pooling layer and an output layer. The input layer receives the incoming feature vectors; the convolutional layer mainly converts the result processed by the word vector model into a two-dimensional matrix similar to an image; the pooling layer extracts the more salient features from the two-dimensional matrix and passes them to the output layer, whose output is finally input into a softmax function for classification to judge whether the sample is malicious software.
The implementation of the convolutional layer and the pooling layer is described in detail below.
4.1 Convolutional layer dynamic feature extraction
The convolutional layer is an important feature extraction layer of the whole network structure and has three main characteristics: local perception, weight sharing and multiple convolution kernels. The first two reduce dimensionality, and the last provides the specific operation for re-extracting features at different granularities. With local perception and weight sharing, the same convolution kernel extracts the same feature from different sub-matrices of the whole image. This has the defect of insufficient feature extraction, so the convolutional neural network introduces multiple convolution kernels: kernels with different weights extract different features from the input. For example, processing one image with 100 convolution kernels yields 100 feature matrices, so 100 features can be learned. The main calculation of the convolutional layer is described below. Its aim is to convert text information into picture-like features; in the present application, it converts the result of word vector model processing into a two-dimensional matrix similar to a picture.
Let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer, respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$. Let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer, respectively. The input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$. Denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; the $i$-th input feature matrix of the $l$-th layer is then also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$. The output calculation formula of the convolutional layer is as follows:
$$d_j^{(l)} = f\left( \sum_{i} d_i^{(l-1)} * w_{i,j}^{(l)} + b_j^{(l)} \right)$$
where $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, namely the sigmoid function. $w_{i,j}^{(l)}$ is the weight matrix in the convolution kernel connecting the $i$-th input feature matrix to the $j$-th output feature matrix, and $b_j^{(l)}$ is the bias term of the $j$-th output feature matrix of the $l$-th layer.
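The calculation of one convolutional layer can be illustrated with the following dependency-free Python sketch: each output feature matrix is the activation of the sum, over all input feature matrices, of the input filtered by its kernel, plus a bias term. The names correlate_valid and conv_layer, the nested-list data layout and the toy values are assumptions made here for illustration. As in most CNN implementations, the sliding-window product is computed without flipping the kernel, and only positions where the kernel fits entirely inside the input are evaluated.

```python
import math


def sigmoid(x):
    """Nonlinear activation f."""
    return 1.0 / (1.0 + math.exp(-x))


def correlate_valid(x, w):
    """Slide kernel w over matrix x and sum the element-wise products."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[r + a][c + b] * w[a][b] for a in range(kh) for b in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv_layer(inputs, weights, biases):
    """inputs[i]: i-th input matrix; weights[i][j]: kernel from input i to
    output j; biases[j]: bias of output j. Returns the output matrices."""
    outputs = []
    for j, bias in enumerate(biases):
        total = None
        for i, d_in in enumerate(inputs):
            part = correlate_valid(d_in, weights[i][j])
            if total is None:
                total = part
            else:
                total = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(total, part)]
        outputs.append([[sigmoid(v + bias) for v in row] for row in total])
    return outputs


# Hypothetical toy layer: one 3x3 input, one 2x2 kernel, zero bias.
d0 = [[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]]
w = [[[[1.0, 0.0], [0.0, 1.0]]]]
b = [0.0]
d1 = conv_layer(d0, w, b)
```

With these toy values the single output matrix is 2x2; windows whose filtered sum is 0 map to sigmoid(0) = 0.5.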
FIG. 4 shows the structure of the convolutional and pooling layers. With good feature extraction, the convolutional neural network architecture helps to distinguish the usage patterns of benign software from those of malware families; in particular, the convolution filters help to find higher-order local features that are invariant to small changes in the data.
As shown in FIG. 4, the input in the leftmost box is a sequence of API call functions. The next step generates a word vector matrix, features are re-extracted through the convolution filters and the pooling layer, and the most salient features are finally sent to a softmax classifier.
Each row of the input matrix represents one discrete API call function, and the filters slide over entire rows of the matrix, as in natural language processing applications. The filter width is set to 128, equal to the dimension of the API call function vector. The input sample matrix is convolved with multiple convolution kernels whose heights can be chosen arbitrarily. The processing is similar to the N-Gram algorithm: for example, a filter of height 3 extracts features from the dynamic behavior of 3 adjacent API calls.
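The row-wise sliding described above can be illustrated with a small Python sketch. The function name text_conv, the toy matrix of width 4 (the application uses width 128), the example kernel and the ReLU activation taken from the parameter table are assumptions made here for illustration only. A kernel of height h spans the full row width and produces one value per window of h adjacent API calls, so an input of n rows yields n - h + 1 values.

```python
def relu(x):
    """ReLU activation."""
    return max(0.0, x)


def text_conv(matrix, kernel, bias=0.0):
    """Slide a full-width kernel down the rows of the input matrix.

    Each output value summarizes len(kernel) adjacent API calls,
    in the manner of an N-Gram window.
    """
    height = len(kernel)
    width = len(matrix[0])
    if any(len(row) != width for row in kernel):
        raise ValueError("kernel width must equal the input row width")
    return [
        relu(
            sum(matrix[r + a][c] * kernel[a][c] for a in range(height) for c in range(width))
            + bias
        )
        for r in range(len(matrix) - height + 1)
    ]


# Hypothetical input: 6 API calls, each encoded as a row of width 4.
api_matrix = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
# Hypothetical height-3 kernel that responds most strongly to the first three rows.
kernel3 = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
feature_map = text_conv(api_matrix, kernel3)
# Output length for each filter height 3, 4, 5, 6 on a 6-row input.
lengths = [len(text_conv(api_matrix, [[0] * 4] * h)) for h in (3, 4, 5, 6)]
```

Taller filters see longer runs of API calls but produce shorter feature maps, which is one reason several filter heights are compared in the experiments.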
4.2 Secondary feature extraction by the pooling layer
In image recognition, images are sometimes too large and the number of training parameters needs to be reduced; the sole purpose of the pooling layer is to reduce the spatial size of the image. Pooling is performed independently on each depth dimension, so the depth of the image remains unchanged.
The pooling layer takes the local features extracted by the convolutional layer as input and further extracts the most salient features. There are two main forms: Max pooling and Average pooling. Max pooling takes the maximum value in the pooling window as the sampled value representing the region; Average pooling takes the weighted average value in the pooling window as the sampled value representing the region. This summary statistic reduces the dimension of the feature matrix and mitigates overfitting of the model, and the pooling layer also gives the feature matrix invariance to deformation.
In general, Max-pooling takes the maximum value over a small area. Assuming a pooling window size of 2x2, Max-pooling takes the maximum of each 2x2 block of the matrix on the left of the arrow, giving the 2x2 matrix on the right of the arrow:
[Figure: Max-pooling example]
Average-pooling averages a small area. Assuming a pooling window size of 2x2, as shown on the left of the arrow, the average of the top-left block is 7/4 and the average of the top-right block is 5/4; processing each block in turn gives the 2x2 matrix on the right of the arrow:
[Figure: Average-pooling example]
In the present invention, Max-pooling is applied to the convolution results, which reduces the output dimension while preserving the important global information captured by the filters: the maximum value of each column vector is taken, so that each column vector is converted to a single 1x1 value.
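Both pooling variants can be illustrated with the following Python sketch. The original example matrices are figures that are not reproduced in this text, so the 4x4 matrix below is hypothetical; it is chosen only so that the top-left and top-right 2x2 blocks average to 7/4 and 5/4 as stated above. The helper names pool2d and mean, and the sample feature map, are likewise assumptions for illustration.

```python
def pool2d(matrix, size, reduce):
    """Apply `reduce` to each non-overlapping size x size block."""
    return [
        [
            reduce([matrix[r + a][c + b] for a in range(size) for b in range(size)])
            for c in range(0, len(matrix[0]) - size + 1, size)
        ]
        for r in range(0, len(matrix) - size + 1, size)
    ]


def mean(values):
    """Unweighted average of a block."""
    return sum(values) / len(values)


# Hypothetical 4x4 input; top-left block sums to 7, top-right block to 5.
m = [
    [1, 2, 1, 0],
    [3, 1, 2, 2],
    [0, 4, 1, 1],
    [2, 2, 3, 0],
]
max_pooled = pool2d(m, 2, max)
avg_pooled = pool2d(m, 2, mean)

# Max pooling over a whole column vector of convolution results:
# the vector is reduced to a single 1x1 value.
feature_map = [3.0, 0.0, 1.0, 0.0]
pooled_feature = max(feature_map)
```

Taking the maximum over the whole feature map keeps the strongest filter response regardless of where in the API sequence it occurred.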
5 Malware detection
In the "malware detection" stage, the model trained in step 3 is verified. First, a batch of malicious PE files and normal PE files is downloaded; after data preprocessing, these files are fed directly into the trained model, and the detection model judges each test sample on the basis of what it learned during training, concluding whether the sample is malicious.
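The final judgment on a test sample, in which the pooled features are passed through the output layer and a softmax function, can be sketched as follows. The weights, biases, pooled feature values and label names are hypothetical stand-ins for a trained model and are not taken from the application.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, biases, labels=("normal", "malicious")):
    """Score each class from the pooled features and pick the most probable."""
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return labels[probs.index(max(probs))], probs


# Hypothetical pooled features and output-layer parameters.
pooled = [3.0, 0.5]
weights = [[-1.0, 0.2], [1.0, -0.2]]
biases = [0.0, 0.0]
label, probs = classify(pooled, weights, biases)
```

In this toy case the second class receives the higher score, so the sample is labeled malicious.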
6. Technical effects
The technical effect of the technical solution of the present application is described below by a specific example.
6.1 Data set and sandbox environment configuration
Malicious samples for Windows, the most widely used platform, are selected as the experimental objects. The Windows malicious PE file set is mainly downloaded from two public websites, namely https:// viral-analysis.com and www.malware-traffic-analysis.net, for a total of 2400 samples. In addition, a normal sample set is downloaded from the official 360 application store: 1000 samples across 16 types of software (such as browsers, file editors, office software and media players), downloaded in proportion to usage. The final data set therefore contains 3400 samples.
Programming and operating environment (computer A) of the convolutional neural network-based malware detection technique: deep learning framework TensorFlow, Win10 x64, CPU Core, 2.4 GHz, memory DDR4 2800 8 GB, SSD solid state disk 512 GB.
The Cuckoo sandbox environment (computer B) is configured as in table 1:
Table 1 Sandbox configuration environment
Figure BDA0002197938330000171
6.2.1 Comparison of CNN results under different filters
Experiment one: the main parameter values of the CNN are first adjusted:
Table 2 CNN parameter settings
Parameter    Value      Description
μ            0.001      Learning rate
Stride       1          Convolution kernel step size
Dropout      0.1        Discard rate
Activation   ReLU       Activation function
Bias         Constant   Bias term
Four filter heights (3, 4, 5 and 6) are set. FIG. 5 compares the Accuracy of the CNN classification results using One-Hot encoding with that of the CNN classification results using the Skip-gram model.
Across the different filter heights, the CNN model with One-Hot encoding achieves better Accuracy than the CNN model with Skip-gram. With a filter height of 6, the One-Hot CNN model gives its highest detection result, and the CNN is in its best state.
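To make the role of the filter height concrete, the sketch below computes how many positions a filter of each height visits along an API sequence, which is also the length of the column vector later reduced by Max-pooling. It assumes that no padding is applied (the source does not say), uses the stride of 1 from Table 2, and takes a hypothetical sequence length of 100; the function name `conv_output_length` is likewise illustrative.

```python
def conv_output_length(sequence_length, filter_height, stride=1):
    """Number of positions a filter visits when no padding is applied (assumed)."""
    if filter_height > sequence_length:
        return 0
    return (sequence_length - filter_height) // stride + 1


sequence_length = 100  # hypothetical number of API calls per sample
lengths = {h: conv_output_length(sequence_length, h) for h in (3, 4, 5, 6)}
# lengths == {3: 98, 4: 97, 5: 96, 6: 95}
```

A taller filter sees more consecutive API calls at once and produces a slightly shorter column vector; after Max-pooling, every filter contributes exactly one value regardless of its height.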
6.2.2 Comparison with other conventional machine learning algorithms
Experiment two: we perform three-fold cross-validation to estimate the results on new data. We randomly divide the data set into three partitions of equal size, train on two partitions and test on the remaining one; this process is repeated three times, leaving out a different partition for testing each time. We then average the three tests to obtain a reliable measure of the performance of the proposed convolutional neural network model over the entire data set. In addition, MLP, NaiveBayes and SVM are selected for comparison with CNN (One-Hot). The performance of the models is quantitatively evaluated using three indices: Accuracy, Recall and F1-score. The experimental results are shown in Table 3.
Table 3 Test results of different machine learning algorithms
Algorithm    Accuracy   Recall   F1-score
MLP          0.92       0.91     0.91
NaiveBayes   0.84       0.75     0.76
SVM          0.90       0.89     0.88
CNN          0.94       0.92     0.93
As can be seen from Table 3, the CNN model scores higher in Accuracy, Recall and F1-score than the other common machine learning algorithms compared; CNN therefore has an advantage over these algorithms.
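The evaluation procedure of Experiment two can be sketched in plain Python as follows. This is an illustrative outline, not the code behind Table 3: the function names, the fixed random seed and the convention that label 1 means malicious are assumptions, and the classifier itself is left out.

```python
import random


def k_fold_splits(n_samples, k=3, seed=0):
    """Yield (train_indices, test_indices) for k random, near-equal partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def binary_metrics(y_true, y_pred):
    """Return (accuracy, recall, f1), with label 1 assumed to mean malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1


def average(rows):
    """Average the per-fold metric tuples column by column."""
    return tuple(sum(column) / len(column) for column in zip(*rows))
```

In use, a model would be trained on `train` and evaluated on `test` for each of the three splits, and `average` would be applied to the three `binary_metrics` results to obtain one row of the table.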
6.2.3 Comparison with common antivirus software
Experiment three: we download another 100 malicious PE samples that do not overlap with those used in the preceding experiments. The samples are submitted to VirusTotal, and the detection rate R of the submitted samples is calculated for the antivirus software Clam AV, TotalDefense, ZoneAlarm and Malwarebytes, where R = n/t, n is the count of antivirus software that detects the sample as malicious, and t is the total number of antivirus software on VirusTotal. CNN (One-Hot) is used for detection and compared with Clam AV, TotalDefense, ZoneAlarm and Malwarebytes. The results are shown in Table 4.
Table 4 Comparison of detection rates for unknown malware
Antivirus software   R
Clam AV              64%
TotalDefense         52%
ZoneAlarm            82%
Malwarebytes         43%
CNN                  91%
As can be seen from Table 4, the detection rate of the CNN model for unknown malware exceeds 90%, which is higher than that of the other common antivirus software compared.

Claims (3)

1. A convolutional neural network CNN-based malware detection method, the method comprising the steps of:
step 1: collecting and analyzing a training set, wherein the training set is composed of normal software samples and malicious software samples, and the analyzing comprises running the training set through a Cuckoo sandbox to generate report files in json format;
step 2: extracting dynamic API sequences from the json-format reports, and vectorizing the extracted software features to obtain feature vectors;
step 3: constructing a convolutional neural network (CNN) model, feeding the feature vectors obtained in step 2 as input into the untrained CNN model for training and learning, training the CNN to an optimal state by adjusting parameters, and finally obtaining a trained CNN model;
step 4: processing the software to be tested in the same way as in steps 1 and 2 to obtain its feature vector, feeding the feature vector into the CNN model trained in step 3, and finally determining, through detection by the CNN model, whether the software to be tested is malicious software or normal software.
2. The malware detection method of claim 1, wherein step 2 specifically comprises: writing a Python script to extract API behaviors from the json report and generate a txt-format document, and converting the txt-format document into feature vectors through One-Hot encoding.
3. The malware detection method as claimed in claim 1 or 2, wherein the convolutional neural network CNN model in step 3 comprises an input layer, a convolutional layer, a pooling layer and an output layer, wherein the convolutional layer converts text information into picture features, and specifically comprises the following operations:
let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$; let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer respectively; the input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$; denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; the $i$-th input feature matrix of the $l$-th layer is then also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$; the output calculation formula of the convolutional layer is as follows:
Figure FDA0002197938320000021
wherein $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, namely the sigmoid function;
Figure FDA0002197938320000022
is the weight matrix connecting the $i$-th input feature matrix to the $j$-th output matrix in the convolution kernel;
Figure FDA0002197938320000023
is the offset term of the $j$-th output feature matrix of the $l$-th layer.
CN201910854560.6A 2019-09-10 2019-09-10 Convolutional neural network CNN-based malicious software detection method Pending CN110704840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910854560.6A CN110704840A (en) 2019-09-10 2019-09-10 Convolutional neural network CNN-based malicious software detection method

Publications (1)

Publication Number Publication Date
CN110704840A true CN110704840A (en) 2020-01-17

Family

ID=69195176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910854560.6A Pending CN110704840A (en) 2019-09-10 2019-09-10 Convolutional neural network CNN-based malicious software detection method

Country Status (1)

Country Link
CN (1) CN110704840A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376220A (en) * 2018-02-01 2018-08-07 东巽科技(北京)有限公司 A kind of malice sample program sorting technique and system based on deep learning
CN108614970A (en) * 2018-04-03 2018-10-02 腾讯科技(深圳)有限公司 Detection method, model training method, device and the equipment of Virus
CN109165510A (en) * 2018-09-04 2019-01-08 中国民航大学 Android malicious application detection method based on binary channels convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
云影实验室: "基于深度学习的恶意样本行为检测(含源码)", 《FREEBUF网络安全行业门户》 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11816214B2 (en) * 2020-01-31 2023-11-14 Palo Alto Networks, Inc. Building multi-representational learning models for static analysis of source code
US11783035B2 (en) 2020-01-31 2023-10-10 Palo Alto Networks, Inc. Multi-representational learning models for static analysis of source code
US20230185913A1 (en) * 2020-01-31 2023-06-15 Palo Alto Networks, Inc. Building multi-representational learning models for static analysis of source code
CN111461343A (en) * 2020-03-13 2020-07-28 北京百度网讯科技有限公司 Model parameter updating method and related equipment thereof
CN111461343B (en) * 2020-03-13 2023-08-04 北京百度网讯科技有限公司 Model parameter updating method and related equipment thereof
CN111368304B (en) * 2020-03-31 2022-07-05 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment
CN111368304A (en) * 2020-03-31 2020-07-03 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment
US11775757B2 (en) 2020-05-04 2023-10-03 International Business Machines Corporation Automated machine-learning dataset preparation
CN112347478B (en) * 2020-10-13 2021-08-24 北京天融信网络安全技术有限公司 Malicious software detection method and device
CN112347478A (en) * 2020-10-13 2021-02-09 北京天融信网络安全技术有限公司 Malicious software detection method and device
CN112507330B (en) * 2020-11-04 2022-06-28 北京航空航天大学 Malicious software detection system based on distributed sandbox
CN112507330A (en) * 2020-11-04 2021-03-16 北京航空航天大学 Malicious software detection system based on distributed sandbox
CN112464234B (en) * 2020-11-21 2024-04-05 西北工业大学 Malicious software detection method based on SVM on cloud platform
CN112464234A (en) * 2020-11-21 2021-03-09 西北工业大学 SVM-based malicious software detection method on cloud platform
CN112632541A (en) * 2020-12-29 2021-04-09 网神信息技术(北京)股份有限公司 Method and device for determining malicious degree of behavior, computer equipment and storage medium
CN112965789B (en) * 2021-03-25 2024-05-03 绿盟科技集团股份有限公司 Virtual machine memory space processing method, device, equipment and medium
CN112965789A (en) * 2021-03-25 2021-06-15 绿盟科技集团股份有限公司 Virtual machine memory space processing method, device, equipment and medium
CN113221109B (en) * 2021-03-30 2022-06-28 浙江工业大学 Intelligent malicious file analysis method based on generation countermeasure network
CN113221109A (en) * 2021-03-30 2021-08-06 浙江工业大学 Intelligent malicious file analysis method based on generation countermeasure network
CN112966272B (en) * 2021-03-31 2022-09-09 国网河南省电力公司电力科学研究院 Internet of things Android malicious software detection method based on countermeasure network
CN112966272A (en) * 2021-03-31 2021-06-15 国网河南省电力公司电力科学研究院 Internet of things Android malicious software detection method based on countermeasure network
CN113139185B (en) * 2021-04-13 2023-09-05 北京建筑大学 Malicious code detection method and system based on heterogeneous information network
CN113139185A (en) * 2021-04-13 2021-07-20 北京建筑大学 Malicious code detection method and system based on heterogeneous information network
CN113139187A (en) * 2021-04-22 2021-07-20 北京启明星辰信息安全技术有限公司 Method and device for generating and detecting pre-training language model
CN113139187B (en) * 2021-04-22 2023-12-19 北京启明星辰信息安全技术有限公司 Method and device for generating and detecting pre-training language model
CN113420293A (en) * 2021-06-22 2021-09-21 北京计算机技术及应用研究所 Android malicious application detection method and system based on deep learning
CN113378171A (en) * 2021-07-12 2021-09-10 东北大学秦皇岛分校 Android lasso software detection method based on convolutional neural network
CN114139153A (en) * 2021-11-02 2022-03-04 武汉大学 Graph representation learning-based malware interpretability classification method
CN114510717A (en) * 2022-01-25 2022-05-17 上海斗象信息科技有限公司 ELF file detection method and device and storage medium
CN114692156B (en) * 2022-05-31 2022-08-30 山东省计算中心(国家超级计算济南中心) Memory segment malicious code intrusion detection method, system, storage medium and equipment
CN114692156A (en) * 2022-05-31 2022-07-01 山东省计算中心(国家超级计算济南中心) Memory segment malicious code intrusion detection method, system, storage medium and equipment
CN115438805A (en) * 2022-11-08 2022-12-06 江苏智云天工科技有限公司 Product defect detection method based on machine learning model in industrial quality inspection field
CN116226854A (en) * 2023-05-06 2023-06-06 江西萤火虫微电子科技有限公司 Malware detection method, system, readable storage medium and computer
CN116975863A (en) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network

Similar Documents

Publication Publication Date Title
CN110704840A (en) Convolutional neural network CNN-based malicious software detection method
Chawla et al. Host based intrusion detection system with combined CNN/RNN model
Kalash et al. Malware classification with deep convolutional neural networks
CN111027069B (en) Malicious software family detection method, storage medium and computing device
Pinhero et al. Malware detection employed by visualization and deep neural network
CN109829306B (en) Malicious software classification method for optimizing feature extraction
Almomani et al. An automated vision-based deep learning model for efficient detection of android malware attacks
CN111382438B (en) Malware detection method based on multi-scale convolutional neural network
Kan et al. Towards light-weight deep learning based malware detection
CN110363003B (en) Android virus static detection method based on deep learning
CN114692156B (en) Memory segment malicious code intrusion detection method, system, storage medium and equipment
Mourtaji et al. Intelligent framework for malware detection with convolutional neural network
CN112437053B (en) Intrusion detection method and device
CN112464232A (en) Android system malicious software detection method based on mixed feature combination classification
Kakisim et al. Sequential opcode embedding-based malware detection method
Park et al. Birds of a feature: Intrafamily clustering for version identification of packed malware
Sharif et al. A deep learning based technique for the classification of malware images
CN113591962B (en) Network attack sample generation method and device
Wang et al. Malware detection using cnn via word embedding in cloud computing infrastructure
CN111383716B (en) Screening method, screening device, screening computer device and screening storage medium
Waghmare et al. A review on malware detection methods
CN111079143B (en) Trojan horse detection method based on multi-dimensional feature map
Albahar et al. Toward Robust Classifiers for PDF Malware Detection.
Baig et al. Malware Detection and Classification along with Trade-off Analysis for Number of Features, Feature Types, and Speed
Jiang et al. A pyramid stripe pooling-based convolutional neural network for malware detection and classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117
