CN111382438B - Malware detection method based on multi-scale convolutional neural network - Google Patents
- Publication number
- CN111382438B (application CN202010231067.1A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- executable file
- convolutional neural
- binary executable
- byte
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a malware detection method based on a multi-scale convolutional neural network. The binary executable file of each training sample is first converted into a fixed-length hexadecimal character sequence, which is then converted into low-dimensional vectors through word embedding. These vectors are input into a multi-scale convolutional neural network to train a detection model. Finally, the binary executable file of the software to be detected is converted into low-dimensional vectors by the same steps and input into the trained detection model, which classifies the software and outputs a detection result. The multi-scale convolutional neural network uses a plurality of parallel feature extraction channels, each consisting of a one-dimensional convolutional layer, a pooling layer and a first Dropout layer connected in sequence. The method addresses the low accuracy of existing malware detection methods.
Description
Technical Field
The invention belongs to the technical field of malware protection, and relates to a malware detection method based on a multi-scale convolutional neural network.
Background
The Internet is profoundly changing how people work and live. While the world benefits from the growth of networks, it is also deeply affected by network attacks, and cyberspace security has become a serious global challenge. According to Symantec's 2017 Internet Security Threat Report, the company captured 401 million new malware samples in 2016, which corresponds to about 1.09 million new malware samples released onto the Internet every day. Malware in such numbers has become the biggest security threat on the Internet and seriously affects the information security of countries around the world.
Malware refers to any software that harms the interests of users. It is a generic term for various hostile or intrusive software, including viruses, worms, Trojan horses, rootkits, backdoors, botnets, spyware, and the like. Malware may affect not only an infected computer or device, but also other devices that communicate with the infected device. Accurate identification and detection of malware is therefore critical to network information security.
Signature-based methods are widely used in current malware detection systems. They obtain a signature by extracting a particular byte sequence from a binary program and can effectively detect known malware. However, this traditional approach cannot identify unknown malware types or new malware produced by simply packing or obfuscating known malware. Moreover, malware that uses polymorphic and metamorphic techniques continuously and randomly changes the content of its binary file as it propagates, so it has no fixed signature and cannot be detected by signature-based methods. In addition, the speed at which analysts can manually extract virus signatures no longer keeps up with the growth of malware. All of this poses serious challenges to malware protection. Researchers have therefore proposed a variety of malware detection methods based on data mining and machine learning. These methods represent executable files as features at different levels of abstraction and use the features to train classifiers that can detect unknown malware. Depending on how the detection process is carried out, they are generally divided into static methods and dynamic methods. Static methods directly analyze the byte code sequences, system call functions, control flow graphs, opcode sequences, etc. of samples without running the executable files. Static methods provide a safer detection environment and faster detection, but they are susceptible to packing and obfuscation techniques and generally require unpacking, decryption and normalization before analysis, which lowers detection accuracy and efficiency.
Dynamic methods run a malware sample in a controlled environment (a virtual machine, an emulator, a sandbox, etc.), analyze the interaction between the sample and the system, record its system call sequences, system call parameters, executed instruction sequences, information flows and so on, and from these identify the sample's malicious behavior. Dynamic methods can accurately identify the nature of malicious behavior and remain effective for packed, metamorphic, polymorphic and obfuscated samples. However, dynamic detection generally consumes more time and system resources, is strongly affected by the runtime environment, and cannot traverse all executable paths of the software, so the reliability of its detection results is low.
Disclosure of Invention
The embodiments of the invention aim to provide a malware detection method based on a multi-scale convolutional neural network that solves two problems. First, existing static malware detection methods based on data mining and machine learning are easily affected by packing and obfuscation techniques and require unpacking, decryption and normalization before analysis, which lowers detection accuracy and efficiency. Second, existing dynamic malware detection methods based on data mining and machine learning consume more time and system resources and cannot traverse all executable paths of the malware, which lowers the reliability of the detection results.
The technical scheme adopted by the embodiment of the invention is that the malicious software detection method based on the multi-scale convolutional neural network is carried out according to the following steps:
Step S1: converting the binary executable file of a training sample into a hexadecimal character sequence with a fixed length;
Step S2: converting the hexadecimal character sequence with a fixed length into a low-dimensional vector through word embedding;
Step S3: inputting the low-dimensional vector generated by conversion into a multi-scale convolutional neural network, and training a detection model based on the multi-scale convolutional neural network;
Step S4: converting the binary executable file of the software to be detected into a low-dimensional vector according to steps S1-S2, inputting the low-dimensional vector into the detection model based on the multi-scale convolutional neural network obtained by training in step S3, classifying the software to be detected, and outputting a detection result, wherein the detection result is malicious software or benign software;
The multi-scale convolutional neural network adopts a plurality of parallel feature extraction channels, and each feature extraction channel consists of a one-dimensional convolutional layer, a pooling layer and a first Dropout layer which are sequentially connected.
Further, the specific implementation process of converting the binary executable file of the training sample into the hexadecimal character sequence with the fixed length in the step S1 is as follows:
First, a byte threshold is set, and the binary executable file of each training sample is processed to a fixed length according to this threshold: for a file whose byte length is greater than the byte threshold, the trailing bytes are discarded; for a file whose byte length is smaller than the byte threshold, space characters are padded at the end, so that the byte length of the binary executable file of every training sample equals the byte threshold;
Then, the character of each byte of the fixed-length binary executable file of the training sample is encoded, converting the character of each byte into an integer index from 1 to 257, so that the hexadecimal character sequence with fixed length is obtained.
Further, in step S2, the word embedding that converts the fixed-length hexadecimal character sequence into a low-dimensional vector is performed with the word2vec model;
The byte threshold is set to 3000.
Furthermore, the output ends of the plurality of parallel feature extraction channels are provided with splicing layers, and effective features extracted by the plurality of parallel feature extraction channels are spliced;
In step S4, the detection model based on the multi-scale convolutional neural network obtained in step S3 classifies the software to be detected and outputs the detection result as follows:
First, the low-dimensional vector generated by conversion is input into the detection model based on the multi-scale convolutional neural network. The plurality of parallel one-dimensional convolutional layers slide over the low-dimensional vector simultaneously to carry out convolution operations. The features extracted by the parallel convolutional layers then pass in sequence through the pooling layer, the first Dropout layer and the splicing layer, where they are spliced, yielding the effective features of the binary executable file of the software to be detected;
Then, a full connection layer nonlinearly combines the extracted effective features of the binary executable file of the software to be detected to obtain the detection result.
Further, a second Dropout layer is arranged between the full-connection layer and the splicing layer.
Furthermore, the multi-scale convolutional neural network adopts 3 parallel feature extraction channels to perform effective feature extraction.
Further, 56 convolution kernels are used for the one-dimensional convolution layers of the 3 parallel feature extraction channels, and the window sizes of the convolution kernels are 9, 11 and 13 respectively, and the step sizes are 1.
Further, the step sizes of the pooling layers of the 3 parallel feature extraction channels are 9, 11 and 13 respectively.
Further, the full connection layer is provided with 16 neurons.
Furthermore, the pooling layers all adopt maximum pooling.
The beneficial effects of the embodiments of the invention are as follows. A malware detection method based on a multi-scale CNN is provided. Because a convolutional neural network cannot process a raw binary file directly and accepts only numeric features as input, the embodiments first convert the byte sequence of the binary file into a hexadecimal character sequence and then use word embedding to convert each byte character in that sequence into a low-dimensional vector of fixed length, turning the binary file into numeric raw features that the multi-scale convolutional neural network can process. This solves the problem that existing static malware detection methods based on data mining and machine learning are easily affected by packing and obfuscation techniques and require unpacking, decryption and normalization before analysis, which lowers detection accuracy and efficiency; the adaptability and accuracy of the detection method are thereby effectively improved.
The multi-scale convolutional neural network learns effective feature representations directly from binary files and has strong pattern expression capability. It removes the dependence on feature engineering, learns the feature representation of malware automatically, helps to discover potential security threats, requires no expert knowledge of malware detection, and avoids the laborious feature engineering of traditional machine learning methods. It also solves the problem that existing dynamic detection methods based on data mining and machine learning consume more time and system resources and cannot traverse all executable paths of the malware, which lowers the reliability of the detection results. The detection rate of malware is thereby improved and the false alarm rate is reduced. Convolution kernels of different scales extract features at different granularities. A plurality of parallel convolutional layers carry out convolution operations with different window sizes, and the resulting features are then combined, so that richer and more complete feature information at different scales is learned from the data and the accuracy of malware detection is improved: the detection accuracy reaches 98.18%, the logarithmic loss value is 0.1503, and the AUC value is 0.997.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a convolutional neural network structure.
Fig. 2 is a schematic illustration of convolution operation of a one-dimensional convolutional neural network model.
Fig. 3 is a schematic diagram of a multi-scale convolutional neural network architecture.
Fig. 4 is a graph of the accuracy variation of a multi-scale convolutional neural network model.
Fig. 5 is a graph of log-loss variation for a multi-scale convolutional neural network model.
Fig. 6 is a schematic diagram of a confusion matrix for a multi-scale convolutional neural network model.
Fig. 7 is a schematic diagram of a normalized confusion matrix for a multi-scale convolutional neural network model.
Fig. 8 is a ROC graph of a multi-scale convolutional neural network model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The convolutional neural network (CNN) is a representative deep learning algorithm. It is a feed-forward neural network whose pattern of connections between neurons is inspired by the organization of the animal visual cortex. It uses convolution operations for representation learning and, through its hierarchical structure, extracts features from the input that are invariant to translation. The structure of a common convolutional neural network is shown in Fig. 1 and mainly comprises a convolutional layer, a pooling layer, a fully connected layer, a Dropout layer and the like. The convolutional layer extracts features from the input data. It contains a plurality of convolution kernels, each of which has its own weight coefficients and a bias. Each neuron in the convolutional layer is connected to a plurality of neurons in a nearby region of the preceding layer, and the size of that region depends on the size of the convolution kernel. In operation, the convolution kernel sweeps over the input features at regular intervals, computes the dot product with the input features inside its receptive field, and adds the bias to produce the output matrix. Pooling is a form of nonlinear downsampling. The pooling layer contains a predefined pooling function (e.g., minimum, maximum, or average) that replaces the result at a single point of the feature map with a statistic of its neighboring region. The pooling layer progressively reduces the spatial dimensions of the features, so the number of parameters and the amount of computation also decrease, which to some extent also controls overfitting.
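The downsampling effect of max pooling described above can be illustrated with a minimal pure-Python sketch. The function name `max_pool_1d` and the sample feature map are hypothetical and chosen only for this illustration.

```python
def max_pool_1d(values, pool_size, stride):
    """Slide a window of `pool_size` over `values` and keep each window's maximum."""
    if pool_size <= 0 or stride <= 0:
        raise ValueError("pool_size and stride must be positive")
    pooled = []
    # Only full windows are used; trailing values that do not fill a window are dropped.
    for start in range(0, len(values) - pool_size + 1, stride):
        pooled.append(max(values[start:start + pool_size]))
    return pooled


feature_map = [1, 3, 2, 5, 4, 0, 7, 6]  # illustrative values only
pooled = max_pool_1d(feature_map, pool_size=2, stride=2)
print(pooled)  # [3, 5, 4, 7]
```

With a window and stride of 2, the eight input values shrink to four, which is the reduction in feature dimension that the pooling layer provides.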
The fully connected layer is located at the final part of the convolutional neural network and nonlinearly combines the extracted features to obtain an output. In other words, the fully connected layer itself is not expected to extract features; it tries to accomplish the learning objective using the high-order features already available. The Dropout layer approximates averaging the predictions of many different networks: during training it randomly discards neurons (and their connections) from the neural network to prevent neurons from co-adapting too much. The effect of the Dropout layer is similar to training many different neural networks and then averaging their outputs. Since these networks may overfit in different ways, the averaged effect obtained through the Dropout layer can reduce overfitting.
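The random discarding of neurons during training can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `dropout`, the rate of 0.5 and the rescaling of surviving activations ("inverted dropout", a common convention) are all assumptions of this sketch.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) ("inverted dropout", an assumption
    of this sketch) so the expected value of each activation is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # at inference time nothing is dropped
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]    # illustrative activations
dropped = dropout(hidden, rate=0.5, rng=rng)
inference = dropout(hidden, rate=0.5, rng=rng, training=False)
```

Each training iteration draws a different random mask, which is what makes the procedure resemble training many thinned networks.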
A one-dimensional convolutional neural network is a convolutional neural network whose convolution runs along a single dimension: a one-dimensional convolution kernel scans the input sequence from beginning to end and performs the convolution operation to find effective features of the sequence. As shown in Fig. 2, the input sequence is represented by a one-dimensional 7×1 input vector (1, 2, -1, -2, 1), which is convolved with a one-dimensional 3×1 convolution kernel with weight vector (1, 0, -1). The convolution kernel has a window size of 3 and a step size of 1; it moves over the 7×1 input sequence performing the convolution operation and generates a 5×1 output vector (-1, 2, 1, 1, 0) as it passes over the input sequence.
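The sliding-window operation can be sketched in a few lines of Python. The kernel (1, 0, -1), window size 3 and step size 1 come from the text; the 7-element input below is an illustrative choice for this sketch and is not a value taken from Fig. 2. As in most deep learning frameworks, the kernel is applied without flipping (cross-correlation), which is also an assumption here.

```python
def conv1d_valid(sequence, kernel, stride=1):
    """Slide `kernel` over `sequence` (no padding) and return the dot products."""
    window = len(kernel)
    outputs = []
    for start in range(0, len(sequence) - window + 1, stride):
        segment = sequence[start:start + window]
        outputs.append(sum(x * w for x, w in zip(segment, kernel)))
    return outputs


example_input = [1, 2, 2, 0, 1, -1, 1]  # hypothetical 7x1 input for illustration
kernel = [1, 0, -1]                     # 3x1 kernel from the text
output = conv1d_valid(example_input, kernel, stride=1)
print(output)  # [-1, 2, 1, 1, 0]
```

A length-7 input with a window of 3 and a step of 1 yields 7 - 3 + 1 = 5 outputs, matching the 5×1 output size stated in the text.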
The multi-scale CNN-based malware detection method builds on the single-scale CNN. Its framework is shown in Fig. 3 and mainly comprises data preprocessing, word embedding and a multi-scale one-dimensional convolutional neural network. Data preprocessing converts the input binary executable file into numeric raw features that a convolutional neural network can process. Word embedding converts the raw numeric features into a representation with stronger semantic information and effectively reduces the feature dimension. The multi-scale one-dimensional convolutional neural network extracts high-level abstract features at different scales, so that features of different scales complement and reinforce one another, trains a detection model with strong pattern expression capability, and finally detects unknown malware. The method specifically comprises the following steps:
Step S1: the input binary executable file is converted into a hexadecimal character sequence through data preprocessing, specifically as follows:
Binary executable files vary in size, whereas a CNN requires fixed-size input. Therefore, a byte threshold is first set, and each input binary executable file is processed to a fixed length according to this threshold: for files whose byte length is greater than the byte threshold, the trailing bytes are discarded; for files whose byte length is smaller than the byte threshold, space characters are padded at the end, so that the byte length of every binary executable file equals the byte threshold. In this embodiment the first 3000 bytes of each binary executable file are used: files longer than 3000 bytes are truncated and files shorter than 3000 bytes are padded with space characters, so that every binary executable file is 3000 bytes long.
Then, each byte of the fixed-length binary executable file is encoded. Each byte character has 257 possible values (including the padding space character), and the character of each byte is converted into an integer index from 1 to 257, so that each binary executable sample is encoded as a hexadecimal character sequence of fixed length (3000). The length of 3000 was chosen through repeated experiments and effectively ensures detection accuracy. A longer sequence can improve detection accuracy to some extent but lowers processing efficiency, while a shorter sequence improves processing efficiency but reduces detection accuracy.
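The preprocessing of step S1 can be sketched as follows. The text fixes the threshold (3000) and the index range (1 to 257) but not the exact mapping, so the choices below are assumptions of this sketch: byte value `b` is mapped to index `b + 1`, and the padding symbol is given the separate index 257. The function name `encode_executable` is hypothetical.

```python
PAD_INDEX = 257  # assumption: padding gets its own index, distinct from all byte values


def encode_executable(raw: bytes, byte_threshold: int = 3000) -> list:
    """Truncate or pad `raw` to `byte_threshold` symbols and map them to 1..257."""
    head = raw[:byte_threshold]                 # discard bytes beyond the threshold
    indices = [b + 1 for b in head]             # byte values 0..255 -> indices 1..256
    indices.extend([PAD_INDEX] * (byte_threshold - len(head)))  # pad short files
    return indices


sample = bytes([0x4D, 0x5A, 0x90, 0x00])        # illustrative 4-byte input
encoded = encode_executable(sample, byte_threshold=6)
print(encoded)  # [78, 91, 145, 1, 257, 257]
```

With the default threshold, every file yields exactly 3000 indices regardless of its original size.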
Step S2: the integer index sequence, i.e. the hexadecimal character sequence, carries no special meaning by itself, and a convolutional neural network is not well suited to discrete data. The character of each byte of the hexadecimal character sequence is therefore mapped to a low-dimensional vector through word embedding, forming a vector space that a convolutional neural network can process easily. In this space the character of each byte carries certain semantics, and the relationships between the original samples in the semantic space are preserved. Word embedding is a feature learning technique from natural language processing that converts a word into a vector representation of fixed length, which makes mathematical processing convenient.
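The lookup step of word embedding can be sketched as a table indexed by the integer codes. This sketch shows only the mapping from index to vector; the table is filled with random numbers as a stand-in for vectors learned by word2vec, and the embedding dimension of 8 and all names are hypothetical.

```python
import random

VOCAB_SIZE = 257   # indices 1..257, as in the text
EMBED_DIM = 8      # hypothetical embedding dimension for illustration


def build_embedding_table(vocab_size, dim, seed=0):
    """Return a (vocab_size + 1) x dim table; row 0 is unused so index i maps to row i.

    Random values stand in for trained word2vec vectors (assumption of this sketch).
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]


def embed(indices, table):
    """Replace every integer index with its low-dimensional vector."""
    return [table[i] for i in indices]


table = build_embedding_table(VOCAB_SIZE, EMBED_DIM)
vectors = embed([78, 91, 145, 1, 257], table)   # example index sequence
print(len(vectors), len(vectors[0]))            # 5 8
```

A sequence of 3000 indices would become a 3000 × `EMBED_DIM` matrix, which is the input over which the one-dimensional convolution kernels slide.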
Step S3: the low-dimensional vectors generated by the conversion are input into the multi-scale convolutional neural network, and the multi-scale CNN learns effective features of the binary executable files of the training samples, thereby training a detection model based on the multi-scale convolutional neural network. The multi-scale CNN uses a plurality of parallel feature extraction channels, each consisting of a one-dimensional convolutional layer, a pooling layer and a first Dropout layer connected in sequence, and a splicing layer is arranged between the plurality of parallel feature extraction channels and the fully connected layer.
The multi-scale CNN detection model of this embodiment differs from a single-scale CNN detection model. Instead of stacking layers one after another, with the output of each layer serving as the input of the next, it uses a plurality of parallel feature extraction channels, here 3. The one-dimensional convolutional layers at the 3 scales operate simultaneously. Each of the 3 one-dimensional convolutional layers uses 56 convolution kernels, and the layers differ only in the window sizes of their kernels, which are 9, 11 and 13 respectively. The kernels of the 3 layers thus slide over the low-dimensional vectors generated by word embedding with window sizes of 9, 11 and 13 and a step size of 1 to perform the convolution operations. To ensure detection accuracy, a pooling layer and a first Dropout layer are connected after each of the three parallel one-dimensional convolutional layers to form a feature extraction channel. Pooling reduces the dimension of the effective features; as shown in Fig. 3, the pooling layers use maximum pooling. The first Dropout layer is added to prevent overfitting by randomly and temporarily disconnecting some neurons in the network in each iteration. Finally, a splicing layer splices the features of the three channels.
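The size of the spliced feature vector can be estimated with a little arithmetic. The sequence length (3000), kernel count (56), window sizes and pooling step sizes (9, 11, 13) come from the text; the rest are assumptions of this sketch: the convolution uses no padding, each pooling window is as wide as its step size, and each channel is flattened before splicing.

```python
SEQUENCE_LENGTH = 3000
NUM_KERNELS = 56
WINDOW_SIZES = (9, 11, 13)   # convolution window per channel
POOL_STRIDES = (9, 11, 13)   # pooling step size per channel


def output_length(input_length, window, stride):
    """Number of full windows of size `window` taken every `stride` positions."""
    return (input_length - window) // stride + 1


def spliced_feature_count(seq_len, num_kernels, windows, pool_strides):
    """Return per-channel (conv_len, pooled_len) pairs and the spliced vector size."""
    per_channel = []
    total = 0
    for window, pool_stride in zip(windows, pool_strides):
        conv_len = output_length(seq_len, window, stride=1)
        # Assumption: pool window equals pool stride (non-overlapping pooling).
        pooled_len = output_length(conv_len, pool_stride, stride=pool_stride)
        per_channel.append((conv_len, pooled_len))
        total += pooled_len * num_kernels
    return per_channel, total


channels, total_features = spliced_feature_count(
    SEQUENCE_LENGTH, NUM_KERNELS, WINDOW_SIZES, POOL_STRIDES)
print(channels)        # [(2992, 332), (2990, 271), (2988, 229)]
print(total_features)  # 46592
```

Under these assumptions the three channels contribute 332, 271 and 229 pooled positions per kernel, so the splicing layer passes 46,592 values on to the fully connected part of the network.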
S4, converting the binary executable file of the software to be detected into a low-dimensional vector according to the above method, inputting the low-dimensional vector into the trained detection model based on the multi-scale convolutional neural network, classifying the software to be detected, and outputting a detection result, which specifically comprises the following steps:
First, the low-dimensional vector generated by conversion is input into the detection model based on the multi-scale convolutional neural network, and the plurality of parallel one-dimensional convolution layers slide over the low-dimensional vector simultaneously to perform convolution operations. The features extracted by each parallel convolution layer pass through a pooling layer and a first Dropout layer in turn and are then spliced by the splicing layer, thereby extracting the effective features of the binary executable file of the software to be detected;
Then, a fully connected layer nonlinearly combines the extracted effective features of the binary executable file of the software to be detected to obtain the detection result, which is either malicious software or benign software. A second Dropout layer is arranged between the splicing layer and the fully connected layer to prevent overfitting.
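The conversion that precedes the model (step S1 as recited in the claims, with a byte threshold of 3000 and integer indices from 1 to 257) can be sketched as follows. The exact mapping is an assumption: each byte value b is mapped to b + 1, and index 257 is reserved for padding. The source states only the index range and that short files are padded, not which index the padding receives. The function name `encode_executable` is hypothetical.

```python
BYTE_THRESHOLD = 3000   # fixed length stated in the claims
PAD_INDEX = 257         # assumed padding index (not specified in the source)


def encode_executable(data: bytes, threshold: int = BYTE_THRESHOLD) -> list:
    """Truncate or pad a byte string to `threshold` entries and map bytes to 1..256."""
    indices = [b + 1 for b in data[:threshold]]            # discard bytes beyond the threshold
    indices += [PAD_INDEX] * (threshold - len(indices))    # pad files shorter than the threshold
    return indices


sample = encode_executable(b"MZ\x90\x00", threshold=6)
print(sample)
```

The resulting fixed-length index sequence is what the word-embedding step turns into low-dimensional vectors.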
Experimental results and analysis
(1) Software sample selection:
The experimental evaluation used malware and benign software samples from different periods: 7871 benign software samples and 8269 malware samples. Of the malware samples, 4103 were discovered before 2011 and 4166 were newly discovered in recent years. Of the benign samples, 3918 were collected from a freshly installed Windows XP SP3 system and 3953 from a freshly installed 32-bit Windows 7 Professional system. All malware samples were collected from the VXHeavens website, and all samples are in Windows PE format. The composition of the dataset is shown in table 1.
Table 1 software sample statistics
Category | Malware samples | Benign software samples |
---|---|---|
Early samples | 4103 | 3918 |
Recent samples | 4166 | 3953 |
Total | 8269 | 7871 |
(2) Evaluation indexes and methods:
Classification performance is evaluated mainly with two indicators: accuracy and log loss. Accuracy is the proportion of correctly predicted samples among all predicted samples; on its own it is often insufficient to evaluate the robustness of the predictions, so log loss is used as well. The logarithmic loss (Logarithmic Loss), also known as cross-entropy loss (Cross-entropy Loss), is defined on probability estimates and measures the gap between the predicted class and the true class. Minimizing the log loss is substantially equivalent to maximizing the accuracy of the classifier, and a perfect classifier has a log loss of 0. The logarithmic loss function is calculated as follows:
$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$$
where Y is the output variable, that is, the detection result for the software to be detected; X is the input variable, that is, the binary executable file of the software to be detected; L is the loss function; N is the number of test samples (binary executable files of the software to be detected); y_ij is a binary indicator that equals 1 if the i-th test sample belongs to class j and 0 otherwise, where class j is either benign software or malicious software; p_ij is the predicted probability that the i-th test sample belongs to class j; and M is the total number of classes, with M = 2 in this embodiment.
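As a worked illustration of the logarithmic loss defined above, the following minimal sketch averages the cross-entropy over N samples and M = 2 classes. The clipping constant `eps`, the class order and the example probabilities are assumptions made for the illustration and are not taken from the experiments.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean cross-entropy over N samples and M classes.

    y_true: one-hot rows (y_ij); y_prob: predicted probabilities (p_ij).
    """
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
            total += y * math.log(p)
    return -total / len(y_true)


y_true = [[1, 0], [0, 1]]                     # assumed class order: [benign, malware]
confident = [[0.9, 0.1], [0.2, 0.8]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(log_loss(y_true, confident), log_loss(y_true, uncertain))
```

Both sets of predictions classify the two samples correctly at a 0.5 threshold, yet the uncertain one has the larger loss, which is why log loss complements accuracy.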
The performance of the classifier can also be evaluated with a ROC (Receiver Operating Characteristic) curve, whose vertical axis is the detection rate (True Positive Rate) and whose horizontal axis is the false positive rate (False Positive Rate); it reflects how the detection rate and the false positive rate trade off as the detection threshold changes. The area under the ROC curve (Area Under ROC Curve, AUC) is a comprehensive index for evaluating a classifier. The AUC value is typically between 0.5 and 1.0, and a larger AUC value generally indicates better classifier performance.
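The two quantities can be sketched in a few lines of Python. The sketch relies on the standard equivalence between AUC and the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties counted as one half). Treating malware as the positive class, the function names and the example scores are all assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 1 for malware (positive), 0 for benign; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_point(labels, scores, threshold):
    """(false positive rate, true positive rate) at one detection threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return fp / labels.count(0), tp / labels.count(1)


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(roc_auc(labels, scores), roc_point(labels, scores, 0.45))
```

Sweeping `threshold` over all score values and plotting the resulting points traces out the ROC curve.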
(3) Hyperparameter tuning:
In a machine learning model, parameters that must be selected manually are called hyperparameters. The performance of a CNN is strongly affected by its hyperparameters, and an improper choice can cause underfitting or overfitting. GridSearchCV is a commonly used tool in the sklearn library for searching for a model's optimal parameters; its name combines grid search (GridSearch) and cross-validation (CV). GridSearchCV steps through the parameter values within a specified range, trains the learner with each setting, evaluates it by cross-validation, and selects the setting with the highest accuracy on the validation set. In this embodiment, GridSearchCV is used to search for and tune the hyperparameters of the convolutional neural network, and the tuning results are shown in table 2.
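The search strategy itself can be sketched without any library. The code below is a simplified stand-in for the idea behind grid search, not sklearn's GridSearchCV implementation: it enumerates every combination of the listed values and keeps the one with the best score. The function `grid_search` and the scoring function `toy_score` are hypothetical; in practice the score would come from training the model and cross-validating it.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid` and keep the best-scoring one.

    score_fn(params) is expected to return a validation score (higher is better),
    for example the mean accuracy over cross-validation folds.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Two of the ranges from Table 2; the scoring function is a made-up stand-in
# for "train the model and cross-validate", used only to make the sketch runnable.
grid = {"dropout": [0.1, 0.2, 0.3, 0.4, 0.5], "batch_size": [10, 20, 40, 60, 80, 100]}


def toy_score(params):
    return 1.0 - abs(params["dropout"] - 0.1) - abs(params["batch_size"] - 40) / 100


best, score = grid_search(grid, toy_score)
print(best, score)
```

Because the number of combinations is the product of the option counts, an exhaustive search grows quickly as more hyperparameters are added to the grid.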
Table 2 Multi-scale CNN hyperparameter tuning results
Hyperparameter | Options or range | Selected value |
---|---|---|
Embedding output dim | {16,24,32,40,48,56,64} | 40 |
Kernel_size of convolutional layer 1 | {7,9,11,13,15,17} | 9 |
Kernel_size of convolutional layer 2 | {7,9,11,13,15,17} | 11 |
Kernel_size of convolutional layer 3 | {7,9,11,13,15,17} | 13 |
Number of filters for the 3 convolutional layers | {8,16,24,36,48,56} | 56 |
Number of neurons in fully connected layer | {16,32,64,96,128,160,192,224,256,288} | 16 |
Dropout | {0.1,0.2,0.3,0.4,0.5} | 0.1 |
optimizer | {SGD,RMSprop,Adagrad,Adam} | RMSprop |
batch_size | {10,20,40,60,80,100} | 40 |
epochs | {10,15,20,25,30} | 20 |
(4) Experimental results and analysis:
Training of the multi-scale CNN model is based on gradient descent: the process of finding the direction in which the function value decreases fastest and iterating along that direction to quickly reach a locally optimal solution. One epoch is one pass of training over all samples in the training set, and the epoch value is the total number of times the whole training set is used. Changing the epoch value changes the number of times the weights of the convolutional neural network are updated.
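The link between the epoch value and the number of weight updates can be made concrete with a short calculation. It assumes mini-batch gradient descent with one weight update per batch and a final partial batch that is kept, which is common framework behaviour but is not stated in the source. The sample counts come from Table 1, the 80% training split from the experiment description, and the batch size and epoch value from Table 2.

```python
import math


def weight_updates(n_train, batch_size, epochs):
    """Number of mini-batch updates: one per batch, ceil(n / batch) batches per epoch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs


n_samples = 8269 + 7871            # malware + benign samples (Table 1)
n_train = n_samples * 80 // 100    # 80% training split
print(weight_updates(n_train, batch_size=40, epochs=20))
```

Doubling the epoch value doubles the number of updates, which is why too many epochs can push the model from a good fit toward overfitting.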
The experiment used 80% of the samples for training and 20% for validation, and trained for 40 epochs to find a good epoch value. The accuracy curve of the multi-scale CNN model as the number of epochs increases is shown in fig. 4, and the log loss curve is shown in fig. 5. As can be seen from fig. 4 and fig. 5, as the epoch value increases from 0 to 5, training and validation accuracy rise rapidly and training and validation log loss fall rapidly; from epoch 5 to 40, training accuracy, validation accuracy and training log loss remain essentially unchanged, while the validation log loss continues to fluctuate and tends to increase. Based on a joint analysis of the accuracy and log loss curves in fig. 4 and fig. 5, the epoch value was set to 20.
After the number of training epochs was fixed at 20, a ten-fold cross-validation experiment was performed. In this experiment, the multi-scale CNN method proposed in this embodiment achieved a 10-fold cross-validation accuracy of 98.18% and a log loss of 0.1503; the confusion matrix is shown in fig. 6 and the normalized confusion matrix in fig. 7. As can be seen from fig. 6 and fig. 7, the malware detection method provided by the embodiment of the invention achieves good results with high classification accuracy.
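A ten-fold split can be sketched as below. This is a generic illustration of k-fold cross-validation, not the authors' exact procedure: it assumes contiguous folds over an already shuffled sample list, and the function name `k_fold_indices` is hypothetical. The sample count is the total from Table 1.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(16140, k=10))   # 16140 = total samples in Table 1
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Each sample is used for validation exactly once, and the per-fold metrics are typically averaged to give a single figure.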
The ROC curve of the multi-scale CNN-based malware detection model is shown in fig. 8; it reflects the relationship between the detection rate and the false positive rate as the detection threshold changes. The point (0, 1) represents a perfect classifier that classifies all samples correctly. The closer the ROC curve is to the upper left corner, the better the performance of the classifier. As can be seen from fig. 8, the ROC curve of the model is very close to the upper left corner, indicating good performance. The AUC value of the multi-scale CNN-based malware detection model is 0.997, very close to the optimal AUC value of 1.
(5) Comparison of experimental results:
To comprehensively evaluate the performance of the method proposed by the embodiment of the invention, the method is compared with classical detection methods, and the results are shown in table 3. As can be seen from table 3, most indexes of the multi-scale CNN-based malware detection method are better than those of the classical detection methods, and the method is slightly weaker than the byte sequence 3-grams method. The byte sequence 3-grams method must traverse the whole executable file to extract features, is strongly affected by the window value, takes a great deal of time to select feature values that give high accuracy, and requires feature selection or reduction to limit the size of the feature vector. Because table 3 compares 10-fold cross-validation results, it is difficult for the byte sequence 3-grams method to restrict feature extraction and feature selection to the training data during feature engineering; instead, features were extracted and selected from the entire data set, so its reported results are slightly better than its true performance. The multi-scale CNN-based malware detection method provided by the embodiment of the invention is an end-to-end detection method, and compared with the three detection methods based on the PE format structure, on DLLs and APIs, and on byte sequence 3-grams, it avoids a complicated feature engineering process. Compared with the detection method based on a single-scale CNN, the detection method based on the multi-scale CNN improves accuracy and AUC to a certain extent.
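For context, the feature extraction step of a byte 3-gram baseline can be sketched as follows. This is a generic illustration of byte n-gram counting and not the specific baseline implementation compared in the experiments; the function name `byte_ngrams` and the sample bytes are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every overlapping n-byte window in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


counts = byte_ngrams(b"\x4d\x5a\x90\x00\x4d\x5a\x90")
print(counts.most_common(2))
```

There are 256^3 = 16,777,216 possible byte 3-grams, which is why such methods need feature selection or reduction. To keep cross-validation results unbiased, that selection would have to be repeated on the training folds only.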
Table 3 Comparison of experimental results for different detection methods
Detection method | Accuracy (%) | Log loss | AUC |
---|---|---|---|
PE format structure | 96.84 | 0.1049 | 0.994 |
DLL and API | 96.08 | 0.1638 | 0.991 |
Byte sequence 3-grams | 98.8 | 0.0701 | 0.997 |
Single-scale CNN | 97.19 | 0.1165 | 0.996 |
Multi-scale CNN | 98.18 | 0.1503 | 0.997 |
A single-scale convolution kernel can only extract features at one scale and ignores features at other granularities, so the information expressed by the extracted features is incomplete. This embodiment provides a multi-scale CNN-based malware detection method in which an effective feature representation is learned directly from binary executable files by a multi-scale convolutional neural network. Convolution kernels of different scales extract features at different granularities: the same data is convolved with different window sizes at the same time, and the resulting features are then combined, so that richer and more complete feature information at different scales is learned from the data and the accuracy of malware detection is improved. The proposed method is conceptually sound and also achieves good results. The proposed multi-scale CNN malware detection method achieves an accuracy of 98.18%, a log loss of 0.1503 and an AUC value of 0.997, and its accuracy and AUC are better than those of most of the classical detection methods compared, so it is a malware detection method with good performance, including robustness.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (1)
1. The method for detecting the malicious software based on the multi-scale convolutional neural network is characterized by comprising the following steps of:
S1, converting a binary executable file of a training sample into a hexadecimal character sequence with a fixed length;
Firstly, setting a byte threshold, and processing the binary executable file of each training sample into a fixed-length binary executable file according to the set byte threshold: for a binary executable file whose byte length is greater than the byte threshold, the bytes beyond the threshold are discarded, and for a binary executable file whose byte length is smaller than the byte threshold, spaces are appended, so that the byte length of the binary executable file of every training sample equals the byte threshold;
Then, coding each byte character of the binary executable file of the training sample with fixed length, and converting each byte character of the binary executable file into an integer index of 1 to 257 to obtain a hexadecimal character sequence with fixed length;
S2, converting a hexadecimal character sequence with a fixed length into a low-dimensional vector through word embedding;
word2vec model is adopted for word embedding;
the byte threshold is set to 3000;
s3, inputting the low-dimensional vector generated by conversion into a multi-scale convolutional neural network, and training a detection model based on the multi-scale convolutional neural network;
Step S4, converting the binary executable file of the software to be detected into a low-dimensional vector according to the steps S1-S2, inputting the low-dimensional vector into the detection model based on the multi-scale convolutional neural network obtained by training in the step S3, classifying the software to be detected, and outputting a detection result, wherein the method specifically comprises the following steps:
Firstly, inputting the low-dimensional vector generated by conversion into the detection model based on the multi-scale convolutional neural network, and sliding 3 parallel one-dimensional convolution layers over the low-dimensional vector simultaneously to perform convolution operations; the features extracted by the 3 parallel convolution layers pass through a pooling layer and a first Dropout layer in turn and are then spliced by a splicing layer, so as to extract the effective features of the binary executable file of the software to be detected; each one-dimensional convolution layer uses 56 convolution kernels, the convolution kernel windows of the one-dimensional convolution layers are 9, 11 and 13 respectively, and the step sizes are all 1; the step sizes of the pooling layers are 9, 11 and 13 respectively; the pooling layers adopt maximum pooling;
then, nonlinear combination is carried out on the effective characteristics of the extracted binary executable file of the software to be detected by using a full connection layer, and a detection result is obtained; a second Dropout layer is arranged between the full-connection layer and the splicing layer; the full connection layer is provided with 16 neurons.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010231067.1A CN111382438B (en) | 2020-03-27 | 2020-03-27 | Malware detection method based on multi-scale convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111382438A (en) | 2020-07-07 |
CN111382438B (en) | 2024-04-23 |
Family
ID=71215726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010231067.1A Active CN111382438B (en) | 2020-03-27 | 2020-03-27 | Malware detection method based on multi-scale convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111382438B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329016B (en) * | 2020-12-31 | 2021-03-23 | 四川大学 | Visual malicious software detection device and method based on deep neural network |
CN113011262B (en) * | 2021-02-18 | 2023-10-13 | 广州大学华软软件学院 | Multi-size cell nucleus identification device and method based on convolutional neural network |
CN113420294A (en) * | 2021-06-25 | 2021-09-21 | 杭州电子科技大学 | Malicious code detection method based on multi-scale convolutional neural network |
CN114692156B (en) * | 2022-05-31 | 2022-08-30 | 山东省计算中心(国家超级计算济南中心) | Memory segment malicious code intrusion detection method, system, storage medium and equipment |
CN116361801B (en) * | 2023-06-01 | 2023-09-01 | 山东省计算中心(国家超级计算济南中心) | Malicious software detection method and system based on semantic information of application program interface |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446804A (en) * | 2018-09-27 | 2019-03-08 | 桂林电子科技大学 | A kind of intrusion detection method based on Analysis On Multi-scale Features connection convolutional neural networks |
CN110647745A (en) * | 2019-07-24 | 2020-01-03 | 浙江工业大学 | Detection method of malicious software assembly format based on deep learning |
CN110689011A (en) * | 2019-09-29 | 2020-01-14 | 河北工业大学 | Solar cell panel defect detection method of multi-scale combined convolution neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133202A (en) * | 2017-06-01 | 2017-09-05 | 北京百度网讯科技有限公司 | Text method of calibration and device based on artificial intelligence |
Non-Patent Citations (1)
Title |
---|
陈涵泊 等.基于Asm2Vec 的恶意代码同源判定方法.《通信技术》.2019,第52卷(第12期),第3011-3012页. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |