CN108717512A - A malicious code classification method based on convolutional neural networks - Google Patents
A malicious code classification method based on convolutional neural networks
- Publication number
- CN108717512A (application number CN201810469552.5A)
- Authority
- CN
- China
- Prior art keywords
- malicious code
- convolutional neural
- neural networks
- sorting technique
- audio signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Virology (AREA)
- Image Analysis (AREA)
- Error Detection And Correction (AREA)
Abstract
The invention discloses a malicious code classification method based on convolutional neural networks. The method maps malicious code to a single-channel signal, generates a spectrogram of the signal using signal processing methods, converts the spectrogram to a fixed-size grayscale image with an image scaling algorithm, and finally uses a convolutional neural network to classify the malicious code. Because the spectrogram is generated after the malicious code has been mapped to a single-channel signal, it captures sufficient contextual information about the malicious code, reflecting both the time-domain and frequency-domain information of the signal and both its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the essential features of the malicious code can be extracted more effectively, which helps overcome obfuscation such as code reordering and junk code insertion and improves the classification accuracy.
Description
Technical field
The present invention relates to the field of malicious code classification, and in particular to a malicious code classification method based on signal analysis.
Background
With the rapid development of the Internet, malicious code has become one of the principal threats to Internet security and continues to grow rapidly. Static analysis is one of the common approaches to classifying and identifying malicious code. Prior-art static analysis methods include methods based on image features of malicious code. For example, Nataraj L. et al. proposed the SPAM-GIST malicious code classification method (Nataraj L, Manjunath B S. SPAM: Signal Processing to Analyze Malware [Applications Corner][J]. IEEE Signal Processing Magazine, 2016, 33(2): 105-117), which maps a malicious code binary file to an image to describe its features, extracts the global GIST feature of the image using the multi-scale, multi-orientation responses of Gabor filters, uses this feature to represent the malicious code, and then classifies the malicious code with a nearest neighbor algorithm. In practice, however, malicious code is often obfuscated, for example by deformation or junk code insertion, so static analysis methods based on image features cannot effectively identify the obfuscated malicious code, which leads to low classification accuracy. How to obtain an analysis method that efficiently identifies obfuscated malicious code is therefore a problem to be solved by those skilled in the art.
Summary of the invention
The present invention provides a malicious code classification method based on convolutional neural networks, which solves the problem that prior-art malicious code classification techniques cannot effectively identify obfuscated malicious code, resulting in low classification accuracy.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a malicious code classification method based on convolutional neural networks, comprising the steps of: signal mapping, in which the binary file of the malicious code is mapped to an audio signal file; spectrogram generation, in which the audio signal file is used to generate a visual spectrogram of the malicious code; size adjustment, in which the spectrogram is resized to a fixed-size image using an image interpolation method; and code classification, in which the resized image is input to the convolutional neural network to classify the malicious code.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the signal mapping, every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, and the one-dimensional array is then mapped to an audio signal file.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz, and the quantization width to 4 bytes.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, after every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, the method further comprises: scanning the one-dimensional array with non-overlapping windows of length 128 and computing the information entropy of the data in each window; if the information entropy is 0, the data in that window are discarded.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the information entropy is $\mathrm{Ent} = -\sum_{k} p_k \log p_k$, where $p_k$ is the probability that the value $k$ occurs in the window.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the audio signal file is a WAV file.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the spectrogram generation comprises: (1) applying framing, windowing and the discrete Fourier transform to the audio signal file of the malicious code to obtain the coefficients $X(n,k)$, where $s(n)$ is the audio signal of the malicious code, $N$ is the window size, which equals the frame length used when framing the audio signal, $w(n)$ is a rectangular window function, and $X$ denotes the Fourier coefficients of $s(n)$; (2) computing the log-amplitude spectrum $A(n,k)$ of the audio signal file of the malicious code, where $A(n,k) = 10\log_{10}(|X(n,k)| + e^{-1})$; (3) applying a gray-level transform to the log-amplitude spectrum $A(n,k)$ to obtain $G(n,k)$, where $A_{\max}(n,k)$ denotes the maximum value of the spectrogram gray level; (4) saving $G(n,k)$ as a single-channel PNG image to obtain the grayscale spectrogram image of the malicious code.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the image interpolation method is the bicubic interpolation algorithm $f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\,w(x-x_i)\,w(y-y_j)$, where $(x,y)$ is the pixel to be interpolated in the spectrogram, $(x_i,y_j)$ $(i,j = 0,1,2,3)$ are the points of its 4*4 neighborhood, and $w(x)$ is the interpolation basis function, with parameter $a = 0.5$.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the nonlinear activation functions of the convolutional layers and the fully connected layers in the convolutional neural network include the ReLU function, the Leaky ReLU function or the ELU function.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the size adjustment, the spectrogram is resized to a fixed-size 128*128*1 image.
The beneficial effects of the invention are as follows. The invention discloses a malicious code classification method based on convolutional neural networks that maps malicious code to a single-channel signal, generates a spectrogram of the signal using signal processing methods, converts the spectrogram to a fixed-size grayscale image with an image scaling algorithm, and finally uses a convolutional neural network to classify the malicious code. Because the spectrogram is generated after the malicious code has been mapped to a single-channel signal, it captures sufficient contextual information about the malicious code, reflecting both the time-domain and frequency-domain information of the signal and both its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the essential features of the malicious code can be extracted more effectively, which helps overcome obfuscation such as code reordering and junk code insertion and improves the classification accuracy of malicious code.
Description of the drawings
Fig. 1 is a schematic diagram of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention;
Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention;
Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention;
Fig. 4 is a flow chart of an embodiment of generating the grayscale spectrogram image of the signal in the malicious code classification method based on convolutional neural networks of the present invention;
Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and the SPAM-GIST malicious code classification method;
Fig. 6 is a schematic diagram of the ROC curves of the malicious code classification method based on convolutional neural networks of the present invention and the SPAM-GIST malicious code classification method;
Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes;
Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method described in the present invention.
Detailed description of the embodiments
To facilitate understanding of the present invention, it is described in more detail below with reference to the drawings and specific embodiments. Preferred embodiments of the invention are shown in the drawings. The invention may, however, be implemented in many different forms and is not limited to the embodiments described in this specification; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and completely.
It should be noted that, unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the art to which the invention belongs. The terms used in the description of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a flow chart of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention. The method specifically comprises the following steps:
Step S1, signal mapping: the binary file of the malicious code is mapped to an audio signal file. First, the binary file of the malicious code is read 8 bits at a time, each 8 bits being taken as one unsigned integer, so that the binary file is converted into a one-dimensional array; the one-dimensional array is then converted into an audio signal file according to its values.
Step S2, spectrogram generation: the audio signal file is used to generate a visual spectrogram of the malicious code. The present invention processes each frame of the signal with the Fourier transform and then concatenates the results to generate the spectrogram. The spectrogram extracted from the malicious code signal captures sufficient contextual information about the malicious code, including the time-domain and frequency-domain information of the signal and its local and global information. Further, the present invention preprocesses the spectrogram to obtain a grayscale spectrogram image with 256 gray levels.
Step S3, size adjustment: the spectrogram is resized to a fixed-size image using an image interpolation method. So that the spectrograms can be analyzed with a convolutional neural network, the present invention uses image interpolation to transform all spectrograms to the same fixed size.
Step S4, code classification: the resized image is input to the convolutional neural network to classify the malicious code. Various classifiers exist in the prior art for classifying malicious code, such as random forests and support vector machines. Because convolutional neural networks have local translation invariance, they recognize deformed and shifted images well and can therefore effectively overcome common obfuscation methods such as code reordering and junk code insertion. In addition, a convolutional neural network has multiple convolution kernels, which can better extract the essential features of the malicious code spectrogram. The present invention therefore uses a convolutional neural network to achieve high-accuracy classification of malicious code.
The malicious code classification method based on convolutional neural networks described in this embodiment converts the binary file of malicious code into a single-channel audio signal file, obtains the spectrogram of the audio signal, and inputs the spectrogram to the convolutional neural network to obtain the class of the malicious code. The spectrogram retains sufficient contextual information about the malicious code, including the time-frequency information of the signal and its local and global information. The convolutional neural network has local translation invariance and multiple convolution kernels, so it can extract the essential features of the malicious code from the spectrogram and improve the classification accuracy.
Preferably, after the binary file of the malicious code is converted into a one-dimensional array, the method may further comprise: scanning the entire array with non-overlapping windows of length 128 and computing the information entropy (Ent) of the data in each window; if Ent = 0, the data in the window are discarded, otherwise they are retained. This operation can overcome malicious code obfuscation such as junk code insertion to a certain extent. The information entropy is $\mathrm{Ent} = -\sum_{k} p_k \log p_k$, where $p_k$ is the probability that the value $k$ occurs in the window.
Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 2, step S21 gives the binary file of the malicious code: 000000000000000011000100……;
Step S22: binary data are read from the binary file of the malicious code sequentially in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0 to 255): [0, 0, 196, ……];
Step S23: the one-dimensional array is mapped to an audio signal whose amplitude corresponds to the values of the one-dimensional array, and the generated audio signal is finally saved as a WAV file. The size of the WAV file depends on the size of the malicious code binary file, the number of channels (channel), the sampling frequency (framerate) and the quantization width (sampwidth).
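The byte-to-audio mapping described for Fig. 2 can be sketched with the Python standard library. This is only an illustrative sketch: the helper name `bytes_to_wav` is hypothetical, and storing each byte value directly as one little-endian 32-bit sample is an assumption, since the text states only that the amplitude corresponds to the array value.

```python
import io
import struct
import wave


def bytes_to_wav(data: bytes) -> bytes:
    """Map a binary blob to an in-memory mono WAV file (hypothetical helper)."""
    # Reading 8 bits at a time yields unsigned integers in the range 0..255.
    samples = list(data)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # single channel
        wav.setsampwidth(4)       # quantization width: 4 bytes per sample
        wav.setframerate(44100)   # sampling frequency: 44.1 kHz
        # Assumption: the amplitude is the raw byte value stored as int32.
        wav.writeframes(struct.pack("<%di" % len(samples), *samples))
    return buf.getvalue()


# The example bit string 00000000 00000000 11000100 is the bytes 0, 0, 196.
wav_bytes = bytes_to_wav(bytes([0, 0, 196]))
```

With these settings the file grows by four bytes per input byte on top of a fixed header, which matches the statement that the WAV size depends on the binary size, channel count, sampling frequency and quantization width.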
Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 3, step S31 gives the binary file of the malicious code: 000000000000000011000100……;
Step S32: binary data are read from the binary file of the malicious code sequentially in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0 to 255): [0, 0, 196, ……];
Step S33: the entire array is scanned with non-overlapping windows of length 128 and the information entropy (Ent) of the data in each window is computed; if Ent = 0, the data in the window are discarded, otherwise they are retained, giving the processed one-dimensional array. The information entropy is $\mathrm{Ent} = -\sum_{k} p_k \log p_k$, where $p_k$ is the probability that the value $k$ occurs in the window;
Step S34: the processed one-dimensional array is mapped to an audio signal whose amplitude corresponds to the values of the one-dimensional array, and the generated audio signal is finally saved as a WAV file.
In this embodiment, computing the information entropy of all data in each fixed-length window and discarding the data in windows whose entropy is 0 overcomes malicious code obfuscation such as junk code insertion to a certain extent, further improving the accuracy of the malicious code classification of the present invention.
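The window filtering of step S33 can be sketched as follows. The function names are hypothetical, the base-2 logarithm is an assumption (the base does not affect the test for zero entropy), and treating a shorter final window like any other window is also an assumption, since the text does not say how a trailing partial window is handled.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy of the values in one window (log base 2 assumed)."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def drop_zero_entropy(values, size=128):
    """Scan non-overlapping windows and keep only those with entropy > 0."""
    kept = []
    for start in range(0, len(values), size):
        window = values[start:start + size]
        # Entropy is 0 exactly when every value in the window is identical,
        # e.g. a run of padding bytes.
        if window_entropy(window) > 0:
            kept.extend(window)
    return kept


# A constant window, a varied window, and another constant window.
data = [0] * 128 + list(range(128)) + [255] * 128
filtered = drop_zero_entropy(data)
```

In this example only the middle window survives, because the two constant windows carry no information.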
Fig. 4 is a flow chart of an embodiment of generating the grayscale spectrogram image of the signal in the malicious code classification method based on convolutional neural networks of the present invention. In Fig. 4:
Step S41: framing, windowing and the discrete Fourier transform are applied to the audio signal file of the malicious code to obtain the coefficients $X(n,k)$, where $s(n)$ is the audio signal of the malicious code, $N$ is the window size, which equals the frame length used when framing the audio signal, $w(n)$ is a rectangular window function, and $X$ denotes the Fourier coefficients of $s(n)$.
Step S42: the log-amplitude spectrum $A(n,k)$ of the audio signal file of the malicious code is computed. Because a small number of $X(n,k)$ are equal to zero, a small value is added when computing the log-amplitude spectrum, i.e.
$A(n,k) = 10\log_{10}(|X(n,k)| + e^{-1})$;
Step S43: a gray-level transform is applied to the log-amplitude spectrum $A(n,k)$ to obtain $G(n,k)$, where $A_{\max}(n,k)$ denotes the maximum value of the spectrogram gray level. A single-channel grayscale image in a computer has 256 gray levels; when a set of values is converted into a single-channel grayscale image, any value in the array greater than 255 is replaced by 255. The gray-level transform is therefore applied to $A(n,k)$: it not only rescales the gray levels, but also yields $G(n,k)$ greater than 255 whenever $A(n,k)$ is less than 0.
Step S44: $G(n,k)$ is saved as a single-channel PNG image to obtain the grayscale spectrogram image of the malicious code. When $G(n,k)$ is saved as a single-channel PNG image, values in $G(n,k)$ greater than 255 are replaced by 255, giving a grayscale spectrogram image with 256 gray levels.
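Steps S41 to S43 can be illustrated with a small pure-Python sketch. Several details are assumptions rather than the patent's own values: the frame length and hop are placeholders, only the non-negative frequency bins are kept, and the gray-level transform is taken to be $G = 255\,(1 - A/A_{\max})$, a form chosen only because it satisfies the stated property that $A < 0$ gives $G > 255$. The function name is hypothetical.

```python
import cmath
import math


def spectrogram_gray(signal, frame_len=8, hop=8):
    """Return rows of gray levels (one row per frame); illustrative only."""
    # Framing with a rectangular window: each frame is used unweighted.
    starts = range(0, len(signal) - frame_len + 1, hop)
    log_amp = []
    for start in starts:
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive discrete Fourier transform of one frame at bin k.
            coeff = sum(s * cmath.exp(-2j * math.pi * k * m / frame_len)
                        for m, s in enumerate(frame))
            # Log-amplitude with the small offset e^-1 to avoid log(0).
            row.append(10 * math.log10(abs(coeff) + math.exp(-1)))
        log_amp.append(row)
    a_max = max(max(row) for row in log_amp)  # assumed to be nonzero
    # Assumed gray-level transform, then clipping to the 0..255 range
    # as happens when saving a single-channel image.
    return [[min(255, max(0, round(255 * (1 - a / a_max)))) for a in row]
            for row in log_amp]


gray = spectrogram_gray([0, 255] * 16)
```

For real files a fast Fourier transform library would replace the naive inner loop; the loop is written out here only to keep the sketch dependency-free.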
Specifically, when generating the grayscale spectrogram image of the audio signal file of the malicious code, the present invention determines the framing and windowing settings according to the duration of the audio signal file; that is, the frame length and frame shift are chosen according to the duration of the audio signal file, and the Fourier transform is then applied to each frame. Table 1 gives the frame length and frame shift corresponding to different audio signal file durations in the present invention.
Table 1: Frame length and frame shift for different audio signal file durations
Preferably, to retain the features of the spectrogram as far as possible and give the scaled image higher quality, the present invention scales the spectrogram using bicubic interpolation. Bicubic interpolation performs cubic interpolation on the gray values of the 16 points surrounding the interpolation point, taking into account not only the gray values of the 4 directly adjacent points but also the rate of change of gray value between neighboring points. The bicubic interpolation algorithm requires an interpolation basis function $w(x)$ to fit the data; the basis function used has parameter $a = 0.5$.
The bicubic interpolation formula used is:
$f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\,w(x-x_i)\,w(y-y_j)$
where $(x,y)$ is the pixel to be interpolated in the spectrogram and $(x_i,y_j)$ $(i,j = 0,1,2,3)$ are the points of the 4*4 neighborhood of that pixel.
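A 16-point bicubic resize can be sketched as below. The kernel shown is the widely used Keys cubic convolution kernel, which is an assumption here because the patent's basis function is not legible in this text; in that kernel's usual sign convention the parameter is written $a = -0.5$, whereas the text states $a = 0.5$, so the default value in the sketch is also an assumption. The function names, the pixel-center coordinate mapping and the edge clamping are illustrative choices.

```python
import math


def keys_kernel(x, a=-0.5):
    """Keys cubic convolution kernel (assumed form of the basis function)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_resize(img, out_h, out_w, a=-0.5):
    """Resize a 2-D list of gray values using the 4x4 neighborhood."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = (r + 0.5) * in_h / out_h - 0.5   # source row coordinate
        y0 = math.floor(y)
        row = []
        for c in range(out_w):
            x = (c + 0.5) * in_w / out_w - 0.5
            x0 = math.floor(x)
            val = 0.0
            for j in range(-1, 3):
                yy = min(max(y0 + j, 0), in_h - 1)   # clamp at the border
                wy = keys_kernel(y - (y0 + j), a)
                for i in range(-1, 3):
                    xx = min(max(x0 + i, 0), in_w - 1)
                    val += img[yy][xx] * wy * keys_kernel(x - (x0 + i), a)
            row.append(val)
        out.append(row)
    return out


resized = bicubic_resize([[7.0] * 4 for _ in range(4)], 8, 8)
```

Because the kernel weights sum to one, a constant image stays constant after resizing, which is a convenient sanity check. In practice an image library's bicubic resampling would normally be used to produce the 128*128 input.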
Preferably, the present invention uses a convolutional neural network to classify and identify malicious code. After the grayscale spectrogram image of the malicious code signal has been extracted, the spectrogram must be classified by a classification algorithm. The present invention specifically uses a convolutional neural network (Convolutional Neural Network, CNN) to classify the spectrogram. A CNN is a multilayer neural network structure consisting mainly of an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. The input layer receives the grayscale spectrogram image; the convolutional layers extract the features of the spectrogram; the pooling layers use the local correlation of the image to reduce the amount of data to be processed; the output layer maps the features to the finally predicted malicious code class. For the convolutional neural network:
(1) Convolutional layer. In a convolutional layer, the output of the previous layer serves as the input of the current layer, and a nonlinear activation function is introduced to improve the expressive power of the network model. The forward propagation of a convolutional layer at layer $l$ can be expressed as:
$u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l, \quad x_j^l = f(u_j^l)$
where $u_j^l$ is the net activation of the $j$-th channel of layer $l$, $x_j^l$ is the output of the $j$-th channel of layer $l$, $k_{ij}^l$ is the convolution kernel, $b_j^l$ is the bias term of the convolutional layer, $M_j$ is the subset of input feature maps used to compute $u_j^l$, $f$ is the activation function, and $*$ denotes convolution.
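The forward pass of one output channel of a convolutional layer (sum the kernel responses over the input feature maps, add a bias, apply the activation) can be sketched as follows. The function names are hypothetical. The sketch uses no padding and does not flip the kernel, i.e. it computes cross-correlation as most deep learning frameworks do; both choices are assumptions made to keep the example short.

```python
def conv2d_valid(x, kernel):
    """Slide one kernel over one feature map without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(v):
    return max(0.0, v)


def conv_layer_channel(inputs, kernels, bias, activation=relu):
    """One output channel: activation(sum_i conv(x_i, k_i) + bias)."""
    maps = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    h, w = len(maps[0]), len(maps[0][0])
    return [[activation(sum(m[r][c] for m in maps) + bias)
             for c in range(w)]
            for r in range(h)]


feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
diag_kernel = [[1, 0], [0, -1]]
out = conv_layer_channel([feature_map], [diag_kernel], bias=5.0)
```

Here every 2x2 patch gives a response of -4, so the bias of 5 lifts the net activation to 1 before the ReLU is applied.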
(2) Pooling layer. In a convolutional network, the pooling function replaces the output of the network at a given position with a summary statistic of the neighboring outputs; it reduces the number of features in the network while leaving most outputs unchanged. The pooling function down-samples the input feature map to produce the output feature map according to:
$u_j^l = \beta \, \mathrm{down}(x_j^{l-1}) + b_j^l, \quad x_j^l = f(u_j^l)$
where $u_j^l$ is the net activation of the $j$-th channel of layer $l$, $\beta$ is the weight coefficient of the pooling layer, $b_j^l$ is the bias term of the pooling layer, and $\mathrm{down}(\cdot)$ is the pooling function, which divides the input feature map $x_j^{l-1}$ into multiple non-overlapping image blocks with a sliding window and then takes the sum, the mean or the maximum of the pixels in each block. The present invention uses the max pooling function, which takes the maximum value in each image block.
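Max pooling over non-overlapping blocks can be sketched in a few lines. The function name and the 2x2 block size are assumptions for illustration, and the pooling weight and bias are left out (equivalent to a weight of 1, a bias of 0 and an identity activation).

```python
def max_pool2d(x, size=2):
    """Take the maximum of each non-overlapping size x size block."""
    out_h, out_w = len(x) // size, len(x[0]) // size
    return [[max(x[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


pooled = max_pool2d([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12],
                     [13, 14, 15, 16]])
```

Each output value summarizes one block, so a 4x4 map becomes 2x2 and small shifts inside a block leave the output unchanged.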
(3) Fully connected layer. In the fully connected layer, the two-dimensional feature maps output by the convolutional layers are concatenated into a one-dimensional feature vector that serves as the input of the fully connected layer. The fully connected layer computes a weighted sum of its inputs and passes it through the activation function, which can be expressed as:
$x^l = f(W^l x^{l-1} + b^l)$
where $l$ denotes the layer index, $x$ is the feature vector, $W$ is the weight matrix of the current layer, $b$ is the bias term of the current layer, and $f$ is the activation function.
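Flattening the feature maps and applying one fully connected layer (weighted sum, bias, activation) can be sketched as follows; `flatten` and `dense` are hypothetical helper names and the weights are arbitrary example values.

```python
def flatten(feature_maps):
    """Concatenate 2-D feature maps into one 1-D feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(x, weights, bias, activation):
    """One fully connected layer: activation(W x + b), row by row."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


vector = flatten([[[1, 2], [3, 4]]])
hidden = dense(vector,
               weights=[[1, 0, 0, 0], [0, -1, 0, 0]],
               bias=[0.5, 0.0],
               activation=lambda v: max(0.0, v))
```

The first unit outputs 1.5 and the second is clipped to 0 by the ReLU, since its weighted sum is negative.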
(4) Other structures. To improve the expressive power of the network model, nonlinear activation functions are introduced; common activation functions for convolutional layers and fully connected layers include ReLU, Leaky ReLU and ELU. The activation function of the convolutional layers and fully connected layers in the present invention may specifically be the ReLU function, whose mathematical expression is:
$f(x) = \max(0, x)$
In addition, Softmax regression is used to turn the result of the forward propagation of the neural network into a probability distribution. Suppose the original outputs of the neural network are $y_1, y_2, y_3, \ldots, y_n$; the output after Softmax regression is:
$\mathrm{softmax}(y_i) = \dfrac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}$
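Softmax regression, which turns raw network outputs into a probability distribution, can be sketched as below. Subtracting the maximum before exponentiating is a standard numerical-stability detail that is not mentioned in the text; it does not change the result.

```python
import math


def softmax(outputs):
    """Map raw outputs y_1..y_n to probabilities that sum to 1."""
    shift = max(outputs)                      # stability shift (assumption)
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

Larger raw outputs receive larger probabilities, and equal raw outputs receive equal probabilities.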
The present invention classifies spectrograms using a network model similar to VGG16. The convolutional neural network model is shown in Table 2, where Feature denotes the structures in the convolutional neural network, kernel size denotes the size of the kernel used, stride denotes the step of the movement, and pad denotes padding (1 means zero padding is used so that the input and output feature maps of the current layer have the same size, and 0 means no padding); function denotes the function used by the current layer.
Table 2: Convolutional neural network model
After the above network model has been obtained, the network must be trained with a loss function so that it can predict the class of malicious code. The present invention measures the error with the cross-entropy loss function. Suppose there are training samples $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_M,y_M)\}$ belonging to $k$ malicious code families, where $y$ is the $k$-dimensional one-hot class vector of the malicious code family. For each sample $m$, the cross-entropy can be computed as:
$H(y^{(m)}, \hat{y}^{(m)}) = -\sum_{i=1}^{k} y_i^{(m)} \log \hat{y}_i^{(m)}$
where $y_i^{(m)}$ is the $i$-th value in the class vector of sample $m$ and $\hat{y}_i^{(m)}$ is the $i$-th value in the probability distribution produced for sample $m$ by Softmax regression at the output layer. The network model can therefore be optimized by minimizing the loss function over all training samples:
$J = \sum_{m=1}^{M} H(y^{(m)}, \hat{y}^{(m)})$
The loss function $J$ is the sum of all the cross-entropies, and the network model can be trained with optimization algorithms such as stochastic gradient descent. Finally, the trained network model is used to predict the class of malicious code.
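The cross-entropy of one sample and the summed loss over a batch can be sketched as follows. The function names are hypothetical, the natural logarithm is assumed, and the small `eps` guard against `log(0)` is an implementation detail that is not part of the text.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


def total_loss(labels, predictions):
    """Sum of the per-sample cross-entropies over all training samples."""
    return sum(cross_entropy(t, p) for t, p in zip(labels, predictions))


labels = [[0, 1, 0], [1, 0, 0]]
predictions = [[0.2, 0.5, 0.3], [0.9, 0.05, 0.05]]
loss = total_loss(labels, predictions)
```

Because the label is one-hot, each sample contributes only the negative log-probability assigned to its true family, so confident correct predictions drive the loss toward zero.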
Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and the SPAM-GIST malicious code classification method. The malicious code dataset used in the present invention comes from the Microsoft Malware Classification Challenge hosted by Microsoft on Kaggle; a total of 10860 malicious code binary files in 9 classes were selected for the experiments. Table 3 gives the basic information of the malicious code dataset used in the present invention.
Table 3: Malicious code dataset
Malicious code family | Class number | Quantity |
Ramnit | 0 | 1533 |
Lollipop | 1 | 2478 |
Kelihos_ver3 | 2 | 2942 |
Vundo | 3 | 475 |
Simda | 4 | 42 |
Tracur | 5 | 751 |
Kelihos_ver1 | 6 | 398 |
Obfuscator.ACY | 7 | 1228 |
Gatak | 8 | 1013 |
Specifically, in the experiments of the present invention, 80% of the malicious code dataset is used as the training set and 20% as the test set. When the binary file of malicious code is mapped to a WAV file, the number of channels (channel) is set to 1, the sampling frequency (framerate) to 44.1 kHz, and the quantization width (sampwidth) to 4 bytes. In addition, in the malicious code classification method based on convolutional neural networks of the present invention, the generated spectrogram is scaled to a 128*128*1 grayscale image as the input of the convolutional neural network. In the SPAM-GIST malicious code classification method, the k-nearest neighbor (K-Nearest Neighbor, KNN) classification algorithm gives the best results when K = 1; therefore, in the present invention, K is set to 1 for the SPAM-GIST malicious code classification method.
Specifically, the present invention uses four evaluation indices, namely accuracy (Accuracy), macro precision (macro_P), macro recall (macro_R) and macro F1 (macro_F1), to evaluate how well the malicious code classification method based on convolutional neural networks classifies malicious code. A further index used in the present invention is the ROC (Receiver Operating Characteristic) curve, whose vertical axis is the true positive rate (True Positive Rate, TPR) and whose horizontal axis is the false positive rate (False Positive Rate, FPR); the area under the ROC curve, i.e. the AUC (Area Under ROC Curve), is used to evaluate the classification performance of the convolutional neural network.
For a multi-class problem, each pairwise combination of classes corresponds to one confusion matrix. The true positive rate, false positive rate, recall and precision are computed on each confusion matrix and then averaged to obtain TPR, FPR, macro precision (macro_P) and macro recall (macro_R). The evaluation indices are computed as follows:
$\mathrm{TPR\_S} = \dfrac{TP}{TP+FN}, \quad \mathrm{FPR\_S} = \dfrac{FP}{FP+TN}, \quad P = \dfrac{TP}{TP+FP}, \quad R = \dfrac{TP}{TP+FN}$
$\mathrm{TPR} = \dfrac{1}{n}\sum_{i=1}^{n} \mathrm{TPR\_S}_i, \quad \mathrm{FPR} = \dfrac{1}{n}\sum_{i=1}^{n} \mathrm{FPR\_S}_i, \quad \mathrm{macro\_P} = \dfrac{1}{n}\sum_{i=1}^{n} P_i, \quad \mathrm{macro\_R} = \dfrac{1}{n}\sum_{i=1}^{n} R_i$
$\mathrm{macro\_F1} = \dfrac{2 \times \mathrm{macro\_P} \times \mathrm{macro\_R}}{\mathrm{macro\_P} + \mathrm{macro\_R}}$
where TP, FP, FN and TN denote, respectively, the positive samples identified as positive by the classifier, the negative samples identified as positive, the positive samples identified as negative, and the negative samples identified as negative; $n$ is the number of confusion matrices; and TPR_S, FPR_S, P and R are the true positive rate, false positive rate, precision and recall of each confusion matrix.
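The averaging over binary confusion matrices can be sketched as follows. The function name is hypothetical, and the input is simply a list of `(TP, FP, FN, TN)` tuples; how those tuples are derived from the multi-class predictions is left to the caller. The macro F1 form used (harmonic mean of macro precision and macro recall) is the common textbook definition and is an assumption here.

```python
def macro_metrics(matrices):
    """Average per-matrix rates over a list of (TP, FP, FN, TN) tuples."""
    def ratio(num, den):
        return num / den if den else 0.0   # guard against empty classes

    n = len(matrices)
    macro_p = sum(ratio(tp, tp + fp) for tp, fp, fn, tn in matrices) / n
    macro_r = sum(ratio(tp, tp + fn) for tp, fp, fn, tn in matrices) / n
    fpr = sum(ratio(fp, fp + tn) for tp, fp, fn, tn in matrices) / n
    macro_f1 = ratio(2 * macro_p * macro_r, macro_p + macro_r)
    return {"macro_P": macro_p, "macro_R": macro_r,
            "TPR": macro_r,   # the true positive rate equals recall
            "FPR": fpr, "macro_F1": macro_f1}


scores = macro_metrics([(8, 2, 2, 88), (5, 5, 5, 85)])
```

Averaging per-matrix rates gives every class the same weight, so a small family such as Simda influences the macro scores as much as a large one.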
Specifically, Fig. 5 is a schematic diagram of the recognition results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method; Fig. 6 is a schematic diagram of the ROC curves of the two methods; Table 4 gives the confusion matrix of the malicious code classification method based on convolutional neural networks of the present invention; Table 5 gives the confusion matrix of the SPAM-GIST malicious code classification method.
Table 4. Confusion matrix of the malicious code classification method based on convolutional neural networks
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | .976 | .006 | 0 | .003 | .003 | .003 | .003 | .006 | 0 |
1 | .013 | .983 | 0 | 0 | 0 | 0 | 0 | .004 | 0 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | .99 | 0 | 0 | 0 | .01 | 0 |
4 | 0 | 0 | 0 | 0 | .86 | 0 | 0 | .14 | 0 |
5 | .013 | 0 | 0 | .007 | 0 | .973 | 0 | .007 | 0 |
6 | 0 | 0 | 0 | .014 | 0 | 0 | .972 | 0 | .014 |
7 | .032 | .004 | .026 | 0 | 0 | 0 | .004 | .915 | .013 |
8 | .011 | 0 | 0 | 0 | 0 | 0 | .011 | .022 | .956 |
Table 5. Confusion matrix of the SPAM-GIST malicious code classification method
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | .879 | .01 | 0 | .01 | .003 | .038 | .003 | .006 | .051 |
1 | .004 | .964 | .006 | .004 | .002 | .012 | 0 | .004 | .004 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | .965 | 0 | .012 | 0 | .023 | 0 |
4 | .182 | .091 | 0 | 0 | .636 | .091 | 0 | 0 | 0 |
5 | .007 | .014 | 0 | .014 | 0 | .951 | .014 | 0 | 0 |
6 | 0 | 0 | .014 | 0 | 0 | .028 | .944 | 0 | .014 |
7 | .046 | .025 | 0 | .013 | .004 | .025 | .004 | .866 | .017 |
8 | .019 | .005 | .024 | .015 | 0 | .019 | 0 | .01 | .908 |
As can be seen from Fig. 5, Fig. 6, Table 4 and Table 5, the malicious code classification method based on convolutional neural networks of the present invention classifies malicious code clearly better than the SPAM-GIST malicious code classification method, both overall and for each individual malicious code family. That is, the method of the present invention identifies malicious code better and can meet the accuracy requirements of practical applications. In addition, Fig. 6 shows that the AUC of the method of the present invention is 0.978, while the AUC of the SPAM-GIST malicious code classification method is 0.953; the classification performance of the method of the present invention is therefore better than that of the SPAM-GIST malicious code classification method.
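The per-family comparison can be checked directly from the diagonals of Table 4 and Table 5. The sketch below assumes that each row of a confusion matrix corresponds to the true family and is normalized to sum to approximately 1, so that each diagonal entry is the recall of that family. The averages it computes are derived here for illustration and are not figures reported in the patent.

```python
# Diagonal entries of Table 4 (CNN method) and Table 5 (SPAM-GIST), families 0..8.
cnn_diag = [.976, .983, 1.0, .99, .86, .973, .972, .915, .956]
gist_diag = [.879, .964, 1.0, .965, .636, .951, .944, .866, .908]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Unweighted average of the per-family recalls (macro recall under the assumption above).
cnn_macro_recall = mean(cnn_diag)
gist_macro_recall = mean(gist_diag)

# Families for which the CNN method has a lower diagonal entry than SPAM-GIST.
families_where_cnn_is_worse = [
    k for k, (c, g) in enumerate(zip(cnn_diag, gist_diag)) if c < g
]
```

Under these assumptions the list of families where the CNN method is worse is empty; family 2 is a tie, since both methods classify it perfectly.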
Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes. In the present invention, the spectrogram of the malicious code is further resized to images of different sizes using the bicubic interpolation algorithm, in order to verify the influence of image size on the classification performance of the method. In this experiment, the total amount of test data (TNum) and the amount of correctly classified data (RNum) are counted for each family, and Accuracy, P and R are calculated. Since in multi-class classification the Accuracy of each family is P, only P and R need to be calculated. The results are shown in Table 6, which gives the classification results of the method for spectrograms of different sizes for each family.
Table 6. Classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes for each family
As can be seen from Table 6 and Fig. 7, the size of the spectrogram has little influence on the correct recognition rate (Accuracy) of the malicious code classification method based on convolutional neural networks of the present invention.
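The bicubic resizing used in this experiment can be sketched as follows. The kernel formula is not reproduced in this text, so the sketch assumes the widely used cubic convolution kernel. In that convention the usual parameter value is a = -0.5, whereas claim 8 states a = 0.5 (sign conventions vary between references), so `a` is left as a parameter. The names `cubic_kernel` and `bicubic_sample` and the edge clamping are assumptions for illustration.

```python
import math


def cubic_kernel(x, a=-0.5):
    """Cubic convolution basis function w(x); the default a is an assumption."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_sample(img, x, y, a=-0.5):
    """Interpolate img (a list of rows) at real-valued column x and row y."""
    height, width = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for j in range(-1, 3):          # 4 neighbouring rows
        for i in range(-1, 3):      # 4 neighbouring columns
            xi = min(max(x0 + i, 0), width - 1)   # clamp at the image border
            yj = min(max(y0 + j, 0), height - 1)
            weight = cubic_kernel(x - (x0 + i), a) * cubic_kernel(y - (y0 + j), a)
            total += img[yj][xi] * weight
    return total
```

Resizing an image to a target size such as 128*128 then amounts to evaluating `bicubic_sample` at the source coordinates that correspond to each target pixel.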
Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method of the present invention. In the present invention, two kinds of spectrogram are further used as input to the convolutional neural network in order to analyze the impact on the classification of the malicious code: the spectrogram of the audio signal converted directly from the binary file of the malicious code, and the spectrogram of the audio signal converted from the binary file after the information entropy calculation. The former is denoted the MCCNN_ORI method and the latter the MCCNN method. In this experiment, the spectrogram size is 128*128*1. As can be seen from Fig. 8, MCCNN_ORI and MCCNN achieve almost the same experimental results.
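The direct conversion used by MCCNN_ORI (binary file to audio signal) can be sketched with Python's standard `wave` module, using the parameters stated in claim 3: one channel, 44.1 kHz sampling frequency and 4-byte quantization. How byte values are scaled to sample amplitudes is not stated here; storing each byte value unchanged as one 32-bit sample is an assumption, and `bytes_to_wav` is a hypothetical helper name.

```python
import io
import struct
import wave


def bytes_to_wav(data: bytes) -> bytes:
    """Map each byte (0..255) to one 32-bit sample of a mono 44.1 kHz WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # single channel
        wav.setsampwidth(4)       # 4 bytes per sample
        wav.setframerate(44100)   # 44.1 kHz sampling frequency
        # Assumption: byte values are stored unchanged as native-endian int32;
        # the wave module takes care of the on-disk byte order.
        wav.writeframes(struct.pack(f"={len(data)}i", *data))
    return buffer.getvalue()


wav_bytes = bytes_to_wav(bytes([0, 1, 255]))
```

In practice `data` would be the full contents of the malicious code binary, read with `open(path, "rb").read()`.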
In summary, the malicious code classification method based on convolutional neural networks of the present invention is little affected by parameters such as external inputs and is itself highly stable. The method therefore has strong applicability and generalizability.
In addition, the present invention further counts the amount of data discarded from each sample in MCCNN. The statistics are shown in Table 7, where SamNum denotes the number of samples and AbdNum denotes the amount of data discarded from a sample. For example, the row of Table 7 with AbdNum <1000 and SamNum 1203 means that there are 1203 samples from which less than 1000 was discarded.
Table 7. Amount of data discarded in MCCNN
AbdNum | SamNum |
---|---|
<1000 | 1203 |
1000~5000 | 5701 |
5000~10000 | 1254 |
>10000 | 2702 |
As can be seen from Table 7, although MCCNN_ORI contains much more inserted junk code than MCCNN, the malicious code classification method based on convolutional neural networks of the present invention still obtains the same classification results. Therefore, the method proposed by the present invention can effectively overcome junk code insertion.
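The discarding step counted in Table 7 follows claim 4: the byte array is scanned with non-overlapping windows of length 128, and a window whose information entropy is 0 is dropped. The entropy formula is not reproduced in this text; the sketch below assumes the standard Shannon entropy H = -sum_k p_k log2 p_k, where p_k is the fraction of positions in the window holding the value k. The function names and the handling of a shorter final window are assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (base 2) of the values in one window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def drop_zero_entropy_windows(values, window_len=128):
    """Keep only the non-overlapping windows whose entropy is greater than 0."""
    kept = []
    for start in range(0, len(values), window_len):
        window = values[start:start + window_len]  # the last window may be shorter
        if window_entropy(window) > 0:
            kept.extend(window)
    return kept


# Two constant windows (e.g. padding) around one window with varied content.
data = [0] * 128 + list(range(128)) + [7] * 128
filtered = drop_zero_entropy_windows(data)
```

The entropy of a window is 0 exactly when every position holds the same value, so this step removes runs of constant padding while leaving varied content untouched.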
In summary, the present invention discloses a malicious code classification method based on convolutional neural networks that combines signal processing techniques with the representational power of convolutional neural networks. The malicious code is mapped to a single-channel signal, a spectrogram of the signal is generated by a signal processing method, the spectrogram is converted to a fixed-size grayscale image by an image scaling algorithm, and finally a convolutional neural network classifies the malicious code. Because the malicious code is mapped to a single-channel signal before the corresponding spectrogram is generated, sufficient context information about the malicious code can be obtained: the spectrogram reflects not only the time-domain and frequency-domain information of the signal but also its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the essential features of the malicious code can be captured better and situations such as code rearrangement and junk code insertion are effectively overcome, thereby improving the classification accuracy for malicious code.
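The spectrogram step of this pipeline, as outlined in claim 7 (framing, rectangular window, discrete Fourier transform, logarithmic amplitude spectrum, grayscale transformation), can be sketched without external libraries as follows. Several details are assumptions: frames do not overlap, the frame length of 8 is chosen only to keep the example small, the additive constant inside the logarithm is printed ambiguously in the claim and is exposed here as the placeholder parameter `offset`, and the grayscale step uses min-max scaling to 0..255. Writing the PNG file is omitted.

```python
import cmath
import math


def spectrogram_gray(signal, frame_len=8, offset=1e-6):
    """Return one row of 0..255 grey levels per frame of `signal`."""
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    rows = []
    # Framing with a rectangular window: non-overlapping frames, all weights 1.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            # Discrete Fourier transform coefficient of this frame at bin k.
            x = sum(s * cmath.exp(-2j * math.pi * k * m / frame_len)
                    for m, s in enumerate(frame))
            # Logarithmic amplitude; `offset` keeps the logarithm finite.
            row.append(10 * math.log10(abs(x) + offset))
        rows.append(row)
    lo = min(min(r) for r in rows)
    hi = max(max(r) for r in rows)
    span = (hi - lo) or 1.0
    # Assumed grey mapping: min-max scaling of the log amplitudes to 0..255.
    return [[round(255 * (a - lo) / span) for a in r] for r in rows]


# One silent frame followed by one constant frame.
gray = spectrogram_gray([0] * 8 + [255] * 8)
```

A production implementation would normally use an FFT routine instead of the direct sum, which is written out here only to make each step visible.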
The above is only an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A malicious code classification method based on convolutional neural networks, characterized by comprising the steps of:
Signal mapping: mapping the binary file of the malicious code to an audio signal file;
Spectrogram generation: generating a spectrogram of the malicious code by visualizing the audio signal file;
Size modification: resizing the spectrogram to an image of fixed size using an image interpolation method;
Code classification: inputting the resized image into the convolutional neural network to classify the malicious code.
2. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the signal mapping, every 8 bits are read as one unsigned integer so as to convert the binary file of the malicious code into a one-dimensional array, and the one-dimensional array is then mapped to an audio signal file.
3. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz and the quantization width to 4 bytes.
4. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, after every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, the method further comprises:
scanning the one-dimensional array with non-overlapping windows of length 128 and calculating the information entropy of the data in each window; if the information entropy is 0, the data in the window is discarded.
5. The malicious code classification method based on convolutional neural networks according to claim 4, characterized in that the information entropy is computed from p_k, where p_k is the probability that the value k occurs in the window.
6. The malicious code classification method based on convolutional neural networks according to any one of claims 2-5, characterized in that the audio signal file is a WAV file.
7. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the spectrogram generation comprises:
(1) performing framing, windowing and discrete Fourier transform on the audio signal file of the malicious code, where s(n) is the audio signal file of the malicious code, N is the window size, which is the same as the frame length used when the audio signal is framed, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n);
(2) calculating the logarithmic amplitude spectrum A(n, k) of the audio signal file of the malicious code, where A(n, k) = 10 log10(|X(n, k)| + e-1);
(3) performing a grayscale transformation on the logarithmic amplitude spectrum A(n, k) to obtain G(n, k), where Amax(n, k) denotes the maximum value of the spectrogram gray level;
(4) saving G(n, k) as a single-channel PNG image to obtain the grayscale spectrogram image of the malicious code.
8. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the image interpolation method is the bicubic interpolation algorithm, in which (x, y) is the pixel to be interpolated in the spectrogram, (x_i, y_j) (i, j = 0, 1, 2, 3) are the points of the 4*4 neighborhood of that pixel, and w(x) is the interpolation basis function, with a = 0.5.
9. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the nonlinear activation function of the convolutional layers and fully connected layers in the convolutional neural network comprises a ReLU function, a Leaky ReLU function or an ELU function.
10. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the size modification, the spectrogram is resized to an image of fixed size 128*128*1.
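For reference, the three activation functions named in claim 9 can be written as scalar functions as follows. The negative slope of 0.01 for Leaky ReLU and alpha = 1.0 for ELU are common defaults assumed here; the claims do not state these values.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small linear slope (assumed 0.01) for negative inputs."""
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    """ELU: alpha * (exp(x) - 1) for negative inputs (alpha assumed 1.0)."""
    return x if x > 0 else alpha * math.expm1(x)
```

All three are the identity for positive inputs and differ only in how they treat negative inputs.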
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810469552.5A CN108717512B (en) | 2018-05-16 | 2018-05-16 | Malicious code classification method based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810469552.5A CN108717512B (en) | 2018-05-16 | 2018-05-16 | Malicious code classification method based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108717512A (en) | 2018-10-30 |
CN108717512B (en) | 2021-06-18 |
Family
ID=63900133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810469552.5A Active CN108717512B (en) | 2018-05-16 | 2018-05-16 | Malicious code classification method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108717512B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090281981A1 (en) * | 2008-05-06 | 2009-11-12 | Chen Barry Y | Discriminant Forest Classification Method and System |
CN105989288A (en) * | 2015-12-31 | 2016-10-05 | 武汉安天信息技术有限责任公司 | Deep learning-based malicious code sample classification method and system |
CN106096411A (en) * | 2016-06-08 | 2016-11-09 | 浙江工业大学 | A kind of Android malicious code family classification method based on bytecode image clustering |
CN107092829A (en) * | 2017-04-21 | 2017-08-25 | 中国人民解放军国防科学技术大学 | A kind of malicious code detecting method based on images match |
CN107392019A (en) * | 2017-07-05 | 2017-11-24 | 北京金睛云华科技有限公司 | A kind of training of malicious code family and detection method and device |
CN107609399A (en) * | 2017-09-09 | 2018-01-19 | 北京工业大学 | Malicious code mutation detection method based on NIN neutral nets |
CN107908963A (en) * | 2018-01-08 | 2018-04-13 | 北京工业大学 | A kind of automatic detection malicious code core feature method |
Non-Patent Citations (1)
Title |
---|
杨益敏等: "基于字节码图像的Android恶意代码家族分类方法", 《万方数据库》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110572393A (en) * | 2019-09-09 | 2019-12-13 | 河南戎磐网络科技有限公司 | Malicious software traffic classification method based on convolutional neural network |
CN110751955B (en) * | 2019-09-23 | 2022-03-01 | 山东大学 | Sound event classification method and system based on time-frequency matrix dynamic selection |
CN110751955A (en) * | 2019-09-23 | 2020-02-04 | 山东大学 | Sound event classification method and system based on time-frequency matrix dynamic selection |
CN110704842A (en) * | 2019-09-27 | 2020-01-17 | 山东理工大学 | Malicious code family classification detection method |
CN110659495A (en) * | 2019-09-27 | 2020-01-07 | 山东理工大学 | Malicious code family classification method |
CN111291814A (en) * | 2020-02-15 | 2020-06-16 | 河北工业大学 | Crack identification algorithm based on convolution neural network and information entropy data fusion strategy |
CN111291814B (en) * | 2020-02-15 | 2023-06-02 | 河北工业大学 | Crack identification algorithm based on convolutional neural network and information entropy data fusion strategy |
CN111552963A (en) * | 2020-04-07 | 2020-08-18 | 哈尔滨工程大学 | Malicious software classification method based on structural entropy sequence |
CN111552965A (en) * | 2020-04-07 | 2020-08-18 | 哈尔滨工程大学 | Malicious software classification method based on PE (provider edge) header visualization |
CN111783088A (en) * | 2020-06-03 | 2020-10-16 | 杭州迪普科技股份有限公司 | Malicious code family clustering method and device and computer equipment |
CN111783088B (en) * | 2020-06-03 | 2023-04-28 | 杭州迪普科技股份有限公司 | Malicious code family clustering method and device and computer equipment |
CN113204764A (en) * | 2021-04-02 | 2021-08-03 | 武汉大学 | Unsigned binary indirect control flow identification method based on deep learning |
CN114579970A (en) * | 2022-05-06 | 2022-06-03 | 南京明博互联网安全创新研究院有限公司 | Convolutional neural network-based android malicious software detection method and system |
CN114579970B (en) * | 2022-05-06 | 2022-07-22 | 南京明博互联网安全创新研究院有限公司 | Convolutional neural network-based android malicious software detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN108717512B (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717512A (en) | A kind of malicious code sorting technique based on convolutional neural networks | |
CN109344618B (en) | Malicious code classification method based on deep forest | |
CN107316013B (en) | Hyperspectral image classification method based on NSCT (non-subsampled Contourlet transform) and DCNN (data-to-neural network) | |
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
Hossain et al. | Leaf shape identification based plant biometrics | |
US20180349743A1 (en) | Character recognition using artificial intelligence | |
CN111401375A (en) | Text recognition model training method, text recognition device and text recognition equipment | |
Jiang et al. | Cascaded subpatch networks for effective CNNs | |
CN108664911B (en) | Robust face recognition method based on image sparse representation | |
Nanehkaran et al. | Analysis and comparison of machine learning classifiers and deep neural networks techniques for recognition of Farsi handwritten digits | |
CN107679572A (en) | A kind of image discriminating method, storage device and mobile terminal | |
CN111415323B (en) | Image detection method and device and neural network training method and device | |
Kim et al. | Label-preserving data augmentation for mobile sensor data | |
CN110188827A (en) | A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model | |
CN115600200A (en) | Android malicious software detection method based on entropy spectrum density and adaptive contraction convolution | |
CN113255568B (en) | Bill image classification method and device, computer equipment and storage medium | |
Dogan | A New Global Pooling Method for Deep Neural Networks: Global Average of Top-K Max-Pooling. | |
Li | Saliency prediction based on multi-channel models of visual processing | |
Wang et al. | Coarse-to-fine image dehashing using deep pyramidal residual learning | |
Cho | Dynamic RNN-CNN based malware classifier for deep learning algorithm | |
CN114972886A (en) | Image steganography analysis method | |
Rui et al. | Data Reconstruction based on supervised deep auto-encoder | |
KR102184655B1 (en) | Improvement Of Regression Performance Using Asymmetric tanh Activation Function | |
Zhang et al. | Wasserstein generative recurrent adversarial networks for image generating | |
CN116368487A (en) | Method for malware detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||