CN108717512A - A malicious code classification method based on convolutional neural networks - Google Patents

A malicious code classification method based on convolutional neural networks

Info

Publication number
CN108717512A
Authority
CN
China
Prior art keywords
malicious code
convolutional neural
neural networks
sorting technique
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810469552.5A
Other languages
Chinese (zh)
Other versions
CN108717512B (en)
Inventor
钱叶魁
卢喜东
杜江
杨瑞朋
雒朝峰
黄浩
李宇翀
王丙坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Campus Of Chinese People's Liberation Army Army Artillery Air Defense Academy
Original Assignee
Zhengzhou Campus Of Chinese People's Liberation Army Army Artillery Air Defense Academy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Campus Of Chinese People's Liberation Army Army Artillery Air Defense Academy
Priority to CN201810469552.5A
Publication of CN108717512A
Application granted
Publication of CN108717512B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Image Analysis (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention discloses a malicious code classification method based on convolutional neural networks. The method maps malicious code to a single-channel signal, generates a spectrogram of the signal using signal processing methods, converts the spectrogram into a fixed-size grayscale image using an image scaling algorithm, and finally classifies the malicious code with a convolutional neural network. Mapping the malicious code to a single-channel signal and then generating the corresponding spectrogram captures sufficient contextual information about the malicious code: the spectrogram reflects the time-domain and frequency-domain information of the signal as well as its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the method better captures the essential features of the malicious code, which effectively counters code reordering and junk code insertion and improves the classification accuracy for malicious code.

Description

A malicious code classification method based on convolutional neural networks
Technical field
The present invention relates to the field of malicious code classification, and in particular to a malicious code classification method based on signal analysis.
Background
With the rapid development of the Internet, malicious code has become one of the principal threats to Internet security and continues to grow rapidly. Static analysis is one of the common approaches to classifying and identifying malicious code. Static analysis methods in the prior art include methods based on image features of malicious code. For example, Nataraj L. et al. proposed the SPAM-GIST malicious code classification method (Nataraj L, Manjunath B S. SPAM: Signal Processing to Analyze Malware [Applications Corner][J]. IEEE Signal Processing Magazine, 2016, 33(2): 105-117), which maps a malicious code binary file to an image to describe its features, extracts the global GIST feature of the image using the multi-scale, multi-orientation responses of Gabor filters, uses this feature to represent the malicious code, and then classifies the malicious code with a nearest neighbor algorithm. However, malicious code encountered in practice is often obfuscated, for example by deformation or junk code insertion, so static analysis methods based on image features cannot effectively identify obfuscated malicious code, which leads to low classification accuracy. How to obtain an analysis method that efficiently identifies obfuscated malicious code is therefore a problem to be solved by those skilled in the art.
Summary of the invention
The present invention provides a malicious code classification method based on convolutional neural networks, which solves the problem that prior-art malicious code classification techniques cannot effectively identify obfuscated malicious code and therefore classify malicious code with low accuracy.
To solve the above technical problem, one technical solution adopted by the present invention is a malicious code classification method based on convolutional neural networks, comprising the steps of: signal mapping, in which the binary file of the malicious code is mapped to an audio signal file; spectrogram generation, in which a visual spectrogram of the malicious code is generated from the audio signal file; size modification, in which the spectrogram is converted to a fixed-size image using an image interpolation method; and code classification, in which the resized image is input to the convolutional neural network to classify the malicious code.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the signal mapping step, every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, and the one-dimensional array is then mapped to an audio signal file.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz, and the quantization depth (sample width) to 4 bytes.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, after every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, the method further comprises: scanning the one-dimensional array with non-overlapping windows of length 128 and calculating the information entropy of the data in each window; if the information entropy is 0, the data in that window is discarded.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the information entropy is $Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$, where $p_k$ is the probability that the value k occurs in the window.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the audio signal file is a WAV file.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the spectrogram generation comprises: (1) applying framing, windowing, and the discrete Fourier transform to the audio signal file of the malicious code to obtain X(n, k), where s(n) is the audio signal file of the malicious code, N is the window size, which equals the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n); (2) calculating the log-amplitude spectrum A(n, k) of the audio signal file of the malicious code, where $A(n,k) = 10\log_{10}(|X(n,k)| + e^{-1})$; (3) applying a gray-level transform to the log-amplitude spectrum A(n, k) to obtain G(n, k), where $A_{max}(n,k)$ denotes the maximum gray level of the spectrogram; (4) saving G(n, k) as a single-channel PNG image to obtain the spectrogram grayscale image of the malicious code.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the image interpolation method is the bicubic interpolation algorithm: $f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\, w(x-x_i)\, w(y-y_j)$, where (x, y) is the pixel to be interpolated in the spectrogram, $(x_i, y_j)$ (i, j = 0, 1, 2, 3) are the points of the 4*4 neighborhood of that pixel, and w(x) is the interpolation basis function, with parameter a = 0.5.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the nonlinear activation function of the convolutional layers and fully connected layers of the convolutional neural network is the ReLU function, the Leaky ReLU function, or the ELU function.
In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the size modification step, the spectrogram is converted to a fixed-size 128*128*1 image.
The beneficial effects of the present invention are as follows. The invention discloses a malicious code classification method based on convolutional neural networks that maps malicious code to a single-channel signal, generates a spectrogram of the signal using signal processing methods, converts the spectrogram into a fixed-size grayscale image using an image scaling algorithm, and finally classifies the malicious code with a convolutional neural network. Mapping the malicious code to a single-channel signal and then generating the corresponding spectrogram captures sufficient contextual information about the malicious code: the spectrogram reflects the time-domain and frequency-domain information of the signal as well as its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the method better captures the essential features of the malicious code, which effectively counters code reordering and junk code insertion and improves the classification accuracy for malicious code.
Description of the drawings
Fig. 1 is a schematic diagram of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention;
Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention;
Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention;
Fig. 4 is a flow chart of an embodiment of generating the spectrogram grayscale image of the signal in the malicious code classification method based on convolutional neural networks of the present invention;
Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method;
Fig. 6 is a schematic diagram of the ROC curves of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method;
Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes;
Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method described in the present invention.
Detailed description
To facilitate understanding of the present invention, the invention is described in more detail below with reference to the drawings and specific embodiments. Preferred embodiments of the invention are shown in the drawings. The invention can, however, be implemented in many different forms and is not limited to the embodiments described in this specification; these embodiments are provided so that the disclosure will be understood more thoroughly and completely.
It should be noted that, unless otherwise defined, all technical and scientific terms used in this specification have the meanings commonly understood by those skilled in the art to which the invention belongs. The terms used in the description of the invention serve only to describe specific embodiments and are not intended to limit the invention. The term "and/or" used in this specification includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a flow chart of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention. The method comprises the following steps:
Step S1, signal mapping: the binary file of the malicious code is mapped to an audio signal file. First, the binary file of the malicious code is read 8 bits at a time, each 8 bits being taken as one unsigned integer, so that the binary file is converted into a one-dimensional array; the one-dimensional array is then converted into an audio signal file according to its values.
Step S2, spectrogram generation: a visual spectrogram of the malicious code is generated from the audio signal file. The present invention applies the Fourier transform to each frame of the signal and then concatenates the results to generate the spectrogram. The spectrogram extracted from the malicious code signal captures sufficient contextual information about the malicious code, including the time-domain and frequency-domain information of the signal and its local and global information. Further, the present invention preprocesses the spectrogram to obtain a spectrogram grayscale image with 256 gray levels.
Step S3, size modification: the spectrogram is converted to a fixed-size image using an image interpolation method. So that the spectrograms can be analyzed with a convolutional neural network, the present invention uses an image interpolation method to transform all spectrograms to one fixed size.
Step S4, code classification: the resized image is input to the convolutional neural network to classify the malicious code. Various classifiers exist in the prior art for classifying malicious code, such as random forests and support vector machines. Because convolutional neural networks have local translation invariance, they recognize deformed and shifted images well and can therefore effectively counter common malicious code obfuscation methods such as code reordering and junk code insertion. In addition, a convolutional neural network has multiple convolution kernels and can better extract the essential features of the malicious code spectrogram. The present invention therefore uses a convolutional neural network to achieve high-accuracy classification of malicious code.
In the malicious code classification method based on convolutional neural networks of this embodiment, the binary file of the malicious code is converted into a single-channel audio signal file, the spectrogram of the audio signal is then obtained, and the spectrogram is input to the convolutional neural network to obtain the class of the malicious code. The spectrogram retains sufficient contextual information about the malicious code, including the time-frequency information of the signal and its local and global information. Because the convolutional neural network has local translation invariance and multiple convolution kernels, it can extract the essential features of the malicious code from the spectrogram and thereby improve classification accuracy.
Preferably, after the binary file of the malicious code is converted into a one-dimensional array, the method may further comprise: scanning the entire array with non-overlapping windows of length 128 and calculating the information entropy (Information Entropy, Ent) of the data in each window; if Ent = 0, the data in the window is discarded; otherwise, the data in the window is retained. This operation can, to a certain extent, counter malicious code obfuscation such as junk code insertion. The information entropy Ent is given by:
$$Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$$
where $p_k$ is the probability that the value k occurs in the window.
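The entropy-based window filtering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and treating a trailing window shorter than 128 values as a window of its own is an assumption, since the text does not say how such a remainder is handled.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (in bits) of the values in one window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def drop_zero_entropy_windows(values, window_len=128):
    """Scan non-overlapping windows and keep only those with entropy > 0."""
    kept = []
    for start in range(0, len(values), window_len):
        window = values[start:start + window_len]  # last window may be shorter (assumption)
        if window_entropy(window) > 0:
            kept.extend(window)
    return kept
```

A window filled with a single repeated value (for example, padding bytes) has entropy 0 and is removed, while any window containing at least two distinct values is kept unchanged.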
Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 2, step S21: the binary file of the malicious code is given, for example 000000000000000011000100……;
Step S22: binary data is read from the binary file of the malicious code sequentially in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0~255), [0, 0, 196, ……];
Step S23: the one-dimensional array is mapped to a segment of audio signal whose amplitude corresponds to the values of the one-dimensional array, and the generated audio signal is finally saved as a WAV file. The size of the WAV file depends on the size of the malicious code binary file, the number of channels (channel), the sampling frequency (framerate), and the quantization depth (sampwidth).
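Steps S21 to S23 can be illustrated with Python's standard `wave` module. The sketch below uses the parameters stated in this document (1 channel, 44.1 kHz, 4-byte samples); the function name is hypothetical, and storing each array value directly as a signed little-endian 32-bit sample is an assumption, since the text only says that the amplitude corresponds to the array values.

```python
import struct
import wave

def bytes_to_wav(data: bytes, wav_path: str) -> int:
    """Write each byte of `data` as one mono sample; return the sample count."""
    samples = list(data)  # one unsigned integer (0..255) per 8 bits
    # Assumption: each value is stored unchanged as a little-endian int32 sample.
    frames = struct.pack("<%di" % len(samples), *samples)
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)      # channel = 1
        wav.setsampwidth(4)      # sampwidth = 4 bytes
        wav.setframerate(44100)  # framerate = 44.1 kHz
        wav.writeframes(frames)
    return len(samples)
```

For the example bytes 0, 0, 196 from step S22, the resulting file contains three frames whose sample values are 0, 0, and 196.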
Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 3, step S31: the binary file of the malicious code is given, for example 000000000000000011000100……;
Step S32: binary data is read from the binary file of the malicious code sequentially in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0~255), [0, 0, 196, ……];
Step S33: the entire array is scanned with non-overlapping windows of length 128 and the information entropy (Information Entropy, Ent) of the data in each window is calculated; if Ent = 0, the data in the window is discarded; otherwise, the data in the window is retained, yielding the processed one-dimensional array. The information entropy Ent is given by:
$$Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$$
where $p_k$ is the probability that the value k occurs in the window;
Step S34: the processed one-dimensional array is mapped to a segment of audio signal whose amplitude corresponds to the values of the one-dimensional array, and the generated audio signal is finally saved as a WAV file.
In this embodiment, calculating the information entropy of the data in each fixed-length window and discarding the data in windows whose information entropy is 0 can, to a certain extent, counter malicious code obfuscation such as junk code insertion, which further improves the accuracy of the malicious code classification of the present invention.
Fig. 4 is a flow chart of an embodiment of generating the spectrogram grayscale image of the signal in the malicious code classification method based on convolutional neural networks of the present invention. In Fig. 4, step S41: framing, windowing, and the discrete Fourier transform are applied to the audio signal file of the malicious code to obtain X(n, k), where s(n) is the audio signal file of the malicious code, N is the window size, which equals the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n).
Step S42: the log-amplitude spectrum A(n, k) of the audio signal file of the malicious code is calculated.
Because a small number of X(n, k) values equal zero, a small constant is added when calculating the log-amplitude spectrum A(n, k), i.e.
$$A(n,k) = 10\log_{10}\left(|X(n,k)| + e^{-1}\right);$$
Step S43: a gray-level transform is applied to the log-amplitude spectrum A(n, k) to obtain G(n, k), where $A_{max}(n,k)$ denotes the maximum gray level of the spectrogram. In a computer, a single-channel grayscale image has at most 256 gray levels; when a set of values is converted into a single-channel grayscale image, every value in the array greater than 255 is replaced with 255. The gray-level transform is therefore applied to the log-amplitude spectrum A(n, k): it not only rescales the gray levels but also yields G(n, k) greater than 255 whenever A(n, k) is less than 0.
Step S44: G(n, k) is saved as a single-channel PNG image to obtain the spectrogram grayscale image of the malicious code. When G(n, k) is saved as a single-channel PNG image, values in G(n, k) greater than 255 are replaced with 255, yielding a spectrogram grayscale image with 256 gray levels.
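Steps S41 to S44 can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the function name and the `frame_len`/`frame_shift` parameters are hypothetical (the values of Table 1 are not reproduced here), only the non-negative frequency bins are kept, and the gray-level mapping G = 255 − 255·A/A_max is an assumed form chosen to be consistent with the statement that G exceeds 255 when A is negative. The sketch also assumes the signal holds at least one full frame and that the largest log amplitude is positive. Writing the PNG file itself is omitted.

```python
import cmath
import math

def spectrogram_gray(signal, frame_len, frame_shift):
    """Return one row of gray levels (0..255) per frame of the signal."""
    eps = math.exp(-1)  # small constant added before taking the logarithm
    log_amp = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len]  # rectangular window: all weights are 1
        row = []
        for k in range(frame_len // 2 + 1):      # non-negative frequencies only (assumption)
            x = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / frame_len)
                    for m in range(frame_len))   # direct DFT of the frame
            row.append(10 * math.log10(abs(x) + eps))
        log_amp.append(row)
    a_max = max(max(row) for row in log_amp)
    # Assumed mapping: the largest amplitude maps to 0; results are clipped to 0..255.
    return [[min(255, max(0, round(255 - 255 * a / a_max))) for a in row]
            for row in log_amp]
```

The direct DFT is quadratic in the frame length and is used here only to keep the sketch dependency-free; an FFT routine would normally replace it.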
Specifically, when generating the spectrogram grayscale image of the audio signal file of the malicious code, the present invention chooses the framing and windowing settings according to the duration of the audio signal file: the frame length and frame shift are determined from the duration of the audio signal file, and each frame of the signal is then Fourier transformed. Table 1 gives the frame length and frame shift corresponding to the different durations of audio signal files in the present invention.
Table 1. Frame length and frame shift for different audio signal file durations
Preferably, in order to retain the features of the spectrogram as far as possible and give the scaled image higher quality, the present invention scales the spectrogram using bicubic interpolation. Bicubic interpolation performs cubic interpolation on the gray values of the 16 points surrounding the interpolation point; it considers not only the gray-level influence of the 4 directly adjacent points but also the influence of the rate of change of the gray values among neighboring points. The bicubic interpolation algorithm requires an interpolation basis function w(x) to fit the data; the basis function used has the parameter a = 0.5.
The bicubic interpolation formula used is:
$$f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\, w(x-x_i)\, w(y-y_j)$$
where (x, y) is the pixel to be interpolated in the spectrogram and $(x_i, y_j)$ (i, j = 0, 1, 2, 3) are the points of the 4*4 neighborhood of that pixel.
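The weighted sum over the 4*4 neighborhood can be sketched as below. The kernel shown is the widely used Keys cubic convolution kernel, which is an assumption here because the basis function itself is not reproduced in this text; the function names, the border clamping, and the pixel-center coordinate mapping are likewise illustrative choices. The parameter `a` is left to the caller: the text states a = 0.5, while many implementations of this kernel form use a negative value, so the sign convention should be checked against the original formula.

```python
import math

def keys_kernel(x, a):
    """Keys cubic convolution kernel (assumed form of the basis function w)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_sample(img, x, y, a):
    """Interpolate img (a list of rows) at real-valued column x and row y."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for j in range(-1, 3):                       # 4 rows of the 4*4 neighborhood
        for i in range(-1, 3):                   # 4 columns of the neighborhood
            xi = min(max(x0 + i, 0), w - 1)      # clamp indices at the image border
            yj = min(max(y0 + j, 0), h - 1)
            weight = keys_kernel(x - (x0 + i), a) * keys_kernel(y - (y0 + j), a)
            value += img[yj][xi] * weight
    return value

def resize_bicubic(img, out_h, out_w, a):
    """Resize img to out_h x out_w, clipping results to 0..255."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = (r + 0.5) * h / out_h - 0.5          # map output pixel center to input row
        row = []
        for c in range(out_w):
            x = (c + 0.5) * w / out_w - 0.5      # map output pixel center to input column
            row.append(min(255, max(0, round(bicubic_sample(img, x, y, a)))))
        out.append(row)
    return out
```

Calling `resize_bicubic(spectrogram, 128, 128, a)` would produce the fixed 128*128 single-channel input described in this document.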
Preferably, the present invention uses a convolutional neural network to classify and identify malicious code. After the spectrogram grayscale image of the malicious code signal has been extracted, the spectrogram must be classified by a classification algorithm. The present invention uses a convolutional neural network (Convolutional Neural Network, CNN) to classify the spectrogram. A CNN is a multilayer neural network structure consisting mainly of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The input layer receives the spectrogram grayscale image; the convolutional layers extract the features of the spectrogram; the pooling layers use the local correlation of images to reduce the amount of data to be processed; the output layer maps the features to the finally predicted malicious code class. For the convolutional neural network:
(1) Convolutional layer. In a convolutional layer, the output of the previous layer serves as the input of the current layer, and a nonlinear activation function is introduced to improve the expressive power of the network model. Forward propagation in convolutional layer l can be expressed as:
$$x_j^l = f(u_j^l), \qquad u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l$$
where $u_j^l$ is the net activation of the j-th channel of layer l, $x_j^l$ is the output of the j-th channel of layer l, $k_{ij}^l$ is the convolution kernel, $b_j^l$ is the bias term of the convolutional layer, $M_j$ is the subset of input feature maps used to compute $u_j^l$, f is the activation function, and * denotes convolution.
(2) Pooling layer. In a convolutional network, the pooling function replaces the output of the network at a given position with a summary statistic of the neighboring outputs at that position. It reduces the number of features in the network while leaving most outputs unchanged. The pooling function downsamples the input feature map into the output feature map as follows:
$$x_j^l = f(u_j^l), \qquad u_j^l = \beta_j^l\, \mathrm{down}(x_j^{l-1}) + b_j^l$$
where $u_j^l$ is the net activation of the j-th channel of layer l, $\beta$ is the weight coefficient of the pooling layer, $b_j^l$ is the bias term of the pooling layer, and down(·) is the pooling function, which divides the input feature map $x_j^{l-1}$ into multiple non-overlapping image blocks using a sliding window and then takes the sum, the average, or the maximum of the pixels in each block. The present invention uses the max pooling function, which takes the maximum value in each image block.
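A minimal sketch of the max pooling operation described above, on a single feature map held as a list of rows. The block size of 2 and the function name are assumptions for illustration (Table 2 is not reproduced here), and rows or columns that do not fill a complete block are simply dropped.

```python
def max_pool(feature_map, size=2):
    """Split a 2-D map into non-overlapping size x size blocks; keep each block's maximum."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + dr][c * size + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(cols)]
            for r in range(rows)]
```

With a block size of 2, a 4*4 map is reduced to 2*2, so the amount of data passed to the next layer falls by a factor of four.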
(3) Fully connected layer. In a fully connected layer, the two-dimensional feature maps output by the convolutional layers are concatenated into a one-dimensional feature vector that serves as the input of the fully connected layer. The fully connected layer computes a weighted sum of its inputs and passes it through the activation function, which can be expressed as:
$$x^l = f(u^l), \qquad u^l = W^l x^{l-1} + b^l$$
where l denotes the layer number, x is the feature vector, W is the weight matrix of the current layer, b is the bias term of the current layer, and f is the activation function.
(4) Other structures. To improve the expressive power of the network model, nonlinear activation functions are introduced; common activation functions for convolutional layers and fully connected layers include ReLU, Leaky ReLU, and ELU. The activation function of the convolutional layers and fully connected layers in the present invention may specifically be the ReLU function, whose mathematical expression is:
$$f(x) = \max(0, x)$$
In addition, Softmax regression is used to turn the result of the forward propagation of the neural network into a probability distribution. Assuming the original neural network outputs are $y_1, y_2, y_3, \ldots, y_n$, the output after Softmax regression is:
$$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}$$
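The ReLU activation and the Softmax normalization can be written directly in a few lines. The sketch below is only illustrative; subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the result unchanged and is not taken from this document.

```python
import math

def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def softmax(outputs):
    """Turn raw network outputs into a probability distribution."""
    shift = max(outputs)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs of `softmax` are positive, sum to 1, and preserve the ordering of the inputs, so the predicted class is the index of the largest raw output.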
The present invention classifies spectrograms using a network model similar to VGG16. The convolutional neural network model is shown in Table 2, where Feature denotes the structures in the convolutional neural network, kernel size denotes the size of the kernel used, stride denotes the step length of the kernel, pad denotes padding (1 means zero padding is used so that the input and output feature maps of the current layer have the same size; 0 means no padding), and function denotes the function used by the current layer.
Table 2. Convolutional neural network model
After the above network model has been obtained, the loss function of the network must be minimized through training so that the model can predict the class of malicious code. The present invention measures the error using the cross-entropy loss function. Suppose the M training samples $\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_M,y_M)\}$ belong to k malicious code families, and y is the k-dimensional class vector of the malicious code family in one-hot encoding. For each sample, the cross entropy can be calculated as:
$$H(y^m, \hat{y}^m) = -\sum_{i=1}^{k} y_i^m \log \hat{y}_i^m$$
where $y_i^m$ is the i-th value of the class vector of sample m and $\hat{y}_i^m$ is the i-th value of the probability distribution produced for sample m by Softmax regression in the output layer. The network model can therefore be optimized by training on the loss function over all training samples:
$$J = \sum_{m=1}^{M} H(y^m, \hat{y}^m)$$
The loss function J is the sum of all the cross entropies, and the network model can be trained with an optimization algorithm such as stochastic gradient descent. Finally, the trained network model is used to predict the class of malicious code.
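The cross-entropy loss over one-hot labels can be sketched as follows. The function names are hypothetical, and the sketch assumes that every predicted probability at a position where the label is 1 is strictly positive, which holds for Softmax outputs.

```python
import math

def cross_entropy(one_hot, probs):
    """Cross entropy between a one-hot label vector and a predicted distribution."""
    # Terms where the label is 0 contribute nothing and are skipped.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def total_loss(labels, predictions):
    """Sum of the per-sample cross entropies over all training samples."""
    return sum(cross_entropy(y, p) for y, p in zip(labels, predictions))
```

Because the labels are one-hot, each per-sample term reduces to the negative logarithm of the probability assigned to the true family, so the loss falls as that probability approaches 1.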
Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method. The malicious code dataset used in the present invention comes from the Microsoft Malware Classification Challenge hosted by Microsoft on Kaggle; a total of 10860 malicious code binary files in 9 classes were selected for the experiments. Table 3 gives the basic information of the malicious code dataset used in the present invention.
Table 3. Malicious code dataset
Malicious code family    Class number    Quantity
Ramnit                   0               1533
Lollipop                 1               2478
Kelihos_ver3             2               2942
Vundo                    3               475
Simda                    4               42
Tracur                   5               751
Kelihos_ver1             6               398
Obfuscator.ACY           7               1228
Gatak                    8               1013
Specifically, in the experiments of the present invention, 80% of the malicious code dataset is used as the training set and 20% as the test set. When the binary file of malicious code is mapped to a WAV file, the number of channels (channel) is set to 1, the sampling frequency (framerate) to 44.1 kHz, and the quantization depth (sampwidth) to 4 bytes. In addition, in the malicious code classification method based on convolutional neural networks of the present invention, the generated spectrogram is scaled to a 128*128*1 grayscale image as the input of the convolutional neural network. The SPAM-GIST malicious code classification method achieves its best results when K = 1 in the k-nearest neighbor (K-Nearest Neighbor, KNN) classification algorithm, so K is set to 1 for the SPAM-GIST malicious code classification method in the present invention.
Specifically, the present invention uses four evaluation metrics, accuracy (Accuracy), macro precision (macro_P), macro recall (macro_R), and macro F1 (macro_F1), to evaluate how well the malicious code classification method based on convolutional neural networks classifies malicious code. A further metric used in the present invention is the ROC (Receiver Operating Characteristic) curve, whose vertical axis is the true positive rate (True Positive Rate, TPR) and whose horizontal axis is the false positive rate (False Positive Rate, FPR); the area under the ROC curve, AUC (Area Under ROC Curve), is used to evaluate the classification performance of the convolutional neural network.
For a multi-class problem, each pairwise combination of classes corresponds to one confusion matrix. The true positive rate, false positive rate, recall, and precision are calculated on each confusion matrix and then averaged to obtain TPR, FPR, macro precision (macro_P), and macro recall (macro_R). The evaluation indices are calculated as follows:
Here TP, FP, FN, and TN denote, respectively, positive samples identified as positive by the classifier, negative samples identified as positive, positive samples identified as negative, and negative samples identified as negative. TPR_S, FPR_S, P, and R are the true positive rate, false positive rate, precision, and recall of each confusion matrix.
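The formulas themselves are not reproduced in this text. As an assumption, the standard definitions that match the symbols above would read as follows, where n is the number of confusion matrices and the index i runs over them; the exact form used in the patent, in particular for macro_F1, may differ.

```latex
\begin{aligned}
TPR_S &= \frac{TP}{TP + FN}, &
FPR_S &= \frac{FP}{FP + TN}, &
P &= \frac{TP}{TP + FP}, &
R &= \frac{TP}{TP + FN}, \\
TPR &= \frac{1}{n}\sum_{i=1}^{n} TPR_{S,i}, &
FPR &= \frac{1}{n}\sum_{i=1}^{n} FPR_{S,i}, &
macro\_P &= \frac{1}{n}\sum_{i=1}^{n} P_i, &
macro\_R &= \frac{1}{n}\sum_{i=1}^{n} R_i, \\
macro\_F1 &= \frac{2 \cdot macro\_P \cdot macro\_R}{macro\_P + macro\_R}.
\end{aligned}
```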
Specifically, Fig. 5 is a schematic diagram of the recognition results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method; Fig. 6 is a schematic diagram of the ROC curves of the two methods; Table 4 gives the confusion matrix of the malicious code classification method based on convolutional neural networks of the present invention; and Table 5 gives the confusion matrix of the SPAM-GIST malicious code classification method.
Table 4. Confusion matrix of the malicious code classification method based on convolutional neural networks
0 1 2 3 4 5 6 7 8
0 .976 .006 0 .003 .003 .003 .003 .006 0
1 .013 .983 0 0 0 0 0 .004 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 .99 0 0 0 .01 0
4 0 0 0 0 .86 0 0 .14 0
5 .013 0 0 .007 0 .973 0 .007 0
6 0 0 0 .014 0 0 .972 0 .014
7 .032 .004 .026 0 0 0 .004 .915 .013
8 .011 0 0 0 0 0 .011 .022 .956
Table 5. Confusion matrix of the SPAM-GIST malicious code classification method
0 1 2 3 4 5 6 7 8
0 .879 .01 0 .01 .003 .038 .003 .006 .051
1 .004 .964 .006 .004 .002 .012 0 .004 .004
2 0 0 1 0 0 0 0 0 0
3 0 0 0 .965 0 .012 0 .023 0
4 .182 .091 0 0 .636 .091 0 0 0
5 .007 .014 0 .014 0 .951 .014 0 0
6 0 0 .014 0 0 .028 .944 0 .014
7 .046 .025 0 .013 .004 .025 .004 .866 .017
8 .019 .005 .024 .015 0 .019 0 .01 .908
As Fig. 5, Fig. 6, Table 4, and Table 5 show, the malicious code classification method based on convolutional neural networks of the present invention classifies malicious code substantially better than the SPAM-GIST malicious code classification method, both overall and for each individual malicious code family. That is, the method of the present invention identifies malicious code better and can meet the classification accuracy that practical applications require. In addition, Fig. 6 shows AUC = 0.978 for the method of the present invention and AUC = 0.953 for the SPAM-GIST method, so the classification performance of the method of the present invention is better than that of the SPAM-GIST method.
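The per-family comparison can be checked directly from the diagonals of Tables 4 and 5. The sketch below assumes that each row of the tables corresponds to a true family (the rows sum to 1), so that a diagonal entry is that family's recall. The helper names are illustrative, and the derived means and gains are not figures stated in the patent.

```python
# Diagonal entries copied from Table 4 (CNN-based method) and Table 5 (SPAM-GIST),
# in class-number order 0..8.
CNN_DIAGONAL = [0.976, 0.983, 1.0, 0.99, 0.86, 0.973, 0.972, 0.915, 0.956]
SPAM_GIST_DIAGONAL = [0.879, 0.964, 1.0, 0.965, 0.636, 0.951, 0.944, 0.866, 0.908]


def mean(values):
    """Unweighted mean; over per-family recalls this is the macro recall."""
    return sum(values) / len(values)


def per_family_gain(ours, baseline):
    """Difference of the diagonal entries, family by family."""
    return [round(a - b, 3) for a, b in zip(ours, baseline)]


gains = per_family_gain(CNN_DIAGONAL, SPAM_GIST_DIAGONAL)
for class_number, gain in enumerate(gains):
    print(f"class {class_number}: gain {gain:+.3f}")
print(f"mean of diagonal, CNN-based: {mean(CNN_DIAGONAL):.3f}")
print(f"mean of diagonal, SPAM-GIST: {mean(SPAM_GIST_DIAGONAL):.3f}")
```

Under this reading, no family is classified worse by the CNN-based method, and the largest gain is for class 4 (Simda), the family with the fewest samples.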
Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes. In the present invention, the bicubic interpolation algorithm is further used to resize the spectrogram of the malicious code to images of different sizes, in order to verify how image size affects the classification performance of the method. In this experiment, the total amount of test data (TNum) and the amount of correctly classified data (RNum) are counted for each family, and Accuracy, P, and R are calculated. Since the Accuracy of each family in multi-class classification is P, only P and R need to be calculated. The results are shown in Table 6, which gives the classification results of the method of the present invention for spectrograms of different sizes for each family.
Table 6. Classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes for each family
As Table 6 and Fig. 7 show, the size of the spectrogram has little influence on the correct recognition rate (Accuracy) of the malicious code classification method based on convolutional neural networks of the present invention.
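The resizing step can be sketched in pure Python with the cubic convolution kernel commonly used for bicubic interpolation. This is an illustrative sketch, not the patent's implementation: the kernel form, the border clamping, the pixel-center alignment, and all function names are assumptions. The kernel below is usually applied with a = -0.5, whereas the claims state a = 0.5; whether that reflects a different sign convention cannot be recovered from this text, so `a` is left as a parameter.

```python
import math


def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Cubic convolution kernel w(x) with free parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_sample(img, x: float, y: float, a: float = -0.5) -> float:
    """Interpolate a grayscale image (list of rows) at fractional (x, y)."""
    height, width = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for j in range(-1, 3):          # 4*4 neighborhood around (x0, y0)
        for i in range(-1, 3):
            xi = min(max(x0 + i, 0), width - 1)   # clamp at borders (assumption)
            yj = min(max(y0 + j, 0), height - 1)
            weight = cubic_kernel(x - (x0 + i), a) * cubic_kernel(y - (y0 + j), a)
            value += img[yj][xi] * weight
    return value


def resize_bicubic(img, out_width: int, out_height: int, a: float = -0.5):
    """Resize a grayscale image to out_width x out_height, clipping to 0..255."""
    height, width = len(img), len(img[0])
    scale_x, scale_y = width / out_width, height / out_height
    out = []
    for r in range(out_height):
        y = (r + 0.5) * scale_y - 0.5     # align pixel centers (assumption)
        row = []
        for c in range(out_width):
            x = (c + 0.5) * scale_x - 0.5
            row.append(min(255, max(0, round(bicubic_sample(img, x, y, a)))))
        out.append(row)
    return out
```

Calling `resize_bicubic(spectrogram, 128, 128)` would produce the fixed 128*128 single-channel input described in the text; in practice an image library's bicubic resampling would normally be used instead of this loop.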
Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method described in the present invention. In the present invention, two kinds of spectrogram are further used as input to the convolutional neural network in order to analyze their influence on the classification of the malicious code: the spectrogram of the audio signal converted directly from the binary file of the malicious code, and the spectrogram of the audio signal converted from the binary file after information entropy processing. The former is denoted the MCCNN_ORI method and the latter the MCCNN method. In this experiment, the spectrogram size is 128*128*1. As Fig. 8 shows, MCCNN_ORI and MCCNN achieve almost identical experimental results.
In summary, the malicious code classification method based on convolutional neural networks of the present invention is little affected by parameters such as its external input and is itself highly stable. The method therefore has strong applicability and generalizability.
In addition, the present invention further counts the amount of data discarded from each sample in MCCNN. The statistics are shown in Table 7, where SamNum denotes the number of samples and AbdNum denotes the amount of data discarded from a sample. For example, the row of Table 7 with AbdNum <1000 and SamNum 1203 means that 1203 samples each have a discarded data amount below 1000.
Table 7. Amount of data discarded in MCCNN
AbdNum        SamNum
<1000         1203
1000~5000     5701
5000~10000    1254
>10000        2702
As Table 7 shows, even though MCCNN_ORI contains far more junk code than MCCNN, the malicious code classification method based on convolutional neural networks of the present invention still obtains the same classification results. The method proposed by the present invention can therefore effectively overcome junk code insertion.
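The information entropy processing that produces the discarded data counted above (a non-overlapping window of length 128, with zero-entropy windows discarded, as stated in the claims) can be sketched as follows. The function names are illustrative. The base-2 logarithm and the handling of a final window shorter than 128 are assumptions, since the entropy formula is not reproduced in this text; the zero-entropy test itself does not depend on the base.

```python
import math
from collections import Counter
from typing import Tuple


def window_entropy(window: bytes) -> float:
    """Shannon entropy of the byte values in one window (base 2, an assumption)."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def drop_zero_entropy_windows(data: bytes, window: int = 128) -> Tuple[bytes, int]:
    """Scan non-overlapping windows; discard those whose entropy is 0.

    Returns the kept bytes and the number of discarded bytes. A trailing
    window shorter than `window` is treated like any other (assumption).
    """
    kept = bytearray()
    discarded = 0
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        if window_entropy(chunk) == 0:   # all bytes in the window are identical
            discarded += len(chunk)
        else:
            kept += chunk
    return bytes(kept), discarded
```

A window has zero entropy exactly when it holds a single repeated byte value, so this step removes constant padding runs while leaving all other content in place.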
In summary, the present invention discloses a malicious code classification method based on convolutional neural networks that combines signal processing techniques with the representational power of convolutional neural networks: the malicious code is mapped to a single-channel signal, a spectrogram of the signal is generated by signal processing, the spectrogram is converted to a grayscale image of fixed size by an image scaling algorithm, and a convolutional neural network finally classifies the malicious code. In the method of the present invention, generating the spectrogram after mapping the malicious code to a single-channel signal captures sufficient context information about the malicious code: the spectrogram reflects not only the time-domain and frequency-domain information of the signal but also its local and global information. In addition, owing to properties such as the local translation invariance of convolutional neural networks, the essential features of the malicious code are captured better, and situations such as code rearrangement and junk code insertion are effectively overcome, thereby improving the classification accuracy for malicious code.
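The spectrogram generation step summarized above (framing, rectangular windowing, discrete Fourier transform, logarithmic amplitude, gray-scale mapping) can be sketched in pure Python. Several details are assumptions, because the corresponding formulas are not reproduced in this text: the frames are taken without overlap, the offset inside the logarithm is read as e^-1, and min-max scaling to 0..255 stands in for the gray-scale transformation. All function names are illustrative.

```python
import cmath
import math


def split_frames(signal, frame_length):
    """Non-overlapping frames of equal length; an incomplete tail is dropped."""
    last_start = len(signal) - frame_length
    return [signal[i:i + frame_length] for i in range(0, last_start + 1, frame_length)]


def dft(frame):
    """Direct discrete Fourier transform; a rectangular window leaves the frame unchanged."""
    n = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def log_amplitude(frame):
    """A = 10 * log10(|X| + e^-1); the offset is an assumed reading of the text."""
    return [10 * math.log10(abs(x) + math.exp(-1)) for x in dft(frame)]


def spectrogram_gray(signal, frame_length):
    """Rows are frames, columns are frequency bins, values are gray levels 0..255."""
    spec = [log_amplitude(f) for f in split_frames(signal, frame_length)]
    if not spec:
        return []
    low = min(min(row) for row in spec)
    high = max(max(row) for row in spec)
    scale = (high - low) or 1.0          # avoid division by zero for flat input
    # Min-max scaling is a stand-in for the gray-scale transformation.
    return [[round(255 * (v - low) / scale) for v in row] for row in spec]
```

The direct DFT costs O(N^2) per frame and is shown only for clarity; an FFT routine would be the practical choice for full-size samples.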
The above are only embodiments of the present invention and do not limit its scope. Any equivalent structural transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims (10)

1. A malicious code classification method based on convolutional neural networks, characterized by comprising the steps of:
Signal mapping: mapping the binary file of the malicious code to an audio signal file;
Spectrogram generation: generating a spectrogram of the malicious code by visualizing the audio signal file;
Size modification: changing the spectrogram to an image of fixed size using an image interpolation method;
Code classification: inputting the resized image to the convolutional neural network to classify the malicious code.
2. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the signal mapping, every 8 bits are read as one unsigned integer so as to convert the binary file of the malicious code into a one-dimensional array, and the one-dimensional array is then mapped to an audio signal file.
3. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz, and the quantization width to 4 bytes.
4. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, after every 8 bits are read as one unsigned integer to convert the binary file of the malicious code into a one-dimensional array, the method further comprises: scanning the one-dimensional array with a non-overlapping window of length 128 and calculating the information entropy of the data in the window; if the information entropy is 0, the data in the window are discarded.
5. The malicious code classification method based on convolutional neural networks according to claim 4, characterized in that the information entropy is: wherein p_k is the probability with which the value k occurs in the window.
6. The malicious code classification method based on convolutional neural networks according to any one of claims 2-5, characterized in that the audio signal file is a WAV file.
7. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the spectrogram generation comprises:
(1) performing framing, windowing, and discrete Fourier transform processing on the audio signal file of the malicious code, i.e., wherein s(n) is the audio signal file of the malicious code, N is the window size, identical to the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n);
(2) calculating the logarithmic amplitude spectrum A(n, k) of the audio signal file of the malicious code, wherein $A(n, k) = 10\log_{10}(|X(n, k)| + e^{-1})$;
(3) performing gray-scale transformation on the logarithmic amplitude spectrum A(n, k): wherein $A_{max}(n, k)$ denotes the maximum value of the spectrogram gray level;
(4) saving G(n, k) as a single-channel PNG image to obtain the grayscale spectrogram image of the malicious code.
8. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the image interpolation method is the bicubic interpolation algorithm:
wherein (x, y) is the pixel to be interpolated in the spectrogram, (x_i, y_j) (i, j = 0, 1, 2, 3) are the points of its 4*4 pixel neighborhood, and w(x) is the interpolation basis function:
wherein a = 0.5.
9. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the nonlinear activation function of the convolutional layers and fully connected layers in the convolutional neural network comprises the ReLU function, the Leaky ReLU function, or the ELU function.
10. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the size modification, the spectrogram is changed to an image of fixed size 128*128*1.
CN201810469552.5A 2018-05-16 2018-05-16 Malicious code classification method based on convolutional neural network Active CN108717512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810469552.5A CN108717512B (en) 2018-05-16 2018-05-16 Malicious code classification method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN108717512A true CN108717512A (en) 2018-10-30
CN108717512B CN108717512B (en) 2021-06-18

Family

ID=63900133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810469552.5A Active CN108717512B (en) 2018-05-16 2018-05-16 Malicious code classification method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108717512B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090281981A1 (en) * 2008-05-06 2009-11-12 Chen Barry Y Discriminant Forest Classification Method and System
CN105989288A (en) * 2015-12-31 2016-10-05 武汉安天信息技术有限责任公司 Deep learning-based malicious code sample classification method and system
CN106096411A (en) * 2016-06-08 2016-11-09 浙江工业大学 A kind of Android malicious code family classification method based on bytecode image clustering
CN107092829A (en) * 2017-04-21 2017-08-25 中国人民解放军国防科学技术大学 A kind of malicious code detecting method based on images match
CN107392019A (en) * 2017-07-05 2017-11-24 北京金睛云华科技有限公司 A kind of training of malicious code family and detection method and device
CN107609399A (en) * 2017-09-09 2018-01-19 北京工业大学 Malicious code mutation detection method based on NIN neutral nets
CN107908963A (en) * 2018-01-08 2018-04-13 北京工业大学 A kind of automatic detection malicious code core feature method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Yimin et al.: "Android malicious code family classification method based on bytecode images", Wanfang Database *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572393A (en) * 2019-09-09 2019-12-13 河南戎磐网络科技有限公司 Malicious software traffic classification method based on convolutional neural network
CN110751955B (en) * 2019-09-23 2022-03-01 山东大学 Sound event classification method and system based on time-frequency matrix dynamic selection
CN110751955A (en) * 2019-09-23 2020-02-04 山东大学 Sound event classification method and system based on time-frequency matrix dynamic selection
CN110704842A (en) * 2019-09-27 2020-01-17 山东理工大学 Malicious code family classification detection method
CN110659495A (en) * 2019-09-27 2020-01-07 山东理工大学 Malicious code family classification method
CN111291814A (en) * 2020-02-15 2020-06-16 河北工业大学 Crack identification algorithm based on convolution neural network and information entropy data fusion strategy
CN111291814B (en) * 2020-02-15 2023-06-02 河北工业大学 Crack identification algorithm based on convolutional neural network and information entropy data fusion strategy
CN111552963A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software classification method based on structural entropy sequence
CN111552965A (en) * 2020-04-07 2020-08-18 哈尔滨工程大学 Malicious software classification method based on PE (provider edge) header visualization
CN111783088A (en) * 2020-06-03 2020-10-16 杭州迪普科技股份有限公司 Malicious code family clustering method and device and computer equipment
CN111783088B (en) * 2020-06-03 2023-04-28 杭州迪普科技股份有限公司 Malicious code family clustering method and device and computer equipment
CN113204764A (en) * 2021-04-02 2021-08-03 武汉大学 Unsigned binary indirect control flow identification method based on deep learning
CN114579970A (en) * 2022-05-06 2022-06-03 南京明博互联网安全创新研究院有限公司 Convolutional neural network-based android malicious software detection method and system
CN114579970B (en) * 2022-05-06 2022-07-22 南京明博互联网安全创新研究院有限公司 Convolutional neural network-based android malicious software detection method and system

Also Published As

Publication number Publication date
CN108717512B (en) 2021-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant