CN108717512A

CN108717512A - A malicious code classification method based on convolutional neural networks

Info

Publication number: CN108717512A
Application number: CN201810469552.5A
Authority: CN
Inventors: 钱叶魁; 卢喜东; 杜江; 杨瑞朋; 雒朝峰; 黄浩; 李宇翀; 王丙坤
Original assignee: Zhengzhou Campus Of Chinese People's Liberation Army Army Artillery Air Defense Academy
Current assignee: Zhengzhou Campus Of Chinese People's Liberation Army Army Artillery Air Defense Academy
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2018-10-30
Anticipated expiration: 2038-05-16
Also published as: CN108717512B

Abstract

The invention discloses a malicious code classification method based on convolutional neural networks. The method maps malicious code to a single-channel signal, generates a spectrogram of the signal by signal processing, converts the spectrogram to a fixed-size grayscale image with an image scaling algorithm, and finally classifies the malicious code with a convolutional neural network. Because the spectrogram is generated after the malicious code has been mapped to a single-channel signal, it captures sufficient contextual information about the malicious code: it reflects the time-domain and frequency-domain information of the signal as well as its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the method better captures the essential characteristics of malicious code, effectively overcomes obfuscation such as code reordering and junk code insertion, and improves the classification accuracy for malicious code.

Description

A malicious code classification method based on convolutional neural networks

Technical field

The present invention relates to the field of malicious code classification, and in particular to a malicious code classification method based on signal analysis.

Background

With the rapid development of the Internet, malicious code has become one of the main threats to Internet security, and its volume is growing rapidly. Static analysis is one of the common approaches to classifying and identifying malicious code. Static analysis methods in the prior art include methods based on image features of malicious code. For example, Nataraj L et al. proposed the SPAM-GIST malicious code classification method (Nataraj L, Manjunath B S. SPAM: Signal Processing to Analyze Malware [Applications Corner][J]. IEEE Signal Processing Magazine, 2016, 33(2): 105-117), which maps the binary file of malicious code to an image to describe its features, extracts the global GIST feature of the image using the multi-scale, multi-orientation feature extraction of Gabor filters, represents the malicious code by this feature, and then classifies the malicious code with the nearest neighbor algorithm. However, malicious code encountered in practice is often obfuscated, for example by deformation or junk code insertion. Static analysis methods based on image features therefore cannot effectively identify the obfuscated malicious code, and the classification accuracy is low. How to obtain an analysis method that efficiently identifies obfuscated malicious code is thus a problem to be solved by those skilled in the art.

Summary of the invention

The present invention provides a malicious code classification method based on convolutional neural networks, which solves the problem that prior-art malicious code classification techniques cannot effectively identify obfuscated malicious code and therefore classify malicious code with low accuracy.

To solve the above technical problem, one technical solution adopted by the present invention is to provide a malicious code classification method based on convolutional neural networks, comprising the steps of: signal mapping, in which the binary file of the malicious code is mapped to an audio signal file; spectrogram generation, in which the audio signal file is visualized as the spectrogram of the malicious code; size modification, in which the spectrogram is changed to a fixed-size image using an image interpolation method; and code classification, in which the resized image is input to the convolutional neural network to classify the malicious code.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the signal mapping, the binary file of the malicious code is read 8 bits at a time, each 8-bit group being taken as one unsigned integer, so that the file is converted into a one-dimensional array; the one-dimensional array is then mapped to an audio signal file.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz, and the sample width (quantization) to 4 bytes.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, after the binary file of the malicious code has been converted into a one-dimensional array by reading 8 bits as one unsigned integer, the method further comprises: scanning the one-dimensional array with a non-overlapping window of length 128, computing the information entropy of the data in each window, and discarding the data in the window if the information entropy is 0.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the information entropy is $Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$, where $p_k$ is the probability that the value k occurs in the window.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the audio signal file is a wav file.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the spectrogram generation comprises: (1) applying framing, windowing, and the discrete Fourier transform to the audio signal file of the malicious code to obtain the coefficients X(n, k), where s(n) is the audio signal of the malicious code, N is the window size, which equals the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n); (2) computing the log-amplitude spectrum A(n, k) of the audio signal file of the malicious code, where $A(n,k) = 10\log_{10}(|X(n,k)| + e^{-1})$; (3) applying a gray-level transformation to the log-amplitude spectrum A(n, k) to obtain G(n, k), where A_max(n, k) denotes the maximum gray-level value of the spectrogram; (4) saving G(n, k) as a single-channel PNG image to obtain the spectrogram grayscale image of the malicious code.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the image interpolation method is the bicubic interpolation algorithm $f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\, w(x-x_i)\, w(y-y_j)$, where (x, y) is the pixel to be interpolated in the spectrogram, $(x_i, y_j)$ (i, j = 0, 1, 2, 3) are the 4*4 neighborhood points of that pixel, and w(x) is the interpolation basis function, in which a = 0.5.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, the nonlinear activation function of the convolutional layers and fully connected layers in the convolutional neural network is the ReLU function, the Leaky ReLU function, or the ELU function.

In another embodiment of the malicious code classification method based on convolutional neural networks of the present invention, in the size modification, the spectrogram is changed to a fixed-size 128*128*1 image.

The beneficial effects of the invention are as follows. The invention discloses a malicious code classification method based on convolutional neural networks, which maps malicious code to a single-channel signal, generates a spectrogram of the signal by signal processing, converts the spectrogram to a fixed-size grayscale image with an image scaling algorithm, and finally classifies the malicious code with a convolutional neural network. Because the spectrogram is generated after the malicious code has been mapped to a single-channel signal, it captures sufficient contextual information about the malicious code: it reflects the time-domain and frequency-domain information of the signal as well as its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the method better captures the essential characteristics of malicious code, effectively overcomes obfuscation such as code reordering and junk code insertion, and improves the classification accuracy for malicious code.

Description of the drawings

Fig. 1 is a schematic diagram of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention;

Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention;

Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention;

Fig. 4 is a flow chart of an embodiment of generating the spectrogram grayscale image of the signal in the malicious code classification method based on convolutional neural networks of the present invention;

Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method;

Fig. 6 is a schematic diagram of the ROC curves of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method;

Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes;

Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method described in the present invention.

Detailed description

To facilitate understanding of the present invention, it is described in more detail below with reference to the drawings and specific embodiments. Preferred embodiments of the invention are shown in the drawings. The invention may, however, be implemented in many different forms and is not limited to the embodiments described in this specification; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and completely.

It should be noted that, unless otherwise defined, all technical and scientific terms used in this specification have the meaning commonly understood by those skilled in the art to which the invention belongs. The terms used in the description of the invention serve only to describe specific embodiments and are not intended to limit the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.

Fig. 1 is a flow chart of an embodiment of the malicious code classification method based on convolutional neural networks of the present invention. The method specifically comprises the following steps:

Step S1, signal mapping: the binary file of the malicious code is mapped to an audio signal file. First, the binary file of the malicious code is read sequentially, 8 bits at a time, each 8-bit group being taken as one unsigned integer, so that the file is converted into a one-dimensional array; the one-dimensional array is then converted to an audio signal file according to its values.

Step S2, spectrogram generation: the audio signal file is visualized as the spectrogram of the malicious code. The present invention processes each frame of the signal with the Fourier transform and then concatenates the results to form the spectrogram. The extracted spectrogram of the malicious code signal captures sufficient contextual information about the malicious code, including the time-domain and frequency-domain information of the signal and its local and global information. The present invention further preprocesses the spectrogram to obtain a spectrogram grayscale image with 256 gray levels.

Step S3, size modification: the spectrogram is changed to a fixed-size image using an image interpolation method. So that the spectrograms can be analyzed by a convolutional neural network, the present invention transforms all spectrograms to one fixed size using an image interpolation method.

Step S4, code classification: the resized image is input to the convolutional neural network to classify the malicious code. Various classifiers exist in the prior art for classifying malicious code, such as random forests and support vector machines. Because convolutional neural networks have local translation invariance, they recognize deformed or shifted images well and can therefore effectively overcome common obfuscation methods such as code reordering and junk code insertion; in addition, convolutional neural networks have multiple convolution kernels and can better extract the essential features of the malicious code spectrogram. The present invention therefore specifically uses a convolutional neural network to achieve high-accuracy classification of malicious code.

In the malicious code classification method based on convolutional neural networks of this embodiment, the binary file of the malicious code is converted into a single-channel audio signal file, the spectrogram of the audio signal is obtained, and the spectrogram is input to the convolutional neural network to obtain the class to which the malicious code belongs. The spectrogram retains sufficient contextual information about the malicious code, including the time-frequency information of the signal and its local and global information; the convolutional neural network has local translation invariance and multiple convolution kernels, so it can extract the essential features of the malicious code from the spectrogram and improve the classification accuracy.

Preferably, after the binary file of the malicious code has been converted into a one-dimensional array, the method may further comprise: scanning the entire array with a non-overlapping window of length 128 and computing the information entropy (Information Entropy, Ent) of the data in each window; if Ent = 0, the data in the window is discarded; otherwise, the data in the window is retained. This operation can, to some extent, overcome malicious code obfuscation such as junk code insertion. The information entropy Ent is:

$$Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$$

where $p_k$ is the probability that the value k occurs in the window.

Fig. 2 is a flow chart of an embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 2, step S21: the binary file of the malicious code is given: 000000000000000011000100……;

Step S22: binary data is read sequentially from the binary file of the malicious code in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0 to 255): [0, 0, 196, ……];

Step S23: the one-dimensional array is mapped to an audio signal whose amplitudes correspond to the values of the array, and the generated audio signal is finally saved as a wav file. The size of the wav file depends on the size of the malicious code binary file, the number of channels (channel), the sampling frequency (framerate), and the sample width (sampwidth).
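Steps S21 to S23 can be sketched with Python's standard `wave` module. This is a minimal illustration, not the inventors' implementation: the function name `bytes_to_wav` is hypothetical, and the source does not say how a byte value becomes an amplitude, so the sketch assumes each value 0 to 255 is written directly as one signed little-endian 32-bit sample.

```python
import struct
import wave

FRAMERATE = 44100  # sampling frequency stated in the source (44.1 kHz)
SAMPWIDTH = 4      # sample width stated in the source (4 bytes)


def bytes_to_wav(data: bytes, out) -> int:
    """Write each byte of `data` as one mono sample; return the sample count.

    `out` may be a file name or a writable, seekable binary file object.
    """
    # Step S22: every 8 bits become one unsigned integer in 0..255.
    samples = list(data)
    # Assumption: the value is used unchanged as a 32-bit signed amplitude.
    frames = struct.pack("<%di" % len(samples), *samples)
    # Step S23: save the signal as a single-channel wav file.
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPWIDTH)
        wav.setframerate(FRAMERATE)
        wav.writeframes(frames)
    return len(samples)
```

For example, `bytes_to_wav(open(path, "rb").read(), "sample.wav")` would convert one binary file, where `path` and the output name are placeholders. The resulting file size grows with the input size and the sample width, as the paragraph above notes.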

Fig. 3 is a flow chart of another embodiment of converting the binary file of malicious code into an audio signal file in the present invention. In Fig. 3, step S31: the binary file of the malicious code is given: 000000000000000011000100……;

Step S32: binary data is read sequentially from the binary file of the malicious code in units of 8 bits and converted into a one-dimensional array of unsigned integers (range 0 to 255): [0, 0, 196, ……];

Step S33: the entire array is scanned with a non-overlapping window of length 128, and the information entropy (Information Entropy, Ent) of the data in each window is computed; if Ent = 0, the data in the window is discarded; otherwise, the data in the window is retained, yielding the processed one-dimensional array. The information entropy Ent is:

$$Ent = -\sum_{k=0}^{255} p_k \log_2 p_k$$

where $p_k$ is the probability that the value k occurs in the window;

Step S34: the processed one-dimensional array is mapped to an audio signal whose amplitudes correspond to the values of the array, and the generated audio signal is finally saved as a wav file.

In this embodiment, the information entropy of all data within each fixed-length window is computed and the data in windows whose entropy is 0 is discarded. A window has zero entropy exactly when all of its values are identical, so this step removes runs of repeated filler values. It can, to some extent, overcome malicious code obfuscation such as junk code insertion, and further improves the accuracy of the malicious code classification of the present invention.
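The entropy filter of step S33 can be sketched as follows. The function names are hypothetical, base-2 logarithms are assumed (the base does not affect the Ent = 0 test), and the source does not say how a trailing window shorter than 128 values is handled; this sketch applies the same rule to it.

```python
import math
from collections import Counter


def window_entropy(window) -> float:
    """Shannon entropy (base 2) of the values in one window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def drop_zero_entropy_windows(values, size: int = 128):
    """Keep only the non-overlapping windows whose entropy is greater than 0."""
    kept = []
    for start in range(0, len(values), size):
        window = values[start:start + size]
        # Entropy is 0 exactly when every value in the window is the same.
        if window_entropy(window) > 0:
            kept.extend(window)
    return kept
```

A window of 128 zero bytes is dropped, while a window containing at least two distinct values is kept unchanged, so the order of the surviving data is preserved.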

Fig. 4 is a flow chart of an embodiment of generating the spectrogram grayscale image of the signal in the malicious code classification method based on convolutional neural networks of the present invention. In Fig. 4, step S41: framing, windowing, and the discrete Fourier transform are applied to the audio signal file of the malicious code to obtain the Fourier coefficients X(n, k), where s(n) is the audio signal of the malicious code, N is the window size, which equals the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n).

Step S42: compute the log-amplitude spectrum A(n, k) of the audio signal file of the malicious code.

Because a small number of the X(n, k) are equal to zero, a small value is added when computing the log-amplitude spectrum A(n, k), i.e.

$$A(n,k) = 10\log_{10}\left(|X(n,k)| + e^{-1}\right);$$

Step S43: apply a gray-level transformation to the log-amplitude spectrum A(n, k) to obtain G(n, k), where A_max(n, k) denotes the maximum gray-level value of the spectrogram. A single-channel grayscale image on a computer has at most 256 gray levels (0 to 255); when a group of values is converted to a single-channel grayscale image, every value in the array greater than 255 is replaced by 255. The gray-level transformation is therefore applied to the log-amplitude spectrum A(n, k): it rescales the gray levels and, in addition, yields G(n, k) greater than 255 wherever A(n, k) is less than 0.

Step S44: G(n, k) is saved as a single-channel PNG image to obtain the spectrogram grayscale image of the malicious code. When G(n, k) is saved as a single-channel PNG image, values in G(n, k) greater than 255 are replaced by 255, giving a spectrogram grayscale image with 256 gray levels.
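Steps S41 to S44 can be sketched in plain Python as below. Several parts are assumptions rather than the source's definitions: the names `frame_len` and `hop` are hypothetical (the actual frame length and frame shift come from Table 1), only the non-negative frequency bins are kept, and the gray-level transform is taken to be the linear form G = 255 - 255 * A / A_max, chosen only because it agrees with the statement that G exceeds 255 wherever A is negative. Writing the PNG itself is omitted.

```python
import cmath
import math


def log_amplitude_spectrogram(signal, frame_len, hop):
    """Return rows A[n][k] = 10*log10(|X(n,k)| + e^-1), one row per frame."""
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Rectangular window: the samples of the frame are used unweighted.
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):  # assumption: one-sided spectrum
            x = sum(s * cmath.exp(-2j * math.pi * k * m / frame_len)
                    for m, s in enumerate(frame))
            row.append(10 * math.log10(abs(x) + math.exp(-1)))
        rows.append(row)
    return rows


def to_gray(spec):
    """Assumed transform G = 255 - 255*A/A_max, clipped to 0..255."""
    a_max = max(max(row) for row in spec)
    if a_max <= 0:
        raise ValueError("sketch assumes a positive maximum log amplitude")
    return [[min(255, max(0, round(255 - 255 * a / a_max))) for a in row]
            for row in spec]
```

With this form the strongest spectral component maps to gray level 0 and every bin with negative log amplitude saturates at 255, which matches the clipping behaviour described in steps S43 and S44. The direct DFT used here is quadratic in the frame length and is meant only to make the computation explicit.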

Specifically, when generating the spectrogram grayscale image of the audio signal file of the malicious code, the present invention chooses the framing and windowing settings according to the duration of the audio signal file; that is, the frame length and frame shift are determined from the duration of the audio signal file, and the Fourier transform is then applied to each frame. Table 1 gives the frame length and frame shift corresponding to different durations of the audio signal file in the present invention.

Table 1 Frame length and frame shift corresponding to different audio signal file durations

Preferably, to preserve the features of the spectrogram as far as possible and give the scaled image higher quality, the present invention scales the spectrogram using bicubic interpolation. Bicubic interpolation performs cubic interpolation over the gray values of the 16 points surrounding the interpolation point; it accounts not only for the gray values of the 4 directly adjacent points but also for the rate of change of the gray values between neighboring points. The bicubic interpolation algorithm requires an interpolation basis function w(x) to fit the data; in the basis function used,

a = 0.5.

The bicubic interpolation formula used is:

$$f(x,y) = \sum_{i=0}^{3}\sum_{j=0}^{3} f(x_i,y_j)\, w(x-x_i)\, w(y-y_j)$$

where (x, y) is the pixel to be interpolated in the spectrogram and $(x_i, y_j)$ (i, j = 0, 1, 2, 3) are the 4*4 neighborhood points of that pixel.
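The following sketch shows one way to resize a grayscale image with a 4*4 bicubic neighborhood. It is not the source's exact algorithm: the basis function is not reproduced in the text, so the sketch substitutes the widely used Keys cubic convolution kernel with a free parameter `a` (default -0.5, an assumption; the source states a = 0.5 under a convention that cannot be recovered here). The coordinate mapping and the clamping of neighbors at the image border are also assumptions.

```python
import math


def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel w(x) with free parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_resize(img, out_h, out_w, a=-0.5):
    """Resize a 2-D list of gray values to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = r * in_h / out_h           # assumed source coordinate
        y0 = math.floor(y)
        row = []
        for c in range(out_w):
            x = c * in_w / out_w
            x0 = math.floor(x)
            value = 0.0
            for j in range(-1, 3):      # 4 rows of the neighborhood
                for i in range(-1, 3):  # 4 columns of the neighborhood
                    yy = min(max(y0 + j, 0), in_h - 1)  # clamp at border
                    xx = min(max(x0 + i, 0), in_w - 1)
                    value += (img[yy][xx]
                              * cubic_kernel(x - (x0 + i), a)
                              * cubic_kernel(y - (y0 + j), a))
            row.append(value)
        out.append(row)
    return out
```

A call such as `bicubic_resize(gray, 128, 128)` would bring a spectrogram to the 128*128 input size used later in the experiments. Because the kernel can overshoot, results may need rounding and clipping to 0..255 before being stored as an image.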

Preferably, the present invention classifies and identifies malicious code using a convolutional neural network. After the spectrogram grayscale image of the malicious code signal has been extracted, the spectrogram must be classified by a classification algorithm; the present invention specifically uses a convolutional neural network (Convolutional Neural Network, CNN) for this purpose. A CNN is a multilayer neural network structure consisting mainly of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The input layer receives the spectrogram grayscale image; the convolutional layers extract the features of the spectrogram; the pooling layers exploit the local correlation of the image to reduce the amount of data to be processed; the output layer maps the features to the finally predicted malicious code class. For the convolutional neural network:

(1) Convolutional layer. In a convolutional layer, the output of the previous layer serves as the input of the current layer, and a nonlinear activation function is introduced to improve the expressive power of the network model. The forward propagation of the convolutional layer at layer l can be expressed as:

$$x_j^l = f(u_j^l), \qquad u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l$$

where $u_j^l$ is the net activation of the j-th channel of layer l, $x_j^l$ is the output of the j-th channel of layer l, $k_{ij}^l$ is the convolution kernel, $b_j^l$ is the bias term of the convolutional layer, $M_j$ is the subset of input feature maps used to compute $u_j^l$, f is the activation function, and * denotes convolution.

(2) Pooling layer. In a convolutional network, the pooling function replaces the output of the network at a given position with a summary statistic of the neighboring outputs; it reduces the number of features in the network while leaving most of the outputs unchanged. The pooling function down-samples the input feature map to produce the output feature map according to:

$$x_j^l = f(u_j^l), \qquad u_j^l = \beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l$$

where $u_j^l$ is the net activation of the j-th channel of layer l, $\beta_j^l$ is the weight coefficient of the pooling layer, $b_j^l$ is the bias term of the pooling layer, and down(·) is the pooling function, which divides the input feature map $x_j^{l-1}$ into multiple non-overlapping image blocks by a sliding-window method and then takes the sum, the mean, or the maximum of the pixels in each block. The present invention uses the max pooling function, which takes the maximum value in each image block.

(3) Fully connected layer. In the fully connected layer, the two-dimensional feature maps output by all the convolutional layers are concatenated into a one-dimensional feature vector that serves as the input of the fully connected layer. The fully connected layer computes a weighted sum of its inputs and passes it through an activation function, which can be expressed as:

$$x^l = f(u^l), \qquad u^l = W^l x^{l-1} + b^l$$

where l denotes the layer index, x is the feature map, W is the weight matrix of the current layer, b is the bias term of the current layer, and f is the activation function.

(4) Other structures. To improve the expressive power of the network model, nonlinear activation functions are introduced; activation functions commonly used in convolutional layers and fully connected layers include ReLU, Leaky ReLU, and ELU. The activation function of the convolutional layers and fully connected layers in the present invention may specifically be the ReLU function, whose mathematical expression is:

f(x) = max(0, x)

In addition, Softmax regression is used to turn the result of the forward propagation of the neural network into a probability distribution. Assuming the original outputs of the neural network are $y_1, y_2, y_3, \ldots, y_n$, the output after Softmax regression is $\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}$.
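The building blocks (1) to (4) and the Softmax output can be illustrated with a small single-channel forward pass in plain Python. This is a simplified sketch under stated assumptions, not the network of the invention: all function names are hypothetical, the convolution is the sliding-window product without kernel flipping that most CNN libraries use, and the pooling step omits the weight coefficient and bias (equivalent to a weight of 1, a bias of 0, and an identity activation).

```python
import math


def relu(x):
    """ReLU activation f(x) = max(0, x)."""
    return max(0.0, x)


def conv2d_valid(image, kernel, bias=0.0):
    """Single-channel convolutional layer without padding, followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            u = bias + sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw))
            row.append(relu(u))
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Maximum over non-overlapping size x size blocks."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def dense(x, weights, biases, activation=relu):
    """Fully connected layer: activation(W x + b) on a flat input vector."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def softmax(y):
    """Softmax, shifted by max(y) for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]
```

Chaining `conv2d_valid`, `max_pool`, flattening, `dense`, and `softmax` gives a probability vector with one entry per class. Subtracting the maximum inside `softmax` leaves the result unchanged mathematically and only avoids overflow for large inputs.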

The present invention classifies the spectrograms using a network model similar to VGG16. The convolutional neural network model is shown in Table 2, where Feature denotes the various structures in the convolutional neural network, kernel size denotes the size of the kernel used, stride denotes the step length of the sliding window, pad denotes the padding (1 means zero padding is used so that the input and output feature maps of the current layer have equal size, and 0 means no padding), and function denotes the function used by the current layer.

Table 2 Convolutional neural network model

After the above network model has been obtained, a loss function is needed to train the network, so that the class of malicious code can then be predicted. The present invention measures the error with the cross-entropy loss function. Suppose there are training samples $\{(x^1,y^1),(x^2,y^2),(x^3,y^3),\ldots,(x^M,y^M)\}$ belonging to k malicious code families, where y is the k-dimensional class vector of the malicious code family in one-hot encoding. For each sample m, the cross entropy can be computed as:

$$L^m = -\sum_{i=1}^{k} y_i^m \log \hat{y}_i^m$$

where $y_i^m$ is the i-th value in the class vector of sample m and $\hat{y}_i^m$ is the i-th value in the probability distribution produced for sample m by Softmax regression at the output layer. The network model can therefore be optimized through the loss function over all training samples:

$$J = \sum_{m=1}^{M} L^m = -\sum_{m=1}^{M}\sum_{i=1}^{k} y_i^m \log \hat{y}_i^m$$

The loss function J is the sum of all the cross entropies, and the network model can be trained with an optimization algorithm such as stochastic gradient descent. Finally, the class of malicious code is predicted with the trained network model.
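The loss described above can be sketched as follows. The function names are hypothetical, natural logarithms are assumed, and the small constant `eps` that guards against log(0) is an addition of this sketch, not part of the source.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross entropy between a one-hot class vector and a Softmax output."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


def total_loss(labels, probs):
    """Loss J: the sum of the per-sample cross entropies."""
    return sum(cross_entropy(t, p) for t, p in zip(labels, probs))
```

Because the class vector is one-hot, each per-sample term reduces to the negative log of the probability assigned to the true family, so the loss falls as the network becomes more confident in the correct class.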

Fig. 5 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method. The malicious code data set used in the present invention comes from the Microsoft Malware Classification Challenge project run by Microsoft on Kaggle; a total of 10860 malicious code binary files in 9 classes were selected for the experiments. Table 3 gives the basic information of the malicious code data set used.

Table 3 Malicious code data set

Malicious code family	Class label	Quantity
Ramnit	0	1533
Lollipop	1	2478
Kelihos_ver3	2	2942
Vundo	3	475
Simda	4	42
Tracur	5	751
Kelihos_ver1	6	398
Obfuscator.ACY	7	1228
Gatak	8	1013

Specifically, in the experiments of the present invention, 80% of the malicious code data set is used as the training sample set and 20% as the test sample set. When the binary file of malicious code is mapped to a wav file, the number of channels (channel) is set to 1, the sampling frequency (framerate) to 44.1 kHz, and the sample width (sampwidth) to 4 bytes. In addition, in the malicious code classification method based on convolutional neural networks of the present invention, the generated spectrogram is scaled to a 128*128*1 grayscale image as the input of the convolutional neural network. In the SPAM-GIST malicious code classification method, the best result is obtained when K = 1 in the K-nearest-neighbor (K-Nearest Neighbor, KNN) classification algorithm; the value of K in the SPAM-GIST malicious code classification method is therefore set to 1 in the present invention.

Specifically, the present invention evaluates the classification performance of the malicious code classification method based on convolutional neural networks with four evaluation indexes: accuracy (Accuracy), macro precision (macro_P), macro recall (macro_R), and macro F1 (macro_F1). A further index is the ROC (Receiver Operating Characteristic) curve, whose vertical axis is the true positive rate (True Positive Rate, TPR) and whose horizontal axis is the false positive rate (False Positive Rate, FPR); the area under the ROC curve, AUC (Area Under ROC Curve), is used to evaluate the classification performance of the convolutional neural network.

For multi-class problems, each pairwise combination of classes corresponds to one confusion matrix; the true positive rate, false positive rate, precision, and recall are computed on each confusion matrix and then averaged to obtain TPR, FPR, macro precision (macro_P), and macro recall (macro_R). On each confusion matrix, TPR_S = TP / (TP + FN), FPR_S = FP / (FP + TN), P = TP / (TP + FP), and R = TP / (TP + FN).

Here TP, FP, FN, and TN denote, respectively, positive samples identified as positive by the classifier, negative samples identified as positive, positive samples identified as negative, and negative samples identified as negative. TPR_S, FPR_S, P, and R are the true positive rate, false positive rate, precision, and recall of each confusion matrix.
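The evaluation indexes can be computed from a multi-class confusion matrix of raw counts, as in the sketch below. Two points are assumptions, since the source's formulas are not reproduced: each class is treated one-versus-rest to obtain its TP, FP, and FN, and macro_F1 is taken as the harmonic mean of macro_P and macro_R, which is one common definition. The function name is hypothetical.

```python
def macro_metrics(confusion):
    """Return (accuracy, macro_P, macro_R, macro_F1).

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c, missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # others, taken as c
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(precisions) / n
    macro_r = sum(recalls) / n
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0
    accuracy = sum(confusion[c][c] for c in range(n)) / total
    return accuracy, macro_p, macro_r, macro_f1
```

Macro averaging gives every family the same weight regardless of its size, which matters for a data set in which the smallest family has 42 samples and the largest has 2942.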

Specifically, Fig. 5 is a schematic diagram of the recognition results of the malicious code classification method based on convolutional neural networks of the present invention and of the SPAM-GIST malicious code classification method; Fig. 6 is a schematic diagram of the ROC curves of the two methods; Table 4 gives the confusion matrix of the malicious code classification method based on convolutional neural networks of the present invention; Table 5 gives the confusion matrix of the SPAM-GIST malicious code classification method.

Table 4 Confusion matrix of the malicious code classification method based on convolutional neural networks

	0	1	2	3	4	5	6	7	8
0	.976	.006	0	.003	.003	.003	.003	.006	0
1	.013	.983	0	0	0	0	0	.004	0
2	0	0	1	0	0	0	0	0	0
3	0	0	0	.99	0	0	0	.01	0
4	0	0	0	0	.86	0	0	.14	0
5	.013	0	0	.007	0	.973	0	.007	0
6	0	0	0	.014	0	0	.972	0	.014
7	.032	.004	.026	0	0	0	.004	.915	.013
8	.011	0	0	0	0	0	.011	.022	.956

Table 5 Confusion matrix of the SPAM-GIST malicious code classification method

	0	1	2	3	4	5	6	7	8
0	.879	.01	0	.01	.003	.038	.003	.006	.051
1	.004	.964	.006	.004	.002	.012	0	.004	.004
2	0	0	1	0	0	0	0	0	0
3	0	0	0	.965	0	.012	0	.023	0
4	.182	.091	0	0	.636	.091	0	0	0
5	.007	.014	0	.014	0	.951	.014	0	0
6	0	0	.014	0	0	.028	.944	0	.014
7	.046	.025	0	.013	.004	.025	.004	.866	.017
8	.019	.005	.024	.015	0	.019	0	.01	.908

From above-mentioned Fig. 5, Fig. 6 and table 4 and table 5 as can be seen that the malice of the present invention based on convolutional neural networks Code classification method, no matter on the whole or from each malicious code family for, to the classifying quality of malicious code It is substantially better than the SPAM-GIST malicious codes sorting technique, the i.e. malicious code based on convolutional neural networks of the invention Sorting technique can preferably identify malicious code, disclosure satisfy that practical application wants malicious code classification accuracy It asks.In addition, from fig. 6, it can be seen that the present invention the malicious code sorting technique based on convolutional neural networks AUC=0.978, The AUC=0.953 of the SPAM-GIST malicious code sorting techniques, therefore the evil based on convolutional neural networks of the present invention The classification performance of meaning code classification method is better than the SPAM-GIST malicious codes sorting technique.

Fig. 7 is a schematic diagram of the classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes. In the present invention, the spectrogram of the malicious code is further resized to images of different sizes using the bicubic interpolation algorithm, in order to verify the influence of image size on the classification performance of the method. In this experiment, the total amount of test data (TNum) and the amount of correctly classified data (RNum) are counted for each family, and Accuracy, P and R are calculated. Since the Accuracy of each family in multi-class classification equals P, only P and R need to be calculated. The results are shown in Table 6, which gives the classification results of the method of the present invention for spectrograms of different sizes for each family.
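As an illustration of how per-family P and R can be obtained from classification counts, the sketch below uses the standard multi-class definitions of precision and recall. The layout `counts[i][j]` (true family i, predicted family j) and the toy numbers are hypothetical and are not taken from Table 6.

```python
def precision_recall(counts):
    """Per-family (precision, recall) from a count confusion matrix.

    counts[i][j] is the number of samples of true family i that were
    predicted as family j (layout assumed for this sketch).
    """
    n = len(counts)
    results = []
    for k in range(n):
        correct = counts[k][k]                               # correctly classified samples of family k
        total_true = sum(counts[k])                          # all test samples of family k
        total_predicted = sum(counts[i][k] for i in range(n))  # all samples predicted as family k
        precision = correct / total_predicted if total_predicted else 0.0
        recall = correct / total_true if total_true else 0.0
        results.append((precision, recall))
    return results


# Hypothetical two-family example.
toy_counts = [
    [8, 2],
    [1, 9],
]
for family, (p, r) in enumerate(precision_recall(toy_counts)):
    print(f"family {family}: P = {p:.3f}, R = {r:.3f}")
```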

Table 6. Classification results of the malicious code classification method based on convolutional neural networks of the present invention for spectrograms of different sizes, by family

As Table 6 and Fig. 7 show, the size of the spectrogram has little influence on the correct recognition rate (Accuracy) of the malicious code classification method based on convolutional neural networks of the present invention.

Fig. 8 is a schematic diagram of the recognition results of the MCCNN method and the MCCNN_ORI method described in the present invention. In the present invention, the influence on classification is further analysed by feeding two kinds of input to the convolutional neural network: the spectrogram of the audio signal converted directly from the binary file of the malicious code, and the spectrogram of the audio signal converted from the binary file after information entropy processing. The former is denoted the MCCNN_ORI method and the latter the MCCNN method. In this experiment, the spectrogram size is 128*128*1. As Fig. 8 shows, MCCNN_ORI and MCCNN achieve almost identical experimental results.

In summary, the malicious code classification method based on convolutional neural networks of the present invention is little affected by external input parameters and is inherently stable. Its applicability and generalizability are therefore strong.

In addition, the present invention further counts the amount of data discarded from each sample in MCCNN. The statistics are shown in Table 7, where SamNum denotes the number of samples and AbdNum denotes the amount of data discarded from a sample. For example, the row of Table 7 with AbdNum < 1000 and SamNum = 1203 indicates that there are 1203 samples in which the amount of discarded data is below 1000.

Table 7. Amount of data discarded in MCCNN

AbdNum	SamNum
<1000	1203
1000~5000	5701
5000~10000	1254
>10000	2702

As Table 7 shows, the MCCNN_ORI input contains considerably more data than the MCCNN input, which is equivalent to MCCNN_ORI having a large amount of junk code inserted, yet the malicious code classification method based on convolutional neural networks of the present invention still obtains essentially the same classification results. Therefore, the method proposed by the present invention can effectively overcome junk code insertion.
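The discarded data counted in Table 7 comes from the information entropy processing of MCCNN, in which the byte array is scanned with a non-overlapping window of length 128 and windows whose entropy is 0 are rejected. The sketch below is one possible reading of that step, under stated assumptions: the entropy is computed with a base-2 logarithm, a shorter final window is treated like any other window, and junk code is modelled as a window-aligned run of one repeated byte. The function names are illustrative.

```python
import math
from collections import Counter

WINDOW = 128  # window length given in the claims


def shannon_entropy(window):
    """Shannon entropy (base 2, an assumption) of the byte values in a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def drop_zero_entropy_windows(data, window=WINDOW):
    """Scan data with a non-overlapping window and discard zero-entropy windows.

    Returns the kept bytes and the number of discarded bytes.
    """
    kept = bytearray()
    discarded = 0
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        if shannon_entropy(chunk) == 0:
            discarded += len(chunk)   # every byte in the window is identical
        else:
            kept.extend(chunk)
    return bytes(kept), discarded


# Hypothetical example: 256 varied bytes with 1280 bytes of constant padding inserted.
payload = bytes(range(256))
padded = payload[:128] + b"\x00" * 1280 + payload[128:]

clean, dropped = drop_zero_entropy_windows(padded)
print(clean == payload, dropped)
```

In this aligned example the padding is removed completely, so the padded and unpadded files yield the same signal. Padding that does not line up with window boundaries leaves mixed windows with non-zero entropy, which this step keeps.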

In summary, the present invention discloses a malicious code classification method based on convolutional neural networks. By combining signal processing techniques with the representational power of convolutional neural networks, the method maps malicious code to a single-channel signal, generates a spectrogram of the signal using signal processing methods, converts the spectrogram to a gray-scale image of fixed size using an image scaling algorithm, and finally uses a convolutional neural network to classify the malicious code. Because the malicious code is mapped to a single-channel signal before the corresponding spectrogram is generated, sufficient contextual information about the malicious code is obtained: the spectrogram reflects not only the time-domain and frequency-domain information of the signal but also its local and global information. In addition, owing to properties of convolutional neural networks such as local translation invariance, the essential features of the malicious code are captured more effectively, and situations such as code reordering and junk code insertion are effectively overcome, thereby improving the classification accuracy for malicious code.
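The spectrogram step of this pipeline can be sketched with NumPy as follows. This is an illustrative sketch, not the implementation of the invention: the frame length of 256, the use of non-overlapping frames, the one-sided FFT and the min-max mapping to gray levels 0 to 255 are all assumptions. The rectangular window and the expression 10·log10(|X| + e^-1) follow the text of the claims as printed. Resizing to a fixed size is not shown.

```python
import numpy as np


def log_amplitude_spectrogram(signal, frame_len=256):
    """Frame the signal, apply a rectangular window, take the DFT of each
    frame and return the log-amplitude spectrum 10*log10(|X| + e^-1).

    frame_len and the non-overlapping framing are assumptions of this sketch.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.ones(frame_len)                  # rectangular window
    spectrum = np.fft.rfft(frames * window, axis=1)
    return 10.0 * np.log10(np.abs(spectrum) + np.exp(-1.0))


def to_gray(log_amp):
    """Map a log-amplitude spectrum to 8-bit gray levels.

    Min-max scaling is an assumption; the source states only that the
    scaling refers to the maximum gray level of the spectrogram.
    """
    shifted = log_amp - log_amp.min()
    peak = shifted.max()
    if peak == 0:
        return np.zeros(log_amp.shape, dtype=np.uint8)
    return np.round(255.0 * shifted / peak).astype(np.uint8)


# Hypothetical input: 1024 byte values standing in for a mapped binary file.
signal = np.arange(1024) % 256
gray = to_gray(log_amplitude_spectrogram(signal))
print(gray.shape, gray.dtype)
```

Each row of `gray` is one frame and each column one frequency bin; saving the array as a single-channel image would give the gray-scale spectrogram that is then resized and passed to the network.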

The above are merely embodiments of the present invention and do not limit the scope of the invention. Any equivalent structural transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A malicious code classification method based on convolutional neural networks, characterized by comprising the steps of:

signal mapping: mapping the binary file of the malicious code to an audio signal file;

spectrogram generation: generating a spectrogram of the malicious code by visualizing the audio signal file;

size modification: resizing the spectrogram to an image of fixed size using an image interpolation method;

code classification: inputting the resized image into the convolutional neural network to classify the malicious code.

2. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the signal mapping, the binary file of the malicious code is converted to a one-dimensional array by reading every 8 bits as one unsigned integer, and the one-dimensional array is then mapped to an audio signal file.

3. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, when the one-dimensional array is mapped to an audio signal file, the number of channels is set to 1, the sampling frequency to 44.1 kHz, and the quantization depth to 4 bytes.

4. The malicious code classification method based on convolutional neural networks according to claim 2, characterized in that, after the binary file of the malicious code is converted to a one-dimensional array by reading every 8 bits as one unsigned integer, the method further comprises: scanning the one-dimensional array with a non-overlapping window of length 128 and calculating the information entropy of the data in the window; if the information entropy is 0, the data in the window is discarded.

5. The malicious code classification method based on convolutional neural networks according to claim 4, characterized in that the information entropy is H = -Σ_k p_k log p_k, where p_k is the probability that the value k occurs in the window.

6. The malicious code classification method based on convolutional neural networks according to any one of claims 2-5, characterized in that the audio signal file is a WAV file.

7. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the spectrogram generation comprises:

(1) applying framing, windowing and the discrete Fourier transform to the audio signal file of the malicious code to obtain X(n, k), where s(n) is the audio signal of the malicious code, N is the window size, which equals the frame length used when framing the audio signal, w(n) is a rectangular window function, and X is the Fourier coefficient of s(n);

(2) calculating the logarithmic amplitude spectrum A(n, k) of the audio signal file of the malicious code, where A(n, k) = 10log₁₀(|X(n, k)| + e^-1);

(3) applying a gray-scale transformation to the logarithmic amplitude spectrum A(n, k) to obtain G(n, k), where A_max(n, k) denotes the maximum gray level of the spectrogram;

(4) saving G(n, k) as a single-channel PNG image to obtain the gray-scale spectrogram image of the malicious code.

8. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the image interpolation method is the bicubic interpolation algorithm,

where (x, y) is the pixel to be interpolated in the spectrogram, (x_i, y_j) (i, j = 0, 1, 2, 3) are the points of the 4*4 neighborhood of that pixel, and w(x) is the interpolation basis function,

with a = 0.5.

9. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that the nonlinear activation function of the convolutional layers and fully connected layers in the convolutional neural network comprises the ReLU function, the Leaky ReLU function or the ELU function.

10. The malicious code classification method based on convolutional neural networks according to claim 1, characterized in that, in the size modification, the spectrogram is resized to an image of fixed size 128*128*1.
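For illustration, the signal mapping of claims 2, 3 and 6 can be sketched with the Python standard library. The claims fix the number of channels (1), the sampling frequency (44.1 kHz) and the quantization depth (4 bytes); how each 8-bit value is placed into a 4-byte sample is not stated, so storing each byte value as one little-endian 32-bit integer sample is an assumption of this sketch, as is the function name `bytes_to_wav`.

```python
import io
import struct
import wave


def bytes_to_wav(data):
    """Map a byte string to an in-memory single-channel WAV file.

    Each 8-bit value becomes one 32-bit sample (an assumption; the source
    gives only the channel count, sampling frequency and sample width).
    """
    samples = struct.pack("<%di" % len(data), *data)  # one int32 per byte value
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # single channel
        wav.setsampwidth(4)        # 4-byte quantization
        wav.setframerate(44100)    # 44.1 kHz sampling frequency
        wav.writeframes(samples)
    return buffer.getvalue()


# Hypothetical input standing in for the bytes of a binary file.
binary = bytes([0, 1, 2, 253, 254, 255])
wav_bytes = bytes_to_wav(binary)

with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    params = (reader.getnchannels(), reader.getsampwidth(),
              reader.getframerate(), reader.getnframes())
print(params)
```

Reading the file back confirms that the header carries the three parameters of claim 3 and that the number of frames equals the number of input bytes.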