CN110472417B

CN110472417B - Convolutional neural network-based malicious software operation code analysis method

Info

Publication number: CN110472417B
Application number: CN201910776705.5A
Authority: CN
Inventors: 陈璨; 赵立超; 李丹; 史闻博; 庄宇鹏
Original assignee: Northeastern University Qinhuangdao Branch
Current assignee: Northeastern University Qinhuangdao Branch
Priority date: 2019-08-22
Filing date: 2019-08-22
Publication date: 2021-03-30
Anticipated expiration: 2039-08-22
Also published as: CN110472417A

Abstract

The invention discloses a method for analyzing malicious software operation codes based on a convolutional neural network, which comprises the following steps: acquiring Dalvik bytecode; acquiring an operation code sequence and representing it with one-hot vectors; converting the one-hot vectors into vectors of a fixed size, multiplying them by a random weight matrix, and inputting the result to a convolutional neural network; outputting a feature map set matrix C in the convolutional layer; in k-max pooling, performing a max pooling operation on the matrix C, extracting the k most important feature values, and outputting a feature vector Z; feeding the vector Z to a fully connected layer, where it is processed to obtain an output feature y; processing the output feature y with a softmax function to obtain a relative probability distribution p; calculating a cross-entropy loss function L_k; gradually adjusting the parameter values of the model with a gradient descent method so as to minimize the loss function; and iteratively updating the model parameters and optimizing the detection model based on the output calculations. The invention has the characteristic of high detection accuracy.

Description

Convolutional neural network-based malicious software operation code analysis method

Technical Field

The invention relates to the field of malicious software detection, in particular to a malicious software operation code analysis method based on a convolutional neural network.

Background

Currently, Android malware detection mainly comprises static-analysis approaches such as HinDroid, a method that links application programs based on meta-paths, and systems that identify Android malware dynamically. These methods suffer from insufficient accuracy, a low classification success rate, low efficiency, incompletely extracted operation code sequences, and similar defects.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a convolutional neural network-based malicious software operation code analysis method which has the characteristic of high detection accuracy.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

A malicious software operation code analysis method based on a convolutional neural network comprises the following steps:

S1, obtaining training samples; each training sample is an executable file of software of a known type, the types being benign and malicious;

S2, decompiling the training sample to obtain its smali files, and acquiring the Dalvik bytecode from the smali files;

S3, acquiring an operation code sequence file according to the Android operation code constant list, and representing the operation code sequence with one-hot vectors;

S4, converting the one-hot vectors into vectors of a fixed size in the embedding layer;

S5, multiplying the converted one-hot vectors by a random weight matrix to generate a matrix M, and inputting the matrix M into the convolutional neural network;

S6, convolving the matrix M with convolution kernels in the convolutional layer to obtain feature maps, and collecting the feature maps of the different convolution kernels into a feature map set matrix C;

S7, in k-max pooling, performing a max pooling operation on the matrix C and extracting the k most important feature values to obtain a feature vector Z of the hidden layer;

S8, feeding the fixed-size vector Z into a fully connected layer, where it is processed to obtain an output feature y; processing the output feature y with a softmax function, which converts the multi-class output values into relative probabilities, to obtain a relative probability distribution p;

S9, calculating a cross-entropy loss function L_k from the obtained probability distribution p and the label probability distribution;

S10, using the calculated cross-entropy loss function L_k to estimate the difference between the predicted value and the true value of the model, and optimizing with a gradient descent method to obtain an optimal Android program detection model;

S11, iteratively updating the model parameters and optimizing the Android program detection model based on the output calculation and the minimized loss function.

Further, in the operation code sequence represented by one-hot vectors in step S3, each operation code corresponds to one of 256 positions; the position of the operation code that appears is set to 1, and the positions of operation codes that do not appear are set to 0.

Further, in step S7, k is 3 or 5.

Further, in step S8, the vector Z is processed in the fully connected layer: the weighted sum over the previous layer is calculated, combining the individual elements to obtain the output feature y; the softmax function then processes the output feature y in the manner of a classifier.

Further, in step S10, the loss function is minimized by gradually adjusting the parameter values of the model with a gradient descent optimization method, so as to obtain an optimal Android program detection model.

The beneficial effects produced by the above technical scheme are as follows:

The method uses a convolutional neural network to extract static features of Android software effectively, and uses a static analysis tool to extract the operation code sequence from the apk file of the Android software, which is more complete and efficient than existing methods. The invention uses a k-max pooling strategy to capture the relative position information of the operation codes: k-max pooling takes the k largest of all feature values and preserves their original order, so that weakly informative features are eliminated, important features are retained, and the relative position information of the features is also reflected. The invention trains the classification model with a cross-entropy loss, which evaluates the quality of the neural network better than the mean squared error.

Drawings

FIG. 1 is a flow chart of the analysis method of the present invention;

FIG. 2 is a flowchart of the apk pre-processing operation of the present invention;

FIG. 3 is a schematic flow chart of the detection system proposed by the present invention;

FIG. 4 is a diagram of the generation of the convolutional neural network input of the present invention;

FIG. 5 is an example of k-max (k = 2) pooling of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the present invention discloses a method for analyzing malware operation codes based on a convolutional neural network, which comprises the following steps:

S2, decompiling the training sample with apktool to obtain its smali files, acquiring the Dalvik bytecode from the smali files, and discarding the operands; the apk pre-processing procedure is shown in FIG. 2.
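As a rough illustration of the "keep the operation code, discard the operands" part of step S2, the following minimal sketch reads smali text and keeps only the first token of each instruction line. The function name `extract_opcodes` and the sample listing are hypothetical and do not come from the patent, and the filtering rule is a simplification: it does not handle payload blocks such as array data or multi-line annotations.

```python
def extract_opcodes(smali_text):
    """Return the opcode mnemonics of a smali listing, in order, without operands."""
    opcodes = []
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, directives (.method, .locals), comments (#) and labels (:cond_0).
        # Simplification: payload blocks (e.g. array data) are not handled here.
        if not line or line[0] in ".#:":
            continue
        # The first whitespace-separated token is the mnemonic; the rest are operands.
        opcodes.append(line.split()[0])
    return opcodes


# Hypothetical smali fragment used only for illustration.
sample = """
.method public run()V
    .locals 1
    const/4 v0, 0x1
    :cond_0
    invoke-virtual {p0}, Lcom/example/Foo;->bar()V
    return-void
.end method
"""
opcode_names = extract_opcodes(sample)
print(opcode_names)
```

Mapping these mnemonics to numeric opcode values would then rely on the Android operation code constant list mentioned in step S3.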

S3, acquiring an operation code sequence file according to the Android operation code constant list, wherein the operation code sequence is represented by X = (X_1, X_2, …, X_n), n is the length of the operation code sequence of the apk, and X is represented as n × o one-hot vectors (o = 256);
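The one-hot representation of step S3 can be sketched as follows. This is a minimal illustration, assuming the operation codes are already available as integer values in the range 0 to 255; the helper name `one_hot_sequence` and the three example values are illustrative choices rather than anything specified in the text.

```python
NUM_OPCODES = 256  # "o" in the text: one position per possible opcode value


def one_hot_sequence(opcode_values, size=NUM_OPCODES):
    """Turn a list of n opcode values into an n x size list of one-hot rows."""
    rows = []
    for value in opcode_values:
        if not 0 <= value < size:
            raise ValueError(f"opcode value {value} is outside 0..{size - 1}")
        row = [0] * size
        row[value] = 1  # the opcode that appears is 1, every other position stays 0
        rows.append(row)
    return rows


# Illustrative values: 0x12 (const/4), 0x6E (invoke-virtual), 0x0E (return-void).
X = one_hot_sequence([0x12, 0x6E, 0x0E])
print(len(X), len(X[0]))
```

Each row contains exactly one 1, so an apk with n operation codes yields an n × 256 matrix.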

S4, converting the one-hot vectors into vectors of a fixed size in the embedding layer, so as to improve computational efficiency and make the apk information more concentrated and uniform.

S5, multiplying the converted one-hot vector X_i by a random weight matrix W_e ∈ R^{o×g} (o = 256, g is the dimension of the embedding space) to generate a matrix M ∈ R^{n×g}, which is input to the convolutional neural network; the generation of the convolutional neural network input is shown in FIG. 4.
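The multiplication in step S5 can be sketched in plain Python as below. The embedding dimension g = 8, the uniform initialization range, and the helper name `embed` are assumptions made for illustration only; the text does not specify them. The sketch also shows why the product amounts to a row lookup: a one-hot row selects exactly one row of W_e.

```python
import random


def embed(one_hot_rows, weights):
    """Compute M = X * W_e with plain loops; X is n x o and W_e is o x g."""
    g = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(row)) for j in range(g)]
        for row in one_hot_rows
    ]


random.seed(0)
o, g = 256, 8  # g = 8 is an arbitrary illustrative embedding dimension
W_e = [[random.uniform(-1.0, 1.0) for _ in range(g)] for _ in range(o)]

opcode_values = [0x12, 0x6E, 0x0E]  # illustrative opcode values
X = [[1 if i == v else 0 for i in range(o)] for v in opcode_values]
M = embed(X, W_e)
print(len(M), len(M[0]))
```

In practice a framework's embedding layer would perform the lookup directly rather than materializing the one-hot rows.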

The specific convolution operation of a convolution kernel is defined as follows:

C_j＝f₁(Conv(M,w_j)+b_j)

wherein b_j and w_j are the bias parameter and the weight parameter of the j-th convolution kernel and f₁ is the ReLU activation function; the j-th convolution kernel performs a complete convolution operation over the n − h + 1 windows, and the feature maps extracted by the different convolution kernels are collected as follows:

C＝[c₁|c₂…|c_p]^T
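The convolution formula above can be sketched as follows. This is a minimal illustration that assumes each kernel w_j is an h × g matrix slid down the n rows of M with stride 1, which yields the n − h + 1 windows mentioned in the text; the toy matrix, kernels and biases are made up for the example.

```python
def relu(value):
    return max(0.0, value)


def conv_feature_map(M, kernel, bias):
    """One feature map: slide an h x g kernel down the n x g matrix M."""
    n, h = len(M), len(kernel)
    feature_map = []
    for start in range(n - h + 1):  # n - h + 1 window positions
        total = bias
        for r in range(h):
            for col, weight in enumerate(kernel[r]):
                total += M[start + r][col] * weight
        feature_map.append(relu(total))  # f1 is ReLU
    return feature_map


def conv_layer(M, kernels, biases):
    """Collect the feature maps of all kernels into C, one row per kernel."""
    return [conv_feature_map(M, w, b) for w, b in zip(kernels, biases)]


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4, g = 2
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],    # h = 2
    [[-1.0, -1.0], [1.0, 1.0]],  # h = 2
]
biases = [0.0, 0.5]
C = conv_layer(M, kernels, biases)
print(C)  # [[2.0, 1.0, 1.0], [0.5, 1.5, 0.0]]
```

With p kernels of the same height, C has p rows of n − h + 1 values each.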

S7, as shown in FIG. 5, in k-max pooling, performing a max pooling operation on the matrix C and extracting the k most important feature values to obtain a feature vector Z of the hidden layer, described as follows:
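The formula that accompanied this step is not reproduced in this text. As a sketch of the behavior described elsewhere in the document (take the k largest values and keep them in their original order), k-max pooling over each feature map might look like the following; concatenating the pooled maps into a single vector Z is an assumption, and the function names are illustrative.

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest values of one feature map, in their original order."""
    if k >= len(feature_map):
        return list(feature_map)
    # Indices of the k largest values (ties resolved in favor of earlier positions).
    top = sorted(range(len(feature_map)), key=lambda i: feature_map[i], reverse=True)[:k]
    return [feature_map[i] for i in sorted(top)]


def pool_all(C, k):
    """Apply k-max pooling to every feature map and concatenate the results."""
    z = []
    for feature_map in C:
        z.extend(k_max_pooling(feature_map, k))
    return z


C = [[0.2, 0.9, 0.1, 0.7, 0.4], [0.0, 0.3, 0.8, 0.5, 0.6]]
Z = pool_all(C, k=3)
print(Z)  # [0.9, 0.7, 0.4, 0.8, 0.5, 0.6]
```

Because the output length depends only on k and the number of kernels, sequences of different lengths all map to a vector of the same size.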

S8, feeding the fixed-size vector Z into a fully connected layer, where it is processed to obtain an output feature y; this reduces the burden on the model and also helps prevent the model from overfitting. The output feature y is then processed with a softmax function, which converts the multi-class output values into relative probabilities, to obtain the relative probability distribution p.

The output y is computed as follows:

y＝f₂(W_f·z)+b′

wherein W_f and b′ are the weight matrix and the bias of the fully connected layer, and the function f₂ is the ReLU activation function.
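A literal reading of the formula y = f₂(W_f·z) + b′ can be sketched as below, with the bias added after the activation exactly as the formula is written. The weights, bias values and the function name `fully_connected` are illustrative assumptions.

```python
def fully_connected(z, W_f, b_prime):
    """y = ReLU(W_f . z) + b', following the formula as written in the text."""
    y = []
    for weights_row, bias in zip(W_f, b_prime):
        weighted_sum = sum(w * value for w, value in zip(weights_row, z))
        y.append(max(0.0, weighted_sum) + bias)  # f2 is ReLU, then the bias is added
    return y


z = [0.9, 0.7, 0.4]                          # pooled feature vector (illustrative)
W_f = [[1.0, -1.0, 0.5], [-2.0, 1.0, 0.0]]   # two rows, one per output class
b_prime = [0.1, 0.2]
y = fully_connected(z, W_f, b_prime)
print(y)
```

The number of rows of W_f equals the number of output classes, which is two in this method (benign and malicious).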

The formula of the softmax function is defined as follows:

p(l＝i|y)＝exp(y_i)/∑_{j=1}^{T} exp(y_j)

wherein y_i is the output of the i-th neuron of the fully connected layer, T is the number of classes, and l is the label of the data sample, indicating whether the data sample is a malicious sample or a benign sample.
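The softmax step can be sketched in a few lines. Subtracting the maximum before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the resulting probabilities.

```python
import math


def softmax(y):
    """Convert output values y into relative probabilities that sum to 1."""
    shift = max(y)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in y]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([0.5, 0.2])  # illustrative outputs of the fully connected layer
print(p)
```

For two classes, the larger output value always receives the larger probability.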

S9, calculating a cross-entropy loss function L_k from the obtained probability distribution p and the label probability distribution. The invention uses the cross-entropy loss function, which evaluates the quality of the neural network better than the mean squared error.

The loss function L_k is defined as follows:

wherein L_k denotes the loss function calculated for the k-th input sample, and p(l＝i|y) is the i-th value of the softmax output vector, denoting the probability that the sample belongs to the i-th class. The label l_i is preceded by a summation symbol ranging from 1 to the number of classes T; T = 2 is the number of outputs of the fully connected layer.
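The formula for L_k itself does not appear in this text. Reading the symbol descriptions above (a sum over the T classes of the label l_i combined with the softmax output), it would be consistent with the standard cross-entropy form shown below; this is an assumption rather than a quotation from the source.

```latex
L_k = -\sum_{i=1}^{T} l_i \,\log p(l = i \mid y)
```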

L represents the average of the loss function for a batch of b data samples, which can be interpreted as the difference between the predicted value and the true value obtained by modeling, defined as follows:
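The defining formula is missing at this point in the text. Assuming the standard cross-entropy for each sample and a plain arithmetic mean over the batch, L = (1/b) ∑ L_k, the computation can be sketched as follows; the one-hot label encoding and the class order (index 0 benign, index 1 malicious) are hypothetical.

```python
import math


def cross_entropy(p, label):
    """Per-sample loss: -sum_i l_i * log p_i, with label given as a one-hot list."""
    return -sum(l * math.log(prob) for l, prob in zip(label, p) if l > 0)


def batch_loss(probabilities, labels):
    """Average of the per-sample losses over a batch of b samples."""
    losses = [cross_entropy(p, l) for p, l in zip(probabilities, labels)]
    return sum(losses) / len(losses)


# Illustrative softmax outputs and labels for a batch of b = 3 samples.
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [[1, 0], [0, 1], [1, 0]]  # hypothetical order: index 0 benign, index 1 malicious
L = batch_loss(probabilities, labels)
print(L)
```

A confident correct prediction contributes a loss near zero, while an uncertain one (0.5, 0.5) contributes log 2.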

S11, iteratively updating the model parameters and optimizing the proposed Android program detection model based on the output calculation and the minimized loss function;

S12, after model training is finished, inputting a new test set to test the model.

In step S3, in the operation code sequence represented by one-hot vectors, each operation code corresponds to one of 256 positions; the position of the operation code that appears is set to 1, and the positions of operation codes that do not appear are set to 0.

In step S7, k is 3 or 5. The effect is best when k is 3 or 5: weakly informative features are eliminated, important features are retained, and overfitting of the model is reduced.

In step S8, the vector Z is processed in the fully connected layer: the weighted sum over the previous layer is calculated, combining the individual elements to obtain the output feature y; the softmax function then processes the output feature y in the manner of a classifier.

In step S10, the loss function is minimized by gradually adjusting the parameter values of the model with a gradient descent optimization method, so as to obtain an optimal Android program detection model.
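The gradient descent idea of step S10 can be illustrated on a deliberately simplified model: a single linear layer followed by softmax, trained with full-batch gradient descent on the cross-entropy loss. This is not the patent's network (the convolution, pooling and ReLU stages are omitted), and the learning rate, epoch count and toy data are arbitrary assumptions; the sketch only shows the loss falling as the parameters are adjusted step by step.

```python
import math


def softmax(y):
    shift = max(y)
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, labels, num_classes=2, learning_rate=0.5, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy of a linear softmax model."""
    dim, m = len(samples[0]), len(samples)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    history = []
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        loss = 0.0
        for z, label in zip(samples, labels):
            y = [sum(w * v for w, v in zip(W[c], z)) + b[c] for c in range(num_classes)]
            p = softmax(y)
            loss -= math.log(p[label])
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the c-th output: p_c - l_c.
                delta = p[c] - (1.0 if c == label else 0.0)
                grad_b[c] += delta
                for d in range(dim):
                    grad_W[c][d] += delta * z[d]
        history.append(loss / m)
        for c in range(num_classes):
            b[c] -= learning_rate * grad_b[c] / m
            for d in range(dim):
                W[c][d] -= learning_rate * grad_W[c][d] / m
    return W, b, history


# Toy feature vectors and class indices (hypothetical: 0 = benign, 1 = malicious).
samples = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]
labels = [0, 0, 1, 1]
W, b, history = train(samples, labels)
print(history[0], history[-1])
```

With all parameters initialized to zero, the first recorded loss equals log 2, and it decreases from there as training proceeds.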

The analysis method is implemented in Python; the main libraries used are pandas, sklearn, matplotlib, and numpy. The system development environment is 64-bit Ubuntu 17.10. The proposed neural network detection model was developed with the TensorFlow and Torch frameworks, and a Tesla K80 GPU was used for the experiments and for training the detection method.

The method uses a convolutional neural network to extract static features of Android software effectively, and uses a static analysis tool to extract the operation code sequence from the apk file of the Android software. The k-max pooling strategy applied to the output of the convolutional layer captures the relative position information of the operation codes and turns variable-length sequences into fixed-length inputs, which reduces the number of model parameters and helps reduce overfitting. Finally, the classification model is trained with a cross-entropy loss, which better evaluates the quality of the neural network. Experiments on different data sets show that, compared with other methods, the method has certain advantages in accuracy, precision and other metrics, and that it performs well across the different data sets.

The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.

Claims

1. A malicious software operation code analysis method based on a convolutional neural network, characterized by comprising the following steps:

C_j＝f₁(Conv(M,w_j)+b_j)

C＝[c₁|c₂…|c_p]^T

The loss function L_k is defined as follows:

wherein L_k represents the loss function calculated for the k-th input sample, p(l＝i|y) is the i-th value of the softmax output vector, representing the probability that the sample belongs to the i-th class, and the label l_i is preceded by a summation symbol ranging from 1 to the number of classes T, wherein T = 2 is the number of outputs of the fully connected layer;

2. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S3, each operation code in the operation code sequence represented by one-hot vectors corresponds to one of 256 positions, the position of the operation code that appears being set to 1 and the positions of operation codes that do not appear being set to 0.

3. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S7, k is 3 or 5.

4. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S8, the vector Z is processed in the fully connected layer, the weighted sum over the previous layer is calculated, combining the individual elements to obtain the output feature y; the softmax function processes the output feature y in the manner of a classifier.

5. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S10, the loss function is minimized by gradually adjusting the parameter values of the model with a gradient descent optimization method, so as to obtain an optimal Android program detection model.