Robust detection method and system for lightweight high-precision malicious software identification model
Technical Field
The invention belongs to the field of software security protection technology, and relates to a robust detection method and system for a lightweight high-precision malicious software identification model.
Background
The popularization of various intelligent terminal devices has indirectly led to a flood of malicious software. Most detection technologies currently adopted by antivirus vendors are signature-based and rely on retrieval over an ever-growing virus database to detect more malware, so they can never detect unknown, novel malicious software. The endless stream of malware demands detection technology with strong generalization ability and robustness. A model can obtain better performance by relying on a sandbox to extract the dynamic behavior information of the software, but such offline detection is not applicable to real-time malware detection. The byte file of the software can be directly visualized as a grayscale image, and identifying malware with an end-to-end convolutional neural network not only dispenses with complicated feature engineering but also achieves excellent performance.
The scale of a deep malware detection model is generally proportional to its detection accuracy and inversely proportional to its speed. A large-scale deep model can usually obtain higher accuracy, but it consumes a large amount of hardware resources on the device and has a long inference time, which is not conducive to deployment on lightweight devices; a small-scale model can be deployed on lightweight devices, but its accuracy falls far short of that achieved by a large-scale model. Practical application of a model must satisfy two requirements: the scale must be sufficiently small and the accuracy sufficiently high. Therefore, how to effectively reduce the size of a high-precision large-scale model while keeping the accuracy loss as small as possible is one of the key problems in the field.
Second, malware developers often bypass model detection by modifying the program structure and adding unnecessary benign components to generate obfuscated samples. Therefore, how to effectively detect malicious obfuscated samples is an important and urgent problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a robust detection method of a lightweight high-precision malicious software identification model;
the invention also provides a robust detection system of the lightweight high-precision malicious software identification model.
The invention aims to lighten a high-precision large-scale model to a great extent while keeping high accuracy and strong discrimination capability on obfuscated samples.
Interpretation of terms:
1. benign byte file: the byte-representation file of a normal file or piece of software collected from the system or a third-party software library;
2. malicious byte file: the byte-representation file of the executable file of malware;
3. ResNet101 model: a pre-trained model used for transfer learning.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
a robust detection method for a lightweight high-precision malware identification model comprises the following steps:
step 1: acquiring a byte file data set of application software, wherein the byte file data set comprises benign byte files and multiple classes of malicious byte files, and visualizing all byte files as grayscale images;
step 2: training a plurality of convolutional autoencoders according to malicious category, migrating the decoder part into the generator of a generative adversarial network, training the generative adversarial network, generating grayscale images of adversarial samples for each malicious category, and adding them to the malicious data set;
step 3: training a large-scale teacher model and a convolutional autoencoder, and migrating the encoder of the convolutional autoencoder into a small-scale student model;
step 4: using a model compression technology based on knowledge distillation, taking the multi-class prediction probability distribution of the large-scale teacher model and the real labels respectively as soft and hard supervision information for the student model, and feeding the combined loss of the two back to the student model as the total loss;
step 5: predicting the category of an input sample with the finally obtained student model, which is the detection result.
According to the present invention, the specific implementation process of visualizing a byte file as a grayscale image in step 1 includes: treating every two hexadecimal digits as one pixel of the grayscale image, the decimal value of which is the pixel value; setting a width w, and converting all bytes in the byte file into a matrix by tiling them row by row, which is then visualized as a grayscale image.
According to the invention, the specific implementation process of step 2 comprises the following steps:
step 2.1: training a plurality of convolutional autoencoders separately according to the category of the malicious software, reconstructing the original input image, and using the reconstruction error as feedback;
step 2.2: migrating the decoder of the convolutional autoencoder into the generator of a generative adversarial network, and using a one-dimensional random vector z drawn from the standard normal distribution N(0,1) as the input of the generator to generate sample data G(z);
the generated sample data G(z) and the original image x are used as the input of the discriminator D in the generative adversarial network for authenticity judgment, and the objective function of the generative adversarial network is shown in formula (I):
min_G max_D V(D, G) = E_(x~p_data(x))[log D(x)] + E_(z~p_z(z))[log(1 − D(G(z)))]    (I)
in formula (I), D(x) represents the probability that the discriminator D classifies the original image x as real;
adversarial samples of each malicious category generated by the generator of the generative adversarial network are used as an expanded labelled malicious sample set.
Further preferably, the convolutional autoencoder comprises an encoder and a decoder, wherein the encoder is formed by stacking a plurality of convolutional layers, pooling layers, BN layers and a single fully-connected layer, and the decoder is formed by stacking a plurality of transposed convolutional layers, pooling layers and BN layers; wherein a BN layer follows each convolution and pooling operation;
given an input image x ∈ X = {x0, x1, x2, ..., xm}, the encoder E compresses x into compressed data f(x), and the decoder reconstructs the compressed data f(x) into a reconstructed image x′; the operation process of the convolutional autoencoder is shown in formulas (II) and (III):
f(x) = s(Wx + b)    (II)
x′ = g(f(x)) = s(W′f(x) + b′)    (III)
in the formulas (II) and (III), s () is an activation function, W and W 'are weight matrix sets, b and b' are bias value sets, and g () is a deconvolution mapping function;
the mean square error between the original image x and the reconstructed image x′ is used as feedback to train the convolutional autoencoder, and the reconstruction error e is shown in formula (IV):
e = (1/(m·n)) · Σ_(i=1..m) Σ_(j=1..n) (x_ij − x′_ij)²    (IV)
in formula (IV), m and n are the height and width of the image respectively, and x_ij and x′_ij are the pixel values at the corresponding positions of the real image x and the reconstructed image x′ respectively.
Further preferably, a one-dimensional random vector z drawn from the standard normal distribution N(0,1) is used as the prior distribution of the generator, so that the generator maps it to a generated image x″ in the data space of the corresponding class;
a discriminator is constructed with the same structure as the encoder of the convolutional autoencoder, with a fully-connected layer whose number of neurons equals the total number of categories n appended at the end, and the original image x and the generated image x″ are used as input for category discrimination;
according to the invention, the specific implementation process of step 3 preferably comprises the following steps:
step 3.1: the original benign and malicious grayscale images obtained in step 1 and the grayscale images of the adversarial samples of each malicious category generated in step 2 are fed together into the large-scale teacher model for multi-class supervised training;
step 3.2: the original benign and malicious grayscale images obtained in step 1 and the grayscale images of the adversarial samples of each malicious category generated in step 2 are fed together into the convolutional autoencoder for self-supervised training;
step 3.3: the encoder of the trained autoencoder is used as the student model, and a dense layer whose number of neurons equals the total number of categories n is appended at its end for multi-classification.
Preferably, the number of neurons in the last fully-connected layer of the existing ResNet101 model is modified to the total number of categories n to obtain the large-scale teacher model.
Preferably, in step 4, the KL divergence is used to measure the difference between the probability distribution over categories predicted by the large-scale teacher model and that predicted by the student model for the corresponding sample, and the difference between the two is shown in formula (V):
KL(p, q) = Σ_(i=1..n) p_i · log(p_i / q_i)    (V)
in formula (V), p is the probability distribution over categories predicted by the student model for the corresponding sample, and q is the probability distribution over categories predicted by the teacher model; p_i and q_i are the probabilities with which the student model and the teacher model respectively predict the input sample as the i-th category, and n is the total number of categories.
A hyper-parameter τ is set to soften the probability distributions over categories produced by the large-scale teacher model and the student model for the corresponding sample, and a hyper-parameter α is set to adjust the influence of the soft and hard supervision on the total loss, as shown in formulas (VI) and (VII) respectively:
p_i = exp(v_i/τ) / Σ_(j=1..n) exp(v_j/τ)    (VI)
L = α·KL(p, q) + (1 − α)·CE(y, p)    (VII)
in formulas (VI) and (VII), v_i denotes the predicted value of the i-th category, formula (VI) is applied to both the teacher and the student outputs, p is the probability distribution over categories predicted by the student model, and y is the real label;
KL is the KL divergence between the probability distribution predicted by the large-scale teacher model and that predicted by the student model for the corresponding sample, and CE is the cross entropy between the student model's prediction and the real label;
preferably, in step 4, feeding the combined loss of the two back to the student model as the total loss means: setting a hyper-parameter α, where α is the weight of the soft-supervision loss and 1 − α is the weight of the hard-supervision loss, and the weighted sum of the soft- and hard-supervision losses is the total loss of the student model.
A robust detection system for a lightweight high-precision malware identification model is used for running the robust detection method of the lightweight high-precision malware identification model and comprises a sample generation module, a knowledge distillation module and a detection module;
the sample generation module is used for generating malicious sample data sets of the corresponding categories by using convolutional autoencoders, generative adversarial networks and the idea of transfer learning; the knowledge distillation module is used for distilling the knowledge of the large-scale teacher model into the student model; the detection module is used for predicting unknown samples with the trained student model.
The invention has the beneficial effects that:
1. The invention uses autoencoders and transfer learning throughout, which makes the models converge quickly to a good state.
2. The method only processes byte files and then uses an end-to-end deep convolutional model to automatically extract high-order features and judge potential patterns, which removes the classification algorithm's heavy dependence on the completeness of the feature space extracted by tedious feature engineering and can meet the real-time requirement of malware detection.
3. GAN is used to generate data for each sample category, which on the one hand alleviates the huge sample demand of the teacher model, and on the other hand the generated adversarial samples improve the model's ability to discriminate unknown obfuscated samples.
4. A knowledge-distillation-based model compression technology distills the knowledge of the large-scale teacher model into the small-scale student model, which reduces hardware resource consumption, improves inference speed, and makes deployment on small terminal devices feasible.
Drawings
FIG. 1 is a flow diagram of a robust detection method for a lightweight high-accuracy malware identification model;
FIG. 2 is an overall architecture diagram of a robust detection method for a lightweight high-precision malware identification model;
FIG. 3 is an exemplary illustration of a byte file visualized as a grayscale map;
FIG. 4 is a schematic structural diagram of the per-category malicious adversarial sample generation module;
FIG. 5 is a schematic diagram of an example of training of a generative adversarial network;
fig. 6 is a schematic diagram of a network structure of a convolutional auto-encoder.
Detailed Description
The invention is further described below with reference to the figures and examples of the description, but is not limited thereto.
Example 1
A robust detection method for a lightweight high-precision malware identification model is shown in fig. 1 and 2, and comprises the following steps:
step 1: acquiring a byte file data set of application software, wherein the byte file data set comprises benign byte files and multiple classes of malicious byte files, and visualizing all byte files as grayscale images; as shown in fig. 3, the specific implementation process of visualizing a byte file as a grayscale image includes: treating every two hexadecimal digits (eight binary bits) as one pixel of the grayscale image, the decimal value (0-255) of which is the pixel value; setting a width w, and converting all bytes in the byte file into a matrix by tiling them row by row, which is then visualized as a grayscale image.
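As a minimal illustration of this visualization step, the following Python sketch converts a byte file into a fixed-width grayscale image; the file path, the default width of 256 and the use of NumPy and Pillow are assumptions made for illustration, not part of the invention.

```python
import numpy as np
from PIL import Image

def bytes_to_grayscale(byte_file_path: str, width: int = 256) -> Image.Image:
    """Visualize a byte file as a grayscale image: each byte (two hex digits,
    i.e., eight binary bits) becomes one pixel with a value between 0 and 255."""
    with open(byte_file_path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # Pad the tail with zeros so the bytes tile row by row into a width-w matrix.
    rows = int(np.ceil(len(data) / width))
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[: len(data)] = data
    return Image.fromarray(padded.reshape(rows, width), mode="L")

# Example usage (hypothetical path):
# bytes_to_grayscale("sample.bytes", width=256).save("sample.png")
```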
Step 2: as shown in fig. 4, a plurality of convolutional autoencoders are trained according to malicious category, the decoder part is migrated into the generator of a generative adversarial network, the generative adversarial network is trained, grayscale images of adversarial samples are generated for each malicious category, and these grayscale images are added to the malicious data set; the specific implementation process comprises the following steps:
step 2.1: a plurality of convolutional autoencoders (CAE) are trained separately according to the category of the malicious software, the original input image is reconstructed, and the reconstruction error is used as feedback; the convolutional autoencoder comprises an encoder and a decoder, as shown in fig. 6, the encoder is formed by stacking a plurality of convolutional layers, pooling layers, BN (batch normalization) layers and a single fully-connected layer, and the decoder is formed by stacking a plurality of transposed convolutional layers, pooling layers and BN layers; wherein a BN layer follows each convolution and pooling operation;
given an input image x ∈ X = {x0, x1, x2, ..., xm}, the encoder E compresses x into compressed data f(x), and the decoder reconstructs the compressed data f(x) into a reconstructed image x′; the operation process of the convolutional autoencoder is shown in formulas (II) and (III):
f(x) = s(Wx + b)    (II)
x′ = g(f(x)) = s(W′f(x) + b′)    (III)
in the formulas (II) and (III), s () is an activation function, W and W 'are weight matrix sets, b and b' are bias value sets, and g () is a deconvolution mapping function;
the mean square error (MSE) between the original image x and the reconstructed image x′ is used as feedback to train the convolutional autoencoder, and the reconstruction error e is shown in formula (IV):
e = (1/(m·n)) · Σ_(i=1..m) Σ_(j=1..n) (x_ij − x′_ij)²    (IV)
in formula (IV), m and n are the height and width (rows and columns) of the image respectively, and x_ij and x′_ij are the pixel values at the corresponding positions of the real image x and the reconstructed image x′ respectively.
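A minimal PyTorch sketch of one such convolutional autoencoder is given below, assuming 256×256 single-channel grayscale inputs; the layer counts, channel widths and latent dimension are illustrative choices, not values prescribed by the invention.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Encoder: convolution + pooling + BN blocks ending in one fully-connected layer.
    Decoder: transposed-convolution blocks reconstructing the input resolution."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), nn.BatchNorm2d(16),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), nn.BatchNorm2d(32),
        )
        self.encoder_fc = nn.Linear(32 * 64 * 64, latent_dim)
        self.decoder_fc = nn.Linear(latent_dim, 32 * 64 * 64)
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(), nn.BatchNorm2d(16),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder_fc(self.encoder_conv(x).flatten(1))   # f(x) = s(Wx + b)
        h = self.decoder_fc(z).view(-1, 32, 64, 64)
        return self.decoder_conv(h)                            # x' = g(f(x))

# Reconstruction error e of formula (IV) as the feedback signal:
# loss = nn.MSELoss()(model(x), x)
```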
Step 2.2: migrating an encoder of a convolution automatic encoder into a generation model of a generation countermeasure network (GAN), and adopting a one-dimensional random number z conforming to a standard normal distribution N (0,1) as an input of the generation model for generating sample data G (z);
the generated sample data G(z) and the original image x are used as the input of the discriminator D in the generative adversarial network for authenticity judgment, and the objective function of the generative adversarial network is shown in formula (I):
min_G max_D V(D, G) = E_(x~p_data(x))[log D(x)] + E_(z~p_z(z))[log(1 − D(G(z)))]    (I)
in formula (I), D(x) represents the probability that the discriminator D classifies the original image x as real;
adversarial samples of each malicious category generated by the generator of the generative adversarial network are used as an expanded labelled malicious sample set.
As shown in fig. 5, a one-dimensional random vector z drawn from the standard normal distribution N(0,1) is used as the prior distribution of the generator, so that the generator maps it to a generated image x″ in the data space of the corresponding class;
a discriminator is constructed with the same structure as the encoder of the convolutional autoencoder shown in fig. 6, with a fully-connected layer whose number of neurons equals the total number of categories n appended at the end, and the original image x and the generated image x″ are used as input for category discrimination;
both the generator and the discriminator are trained on the binary task of distinguishing x from x″: the generator is trained so that the discriminator classifies the generated image x″ as a real image, while the discriminator is trained to tell x and x″ apart; the two are engaged in an adversarial game.
After a certain number of training iterations, the grayscale images x″ produced by each generator are saved as the expanded grayscale sample data X″.
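The adversarial game described above can be sketched as follows, assuming a generator G built from the migrated decoder (mapping a noise vector z to an image) and a discriminator D that, as a simplification of the multi-output discriminator described here, emits a single real/fake probability through a final sigmoid; the optimizer settings and epoch count are likewise illustrative.

```python
import torch
import torch.nn as nn

def train_gan(G, D, real_loader, latent_dim=128, epochs=50, device="cpu"):
    """Per-category GAN training: D learns to tell real images x from G(z),
    while G learns to make D classify G(z) as real."""
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    for _ in range(epochs):
        for x in real_loader:
            x = x.to(device)
            b = x.size(0)
            z = torch.randn(b, latent_dim, device=device)          # z ~ N(0, 1)
            real = torch.ones(b, 1, device=device)
            fake = torch.zeros(b, 1, device=device)
            # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
            opt_d.zero_grad()
            d_loss = bce(D(x), real) + bce(D(G(z).detach()), fake)
            d_loss.backward()
            opt_d.step()
            # Generator step: make the discriminator classify G(z) as real.
            opt_g.zero_grad()
            g_loss = bce(D(G(z)), real)
            g_loss.backward()
            opt_g.step()
    return G
```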
The large-scale teacher model is a large-scale convolutional neural network formed by stacking a plurality of convolutional layers, pooling layers, linear layers and the like; the number of neurons in the last fully-connected layer of the existing ResNet101 model is modified to the total number of categories n to obtain the large-scale teacher model. The invention is lightweight as a whole, and other large network models with good performance can also be chosen for transfer learning.
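Building the teacher in this way can be sketched as follows; the torchvision weights API and the handling of the single-channel grayscale input (replicated to three channels before being fed to ResNet101) are assumptions of this sketch, not requirements of the invention.

```python
import torch.nn as nn
from torchvision import models

def build_teacher(num_classes: int) -> nn.Module:
    """Teacher: pre-trained ResNet101 whose last fully-connected layer is
    replaced so its neuron count equals the total number of categories n."""
    teacher = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
    teacher.fc = nn.Linear(teacher.fc.in_features, num_classes)
    return teacher

# ResNet101 expects 3-channel input; a 1-channel grayscale tensor can be
# replicated with x.repeat(1, 3, 1, 1) before the forward pass (an assumption).
```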
Step 3: the large-scale teacher model and the convolutional autoencoder are trained, and the encoder of the convolutional autoencoder is migrated into the small-scale student model; the specific implementation process comprises the following steps:
step 3.1: the original benign and malicious grayscale images obtained in step 1 and the grayscale images of the adversarial samples of each malicious category generated in step 2 are fed together into the large-scale teacher model for multi-class supervised training;
the multi-class supervised training means: 1 benign category and n − 1 malicious categories are set, the data of each category contains both original and generated data, and feedback training is performed according to the difference between the prediction and the label.
Step 3.2: the original benign and malicious grayscale images obtained in step 1 and the grayscale images of the adversarial samples of each malicious category generated in step 2 are fed together into the convolutional autoencoder for self-supervised training;
the self-supervised training means: the input sample itself is used as the training target, and feedback training is performed according to the difference between the reconstructed image and the input image.
Step 3.3: the encoder of the trained autoencoder is used as the student model (small-scale detection model), and a dense layer whose number of neurons equals the total number of categories n is appended at its end for multi-classification.
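Assembling the student from the trained encoder plus the final dense layer can be sketched as below; ConvAutoEncoder refers to the earlier autoencoder sketch, and the names and latent dimension are illustrative assumptions.

```python
import torch.nn as nn

class StudentModel(nn.Module):
    """Small-scale detector: the trained CAE encoder followed by a dense layer
    whose neuron count equals the total number of categories n."""
    def __init__(self, trained_cae, num_classes: int, latent_dim: int = 128):
        super().__init__()
        self.encoder_conv = trained_cae.encoder_conv   # migrated, already trained
        self.encoder_fc = trained_cae.encoder_fc
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        z = self.encoder_fc(self.encoder_conv(x).flatten(1))
        return self.classifier(z)   # raw logits over the n categories
```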
Step 4: using a model compression technology based on knowledge distillation, the multi-class prediction probability distribution of the large-scale teacher model and the real labels are used respectively as soft and hard supervision information for the student model, and the combined loss of the two is fed back to the student model as the total loss; a hyper-parameter α is set, where α is the weight of the soft-supervision loss and 1 − α is the weight of the hard-supervision loss, and the weighted sum of the soft- and hard-supervision losses is the total loss of the student model.
In step 4, the KL divergence (relative entropy) is used to measure the difference between the probability distribution over categories predicted by the large-scale teacher model and that predicted by the student model for the corresponding sample, and the difference between the two is shown in formula (V):
KL(p, q) = Σ_(i=1..n) p_i · log(p_i / q_i)    (V)
in formula (V), p is the probability distribution over categories predicted by the student model for the corresponding sample, and q is the probability distribution over categories predicted by the teacher model; p_i and q_i are the probabilities with which the student model and the teacher model respectively predict the input sample as the i-th category, and n is the total number of categories.
Considering that the student model learns soft knowledge from the large-scale teacher model, if the teacher model is overconfident about the category a sample belongs to, i.e., its predicted probability for that category is extremely high, it is difficult for the student model to learn the teacher model's soft knowledge.
A hyper-parameter τ is set to soften the probability distributions over categories produced by the large-scale teacher model and the student model for the corresponding sample, and a hyper-parameter α is set to adjust the influence of the soft and hard supervision on the total loss, as shown in formulas (VI) and (VII) respectively:
p_i = exp(v_i/τ) / Σ_(j=1..n) exp(v_j/τ)    (VI)
L = α·KL(p, q) + (1 − α)·CE(y, p)    (VII)
in formulas (VI) and (VII), v_i denotes the predicted value of the i-th category, formula (VI) is applied to both the teacher and the student outputs, p is the probability distribution over categories predicted by the student model, and y is the real label;
KL is the KL divergence between the probability distribution predicted by the large-scale teacher model and that predicted by the student model for the corresponding sample, and CE is the cross entropy between the student model's prediction and the real label;
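A minimal sketch of the distillation loss of formulas (V)-(VII) follows, assuming student_logits and teacher_logits are the raw per-category predicted values; the default τ and α values, the τ² scaling and the argument ordering of PyTorch's KL-divergence primitive are implementation conventions of this sketch rather than requirements of the invention.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.7):
    """Total loss L = alpha * KL + (1 - alpha) * CE, with both distributions
    softened by the temperature tau as in formula (VI)."""
    p = F.log_softmax(student_logits / tau, dim=1)   # softened student distribution (log)
    q = F.softmax(teacher_logits / tau, dim=1)       # softened teacher distribution
    # F.kl_div(input=log p, target=q) computes KL(q || p); swap the roles if an
    # exact match with the KL(p, q) ordering of formula (V) is desired.
    soft_loss = F.kl_div(p, q, reduction="batchmean") * tau * tau   # soft supervision
    hard_loss = F.cross_entropy(student_logits, labels)             # hard supervision
    return alpha * soft_loss + (1 - alpha) * hard_loss
```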
The invention uses a model compression technology based on knowledge distillation; on this basis, techniques such as pruning optimization and quantization acceleration may additionally be applied to the student model as supplements to the scheme.
The error and loss functions used in the gradient back-propagation stage of each model in the invention are not limited to the schemes above; loss functions such as L1, L2, cross entropy, mean square error and root mean square error may be used as alternatives.
Step 5: the category of an input sample is predicted with the finally obtained student model, which is the detection result.
In this embodiment, to predict an unknown software sample, its byte file is visualized as a grayscale image, which is then fed into the student model for per-category probability prediction, and the category to which the unknown software sample belongs is output.
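The detection step can be sketched as follows, reusing the hypothetical bytes_to_grayscale helper and StudentModel from the earlier sketches; the 256×256 input size and the class-name list are illustrative assumptions.

```python
import torch
from torchvision import transforms

def predict(student, byte_file_path, class_names, device="cpu"):
    """Visualize an unknown sample's byte file as a grayscale image and
    predict its category with the lightweight student model."""
    img = bytes_to_grayscale(byte_file_path)
    x = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
    ])(img).unsqueeze(0).to(device)
    student.eval()
    with torch.no_grad():
        probs = torch.softmax(student(x), dim=1).squeeze(0)
    return class_names[int(probs.argmax())], probs

# Example usage (hypothetical class names):
# label, probs = predict(student, "unknown.bytes", ["benign", "family_1", "family_2"])
```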
The invention generally uses the autoencoder as a pre-training stage for the other network models and initializes their parameters by transfer learning; compared with a model without a pre-training stage, this makes the model converge quickly to a good state.
The invention is based on static analysis and processes only a single data source of the software, namely the byte file, which can be visualized as a grayscale image and fed directly into an end-to-end deep convolutional model; this removes the classification algorithm's heavy dependence on the completeness of the feature space extracted by tedious feature engineering and can meet the real-time requirement of malware detection. Secondly, GAN is used to generate data for each sample category, which on the one hand alleviates the huge sample demand of the teacher model, and on the other hand the generated adversarial samples improve the model's ability to discriminate unknown obfuscated samples. Thirdly, a knowledge-distillation-based model compression technology distills the knowledge of the large-scale teacher model into the small-scale student model, reducing the detection accuracy as little as possible while significantly reducing the model scale, thereby lowering hardware resource consumption, improving inference speed and making deployment of the model on small terminal devices feasible.
Example 2
A robust detection system for a lightweight high-precision malware identification model is used for running the robust detection method of the lightweight high-precision malware identification model of embodiment 1 and comprises a sample generation module, a knowledge distillation module and a detection module;
the sample generation module is used for generating malicious sample data sets of the corresponding categories by using convolutional autoencoders, generative adversarial networks and the idea of transfer learning; the knowledge distillation module is used for distilling the knowledge of the large-scale teacher model into the student model; the detection module is used for predicting unknown samples with the trained student model.
The method uses the idea of transfer learning throughout, so that the models converge to higher accuracy more quickly. In addition, the generated adversarial samples are used as an expanded labelled data set: on the one hand, the data augmentation helps improve model accuracy, and on the other hand, training with adversarial samples strengthens the model's ability to discriminate obfuscated samples. Thirdly, knowledge distillation is adopted to lighten the large-scale model, making deployment on lightweight devices feasible.