CN112800426A - Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Info

Publication number: CN112800426A
Application number: CN202110182166.XA
Authority: CN
Inventors: 梁军淼; 宁振虎; 曹东芝; 公备
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2021-05-14
Anticipated expiration: 2041-02-09
Also published as: CN112800426B

Abstract

The invention discloses a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN. The method constructs a malicious code generation model, uses a swarm intelligence algorithm to calculate an acceptable optimal initial sample proportion for the malicious code families, and generates malicious code for each family to construct a relatively balanced malicious code data set. The swarm intelligence algorithm yields the acceptable optimal sample proportion of each malicious code family, and cGAN is introduced to learn the data distribution of the different families and generate samples. The imbalanced data set is thereby processed into a malicious code data set with relatively balanced samples, so that the different classes of malicious code reach an ideal proportion when selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.

Description

Method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN

Technical Field

The invention belongs to the field of information security and relates in particular to a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN. It is a data balancing strategy for the malicious code classification problem.

Background

With the rapid development of information technology, the internet has become an important part of daily life. It brings many benefits to life, study and work, but it also harbors many security problems such as Trojan horses, phishing websites and malware, and malicious code is one of the main security threats. Driven by economic gain, the number of new malware samples is growing explosively, and anti-malware vendors face millions of potential malware samples each year. To keep pace with this growth, research must rely on a large number of high-quality samples to build efficient malware detection models.

In classification applications, data imbalance significantly harms the training of classification models, both in the convergence of training and in the generalization of the model during testing. High-quality data is the key to machine learning and deep learning: scarce data hinders model development, whereas a model trained on high-quality data is more robust (less prone to overfitting), and a good data set can even make training simpler and faster. In malicious code detection, the data across malicious code families is severely imbalanced, so overfitting occurs easily during training and the trained model classifies poorly. Current strategies for addressing data imbalance fall roughly into three categories.

1) Research at the data level

This mainly involves resampling, of which oversampling and undersampling are the most common, or augmenting the classes that have few samples through data augmentation methods, among which generative adversarial networks (GANs) work well. Either way, the distribution of the training set is changed so that it tends toward balance.

2) Research at the algorithm level

The optimal sampling weights of the data set are obtained through an optimization algorithm, for which swarm intelligence optimization algorithms work well; or the classification algorithm itself is improved to reduce the bias toward the negative class and raise the recognition rate of the positive class, the most popular approach being cost-sensitive classification.

3) Combination of the data and algorithm levels

This mainly integrates the two strategies above (data-level and algorithm-level research), drawing on the strengths of each while reducing their weaknesses, to obtain a data set with a balanced distribution and thereby improve the classification performance of the model.

Disclosure of Invention

To solve the problem that an imbalanced sample data set leads to poor performance of the trained model in malicious code detection, the invention provides a new method for handling data imbalance. First, a conditional generative adversarial network (cGAN) is constructed to perform data augmentation on the samples of each family. Then, because swarm intelligence algorithms are good at solving combinatorial optimization problems, a typical swarm intelligence algorithm, particle swarm optimization (PSO), is selected to calculate the sample proportion of each malicious code family, and data augmentation is performed according to that proportion. Finally, a malicious code data set with relatively balanced samples is constructed from the original data set and the data generated according to the proportion.

The technical solution adopted by the invention is a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN, comprising the following steps:

Step 1, constructing a malicious code generation model.

A generative adversarial network (GAN) consists mainly of two parts, a generator network G and a discriminator network D, which play a dynamic game: G tries to fool D by passing off generated samples as real, while D continually improves its discrimination to tell real data from data synthesized by G, until the two finally reach a Nash equilibrium, i.e. in theory the distribution of the data generated by G, $P_g$, equals the real data distribution, $P_{data}$. A conditional generative adversarial network (cGAN) can guide the generation of data through a controlling parameter: on top of the original network structure, additional auxiliary information y is added to the inputs of both the discriminator and the generator. y can be, for example, the class label of each data item; in the invention, y is the family label of the malicious code. After continuous adversarial training and iterative optimization of the generator and discriminator networks, the generator can be used as the generation model for malicious code.
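For reference, the game between G and D is commonly written as the minimax objective below. This is the standard cGAN formulation from the literature rather than a formula stated in this document; the value function $V$ and the real sample $x$ are notation introduced here for illustration.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim P_{data}} \left[ \log D(x \mid y) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z \mid y) \mid y) \right) \right]
```

Under this formulation, the Nash equilibrium mentioned above corresponds to $P_g = P_{data}$, at which point the best D can do is output 1/2 for every input.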

Step 1.1, in the generator network, a random input z is drawn from the prior distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation;

Step 1.2, in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;

Step 1.3, through repeated iterative learning, the discriminator network D improves its ability to distinguish real samples from fake ones, and the generator network G improves its ability to produce convincing fakes. The two compete dynamically and are optimized continuously over the iterations. When D can finally no longer distinguish real data from generated data, i.e. D takes the generated data G(z) to be real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data. The generator network then serves as the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
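The conditioning in steps 1.1 and 1.2 can be illustrated with a minimal sketch of how the two network inputs might be assembled. The one-hot label encoding, the standard normal prior, the vector sizes, and all function names are assumptions made for illustration; the document does not specify them, and the networks themselves are omitted.

```python
import random

def one_hot(label, num_families):
    """Encode a family index as a one-hot list (an assumed label encoding)."""
    vec = [0.0] * num_families
    vec[label] = 1.0
    return vec

def generator_input(label, num_families, noise_dim, rng):
    """Step 1.1: draw z from the prior p_z(z) and concatenate it with label y."""
    # Assumed prior: independent standard normal components.
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_families)

def discriminator_input(sample, label, num_families):
    """Step 1.2: pair a real or generated sample with its family label y."""
    return list(sample) + one_hot(label, num_families)

rng = random.Random(0)
g_in = generator_input(label=2, num_families=5, noise_dim=8, rng=rng)
d_in = discriminator_input(sample=[0.1, 0.9, 0.4], label=2, num_families=5)
print(len(g_in), len(d_in))  # 13 8
```

Because both networks see the same label y, a trained generator can be asked for samples of one specific family simply by fixing y.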

Step 2, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm.

A typical swarm intelligence algorithm, PSO, is used to find acceptable optimal initial weights for the different classes of malicious code families. Assuming that the number of malicious code families is M and the resampling weight of family i is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:

$$position = (W_1, W_2, \ldots, W_M)$$

The accuracy of the trained model is used as the objective function. Algorithm 1 is the swarm-intelligence-based procedure for calculating the optimal initial weights of the malicious code.
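The search in step 2 can be sketched as a basic particle swarm optimization over weight vectors. This is a generic PSO under stated assumptions, not a reproduction of Algorithm 1: the hyperparameters (`inertia`, `c1`, `c2`, swarm size), the weight range [0, 1], and above all the `toy_accuracy` objective are hypothetical stand-ins. In the method described here, the objective would be the accuracy of a classifier trained on data resampled with the candidate weights.

```python
import random

def pso(objective, dim, n_particles=20, n_iters=50, seed=0,
        inertia=0.7, c1=1.5, c2=1.5, low=0.0, high=1.0):
    """Maximize `objective` over weight vectors in [low, high]^dim."""
    rng = random.Random(seed)
    pos = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen per particle
    pbest_fit = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]   # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each weight to the allowed range.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            fit = objective(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical stand-in for "accuracy of the trained model": peaks at TARGET.
TARGET = [0.8, 1.0, 0.6]

def toy_accuracy(weights):
    return 1.0 - sum((w - t) ** 2 for w, t in zip(weights, TARGET)) / len(TARGET)

best_position, best_accuracy = pso(toy_accuracy, dim=len(TARGET))
print([round(w, 2) for w in best_position], round(best_accuracy, 3))
```

Each objective evaluation in the real setting means training and evaluating a classifier, so the swarm size and iteration count would have to be chosen with that cost in mind.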

Step 3, generating malicious code for each family and constructing a relatively balanced malicious code data set.

According to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, each family is augmented to a different degree through the cGAN model, i.e. samples of each class are generated by the generation model, so that a malicious code sample set with balanced data is constructed.

Assume that the malicious code data set to be classified contains M classes. Let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ corresponds to the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportion of each malicious code family obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$, i.e. m is a positive integer.

The number of samples to generate for a class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data augmentation weight $W_i$ of the class, and the number of samples $X_i$ of that class. The calculation formula is:

$$Y_i = X_{max} W_i - X_i$$

The data augmentation weight $W_i$ is calculated as follows:

where $Y_i$ is the number of samples to generate for the family of class i, $C_i$ is the i-th value of the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the family with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes.

This yields $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to generate for each family, from which a malicious code data set with relatively balanced samples across families is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
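The arithmetic can be worked through with a small sketch. The formula for $W_i$ is not reproduced in this text, so the code assumes $W_i = C_i / C_{max}$, which is only a guess consistent with the symbols defined above. Rounding to whole samples and clamping at zero are further assumptions, as is the function name. The example reuses the family sizes and the 8:10 proportion that appear in the first experiment.

```python
def samples_to_generate(counts, ratios):
    """Return Y_i = X_max * W_i - X_i for every family.

    ASSUMPTION: W_i = C_i / C_max, where C_max is the proportion entry of
    the largest family. Rounding and clamping at zero are also assumptions.
    """
    largest = max(range(len(counts)), key=lambda i: counts[i])
    x_max, c_max = counts[largest], ratios[largest]
    generated = []
    for x_i, c_i in zip(counts, ratios):
        w_i = c_i / c_max                      # assumed augmentation weight
        generated.append(max(0, round(x_max * w_i - x_i)))
    return generated

counts = [132, 1541]   # X: original samples per family
ratios = [8, 10]       # C: target sample proportion per family
extra = samples_to_generate(counts, ratios)
balanced = [x + y for x, y in zip(counts, extra)]
print(extra, balanced)  # [1101, 0] [1233, 1541]
```

Under this assumption the largest family always gets $W_{max} = 1$ and therefore $Y_{max} = 0$, which matches the balanced data set above, where $X_{max}$ appears without an added term.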

Step 3.1, generating data with the malicious code generation model trained in step 1.3, according to the optimal initial sample proportion calculated in step 2;

Step 3.2, combining the generated data set with the original data set to construct a relatively balanced malicious code data set.
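Steps 3.1 and 3.2 amount to a per-family merge, sketched below. The trained cGAN generator is replaced by a hypothetical `fake_generator` callable that returns placeholder strings. The dictionary layout and all names are assumptions for illustration only.

```python
def build_balanced_dataset(original, to_generate, generate):
    """Merge original samples with generated ones, family by family.

    original:    dict mapping family label -> list of real samples
    to_generate: dict mapping family label -> number of samples to add
    generate:    callable (label, n) -> list of n synthetic samples
    """
    balanced = {}
    for label, samples in original.items():
        n_new = to_generate.get(label, 0)
        balanced[label] = list(samples) + list(generate(label, n_new))
    return balanced

def fake_generator(label, n):
    """Hypothetical stand-in for the trained cGAN generator G(z | y)."""
    return [f"{label}_synthetic_{k}" for k in range(n)]

original = {"A": ["a0", "a1"], "B": ["b0", "b1", "b2", "b3", "b4"]}
balanced = build_balanced_dataset(original, {"A": 2, "B": 0}, fake_generator)
print({label: len(samples) for label, samples in balanced.items()})  # {'A': 4, 'B': 5}
```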

Compared with the prior art, the invention has the following advantages:

1. Scarce data hinders model development, whereas a model trained on high-quality data tends to be more robust, and a good data set can even make training simpler and faster. The method augments malicious code data with the generation model obtained from cGAN training, generating malicious code samples while preserving the real characteristics of the malicious code as far as possible, and thereby expands the malicious code data set.

2. Because sample counts differ greatly across malicious code families, a classifier trained directly on the data set is prone to overfitting, so an appropriate sample proportion is very important for the training set. In practice it is difficult to find the optimal initial weights for dozens of malicious code families. Swarm intelligence algorithms are an effective way to solve complex combinatorial optimization problems, and they can be used to optimize the initial weights of the different malicious code families.

3. The acceptable optimal sample proportion of each malicious code family is obtained with a swarm intelligence algorithm, and cGAN is introduced to learn the data distribution of the different malicious code families and generate samples. The imbalanced data set is thereby processed into a malicious code data set with relatively balanced samples, so that the different classes of malicious code reach an ideal proportion when selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.

Drawings

FIG. 1 is a flow chart of the construction of a balanced malicious code data set.

FIG. 2 is a flow chart of the swarm intelligence algorithm.

FIG. 3 shows the cGAN-based malicious code data augmentation model.

Detailed Description

To make the objects, technical solutions and features of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

The flow chart of the construction of the malicious code balanced data set is shown in fig. 1, and comprises the following steps:

Step S10, constructing a malicious code generation model;

Step S20, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm;

Step S30, generating malicious code for each family and constructing a relatively balanced malicious code data set.

In the embodiment, step S10 of constructing a malicious code generation model further includes the following steps:

Step S100, in the generator network, a random input z is drawn from the prior distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation;

Step S110, in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;

Step S120, through repeated iterative learning, the discriminator network D improves its ability to distinguish real samples from fake ones, and the generator network G improves its ability to produce convincing fakes. The two compete dynamically and are optimized continuously over the iterations. When D can finally no longer distinguish real data from generated data, i.e. D takes the generated data G(z) to be real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data. The generator network then serves as the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.

In the embodiment, step S30 of generating malicious code for each family and constructing a relatively balanced malicious code data set further includes the following steps:

Step S300, generating data with the malicious code generation model trained in step S120, according to the optimal initial sample proportion calculated in step S20;

Step S310, combining the generated data set with the original data set to construct a relatively balanced malicious code data set.

The practical effect of the proposed method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN was verified through experiments. The test environment was an Ubuntu 14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data came from the Malware Images data set and from the data set generated by cGAN. Two experiments were set up:

In the first experiment, two families from the Malware Images data set, Swizzor.gen!I (132 samples) and Ramnit (1541 samples), were selected for a comparative experiment; their class imbalance exceeds 1:10. A cGAN network was used to generate samples for the Swizzor.gen!I family. Table 1 shows the classification accuracy of AlexNet models trained with different proportions of Swizzor.gen!I and Ramnit samples.

Table 1. Experimental results of models trained with different sample proportions

Table 1 shows that generating Swizzor.gen!I family data with cGAN improves the classification accuracy of the model, and that the AlexNet model trained with a Swizzor.gen!I to Ramnit sample ratio of 8:10 achieves the best classification accuracy.

In the second experiment, because it is difficult in practice to find the optimal initial weights for dozens of malicious code families, the PSO algorithm was used to calculate the optimal sample proportion of the malicious code. The data set obtained by generating data according to that proportion (D3) was compared with the original sample data set (D1) and with a data set in which all families have the same proportion of samples (D2). Table 2 shows the classification accuracy of the models trained on the three data sets.

Table 2. Experimental results of models trained on the three data sets

Data set	D1	D2	D3
ROC area	0.905	0.956	0.974
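The document does not say how the ROC area in Table 2 was computed. For a task with many families it is typically an average of one-vs-rest values, but that is an assumption here. As a minimal illustration of the metric itself, the sketch below computes the binary ROC area as the fraction of positive/negative pairs that the classifier ranks correctly. The labels and scores are made-up toy values, not experimental data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                 # toy ground truth
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]     # toy classifier scores
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 1.0 means every positive sample is scored above every negative one, and 0.5 corresponds to random ranking.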

The comparison shows that the method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN substantially improves classification on imbalanced malicious code data.

Claims

1. A method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN, characterized in that it comprises the following steps:

step 1, constructing a malicious code generation model;

a generative adversarial network (GAN) consists of a generator network G and a discriminator network D, which play a dynamic game: G tries to fool D by passing off generated samples as real, and D continually improves its discrimination to tell real data from data synthesized by G, until the two finally reach a Nash equilibrium, i.e. in theory the distribution of the data generated by G, $P_g$, equals the real data distribution, $P_{data}$; a conditional generative adversarial network (cGAN) guides the generation of data through a controlling parameter, i.e. on top of the original network structure, additional auxiliary information y is added to the inputs of both the discriminator and the generator, y being the class label of each data item, and the auxiliary information y being the family label of the malicious code; after continuous adversarial training and iterative optimization of the generator network and the discriminator network, the generator is used as the generation model for malicious code;

step 1.3, through repeated iterative learning, the discriminator network D improves its ability to distinguish real samples from fake ones, and the generator network G improves its ability to produce convincing fakes; the two compete dynamically and are optimized continuously over the iterations; when D can finally no longer distinguish real data from generated data, i.e. D takes the generated data G(z) to be real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data; the generator network is the generation model for malicious code, and the generated data is regarded as new malicious code sample data;

step 2, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm;

a typical swarm intelligence algorithm, PSO, is used to find acceptable optimal initial weights for the different classes of malicious code families; assuming that the number of malicious code families is M and the resampling weight of family i is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:

$$position = (W_1, W_2, \ldots, W_M)$$

the accuracy of the trained model is used as the objective function;

step 3, generating malicious code for each family and constructing a relatively balanced malicious code data set;

according to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, each family is augmented to a different degree through the cGAN model, i.e. samples of each class are generated by the generation model, so that a malicious code sample set with balanced data is constructed.

2. The method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN as claimed in claim 1, wherein: in step 3, assuming that the malicious code data set to be classified contains M classes, let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ corresponds to the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportion of each malicious code family obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$, i.e. m is a positive integer;

the number of samples to generate for a class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data augmentation weight $W_i$ of the class, and the number of samples $X_i$ of that class; the calculation formula is:

$$Y_i = X_{max} W_i - X_i$$

the data augmentation weight $W_i$ is calculated as follows:

where $Y_i$ is the number of samples to generate for the family of class i, $C_i$ is the i-th value of the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the family with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes;

this yields $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to generate for each family, from which a malicious code data set with relatively balanced samples across families is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.