CN112800426B

CN112800426B - Method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN

Info

Publication number: CN112800426B
Application number: CN202110182166.XA
Authority: CN
Inventors: 梁军淼; 宁振虎; 曹东芝; 公备
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2024-03-22
Anticipated expiration: 2041-02-09
Also published as: CN112800426A

Abstract

The invention discloses a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN. The method constructs a malicious code generation model, calculates an acceptable optimal initial sample proportion for the malicious code families using a swarm intelligence algorithm, and generates malicious code samples for each family to construct a relatively balanced malicious code data set. The swarm intelligence algorithm yields the acceptable optimal sample proportion of each malicious code family, while the cGAN learns the data distribution of the different malicious code families and generates samples. The imbalanced data set is thereby processed into a malicious code data set in which the classes are relatively balanced, so that the different classes of malicious code reach the desired proportion when samples are selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.

Description

Method for processing imbalanced malicious code data based on a swarm intelligence algorithm and cGAN

Technical Field

The invention belongs to the field of information security, and in particular relates to a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN. It is a data balancing strategy for the malicious code classification problem.

Background

With the rapid development of information technology, the internet has become an important part of daily life. It brings many benefits to life, study and work, but it also harbors many security problems such as Trojan horse viruses, phishing websites and malicious software, among which malicious code is one of the main security threats. Driven by economic interests, new malware samples are growing explosively, and anti-malware vendors face millions of potential malware samples each year. To keep pace with this growth, researchers must rely on large numbers of high-quality samples to build efficient malware detection models.

In classification applications, data imbalance has a significant adverse effect on the training of classification models, both on the convergence of the model during training and on its generalization in the test phase. High-quality data is critical to machine learning and deep learning: scarce data hinders model development, whereas models trained on high-quality data tend to be more robust (less prone to overfitting), and a good data set can even make training simpler and faster. In malicious code detection, the data are severely imbalanced across malicious code families, so overfitting readily occurs during training and the trained model classifies poorly. Current research strategies for the data imbalance problem fall broadly into three categories.

1) Research at the data level

This approach mainly uses resampling, commonly upsampling and downsampling, or applies data enhancement (augmentation) to the classes with few samples, for which generative adversarial networks (GAN, Generative Adversarial Networks) work well. Either way, the distribution of the training set is changed so that it becomes balanced.

2) Research at the algorithm level

An optimization algorithm is used to obtain the optimal sampling weights for the data set, for which swarm intelligence optimization algorithms work well. Alternatively, the classification algorithm itself is improved to reduce the bias toward the negative class and raise the recognition rate of the positive class, the most popular example being cost-sensitive classification algorithms.

3) Combining the data level and the algorithm level

This approach combines the two strategies above (data-level and algorithm-level research), drawing on the strengths of each and mitigating their weaknesses, to obtain a data set with a balanced distribution and thereby improve the classification performance of the model.

Disclosure of Invention

To address the poor performance of trained models caused by imbalanced sample data sets in malicious code detection, the invention provides a new method for handling data imbalance. First, a conditional generative adversarial network (cGAN, Conditional Generative Adversarial Networks) is constructed to perform data enhancement on the sample data of each family. Then, because swarm intelligence algorithms are well suited to combinatorial optimization problems, a typical swarm intelligence algorithm, particle swarm optimization (PSO), is used to calculate the sample proportion of each malicious code family, and data enhancement is carried out according to that proportion. Finally, a malicious code data set with relatively balanced sample data is constructed from the original data set and the data generated according to the proportion.

The technical scheme adopted by the invention is a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN, comprising the following steps:

Step 1, constructing a malicious code generation model.

A generative adversarial network (GAN) consists of two parts, a generator network G and a discriminator network D, which play a dynamic game: G tries to fool D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two reach a Nash equilibrium, at which point the distribution of the data generated by G ($P_g$) theoretically equals the real data distribution ($P_{data}$). A conditional generative adversarial network (cGAN) can steer data generation through a controlling parameter: on top of the original network structure, additional auxiliary information y is added to the inputs of both the discriminator and the generator. y can be, for example, the class label of each data item; in the invention, y is the family label of the malicious code. After the generator network and the discriminator network have competed and been iteratively optimized, the generator can be used as the generation model for malicious code.
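For orientation, the game described above is commonly written as the minimax objective of the standard cGAN formulation. The patent text does not state this objective itself, so the value function $V$ and the expectation notation below are assumptions carried over from the general cGAN literature rather than the inventors' own formula.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim P_{data}}\left[\log D(x \mid y)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D\left(G(z \mid y)\right)\right)\right]
```

Under this formulation, the Nash equilibrium mentioned in the text corresponds to $P_g = P_{data}$, at which point the discriminator can do no better than output $1/2$ for every input.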

Step 1.1, in the generator network, a random input z is sampled from the prior random distribution $p_z(z)$ and then concatenated with the malicious code family label y to form a new latent representation;

Step 1.2, in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;

Step 1.3, through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate. The two are continuously optimized against each other during the iterations. When D can finally no longer distinguish real data from generated data, i.e., D takes the generated data G(z) to be real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
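The input construction in steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the one-hot encoding of y, the standard normal prior, and the helper names `one_hot`, `generator_input` and `discriminator_input` are all assumptions, since the text only says that z and y are concatenated.

```python
import random

def one_hot(family_index, num_families):
    """Encode a family label y as a one-hot vector (assumed encoding)."""
    return [1.0 if i == family_index else 0.0 for i in range(num_families)]

def generator_input(noise_dim, family_index, num_families, rng):
    """Step 1.1: sample z from the prior p_z(z) and concatenate it with y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # assumed standard normal prior
    return z + one_hot(family_index, num_families)

def discriminator_input(sample, family_index, num_families):
    """Step 1.2: pair a real or generated sample with its family label y."""
    return list(sample) + one_hot(family_index, num_families)

rng = random.Random(0)
g_in = generator_input(noise_dim=4, family_index=2, num_families=3, rng=rng)
d_in = discriminator_input([0.1, 0.9], family_index=2, num_families=3)
```

Here `g_in` holds four noise values followed by the three-element label vector, and `d_in` holds the two sample features followed by the same label vector. In a real model these vectors would feed the first layers of G and D respectively.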

Step 2, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm.

A typical swarm intelligence algorithm, the PSO algorithm, is used to find acceptable optimal initial weights for the different malicious code families. Assuming that the number of malicious code families is M and the resampling weight of family i is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:

$position = (W_1, W_2, \ldots, W_n)$

The accuracy of the trained model is taken as the objective function, and Algorithm 1 is the process for calculating the optimal initial weights of the malicious code based on the swarm intelligence algorithm.
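The search over weight combinations can be illustrated with a generic particle swarm optimizer. This is a sketch under stated assumptions and is not the patent's Algorithm 1, which the text does not reproduce: the inertia and acceleration coefficients, the bounds [0, 1], and the function `toy_accuracy` (a cheap stand-in for "train a classifier with these weights and return its accuracy") are all hypothetical.

```python
import random

def pso(objective, dim, n_particles=20, iters=50, lo=0.0, hi=1.0,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize objective(position) over [lo, hi]^dim with a basic PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the allowed weight range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_accuracy(weights):
    """Hypothetical objective: peaks at an arbitrary target weight vector."""
    target = [0.8, 1.0, 0.5]
    return 1.0 - sum((w - t) ** 2 for w, t in zip(weights, target)) / len(target)

best_weights, best_score = pso(toy_accuracy, dim=3)
```

Each particle's position is one candidate vector $(W_1, \ldots, W_n)$. In the setting described by the patent, evaluating the objective would mean building a data set with those weights and training the classifier, which is far more expensive than the toy function used here.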

Step 3, generating malicious code for each family and constructing a relatively balanced malicious code data set.

According to the optimal sample proportions of the malicious code families calculated by the PSO algorithm, the samples of each family are enhanced to different degrees through the cGAN model, i.e., the generation model is used to generate samples of each class, thereby constructing a data-balanced malicious code sample set.

Assuming that the malicious code data set to be classified comprises M categories, let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ is the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportions of the malicious code families obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$, i.e., m is a positive integer.

The number of samples that must be generated for a given class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data enhancement weight $W_i$ of that class, and the number of samples $X_i$ of that class. The specific calculation formula is:

$Y_i = X_{max} W_i - X_i$

The data enhancement weight $W_i$ is calculated as follows:

where $Y_i$ is the number of samples to be generated for family class i, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the class with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes.

This gives $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced family samples is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
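A small worked example may help make the calculation of $Y_i$ concrete. The formula for $W_i$ is not reproduced in the text, so the sketch below assumes $W_i = C_i / C_{max}$, which is one reading consistent with the stated definitions of $C_i$ and $C_{max}$ but is not confirmed by the source. Rounding to whole samples and clamping at zero are likewise assumptions, as is the function name `samples_to_generate`. The sample counts (132 and 1541) and the 8:10 proportion are borrowed from the first experiment in the detailed description.

```python
def samples_to_generate(counts, proportions):
    """Return Y_i = X_max * W_i - X_i per family, ASSUMING W_i = C_i / C_max."""
    i_max = max(range(len(counts)), key=lambda i: counts[i])
    x_max, c_max = counts[i_max], proportions[i_max]
    generated = []
    for x_i, c_i in zip(counts, proportions):
        w_i = c_i / c_max                  # assumed form of the enhancement weight
        y_i = round(x_max * w_i) - x_i     # rounded to a whole number of samples
        generated.append(max(0, y_i))      # never ask for a negative sample count
    return generated

counts = {"Swizzor.gen!I": 132, "Ramnit": 1541}
proportions = {"Swizzor.gen!I": 8, "Ramnit": 10}
y = samples_to_generate(list(counts.values()), list(proportions.values()))
balanced = [x + g for x, g in zip(counts.values(), y)]
```

Under these assumptions `y` is `[1101, 0]` and `balanced` is `[1233, 1541]`, i.e., the minority family is raised to roughly 80% of the size of the largest family while the largest family is left untouched.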

Step 3.1, according to the optimal initial sample proportion calculated in step 2, data are generated using the malicious code generation model trained in step 1.3;

Step 3.2, the generated data set and the original data set together constitute a relatively balanced malicious code data set.
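Steps 3.1 and 3.2 amount to generating the required number of samples per family and merging them with the originals. The sketch below shows only that bookkeeping: `fake_generator` is a hypothetical stand-in for the trained cGAN generator, and the dictionary layout of a sample (`family`, `features`, `synthetic`) is an assumption made for illustration.

```python
import random

def fake_generator(family, n, rng):
    """Hypothetical stand-in for sampling n items from the trained G(z | y)."""
    return [{"family": family,
             "features": [rng.random() for _ in range(4)],
             "synthetic": True} for _ in range(n)]

def build_balanced_dataset(original, to_generate, seed=0):
    """Step 3.1: generate Y_i samples per family; step 3.2: merge with originals."""
    rng = random.Random(seed)
    dataset = list(original)               # keep every original sample
    for family, n in to_generate.items():
        dataset.extend(fake_generator(family, n, rng))
    return dataset

original = ([{"family": "A", "features": [0.0] * 4, "synthetic": False}] * 2
            + [{"family": "B", "features": [1.0] * 4, "synthetic": False}] * 10)
balanced = build_balanced_dataset(original, {"A": 6, "B": 0})
per_family = {f: sum(1 for s in balanced if s["family"] == f) for f in ("A", "B")}
```

With two original samples of family A, ten of family B, and six generated samples requested for A, `per_family` becomes `{"A": 8, "B": 10}`. Keeping a `synthetic` flag is a design choice that makes it easy to exclude generated samples from a held-out test set.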

Compared with the prior art, the invention has the following advantages:

1. Scarce data hinders model development, whereas a model trained on high-quality data tends to be more robust, and a good data set can even make training simpler and faster. In the invention, data enhancement is performed on malicious code through the generation model obtained by cGAN training: malicious code samples are generated while preserving the real characteristics of the malicious code as far as possible, and the malicious code data set is expanded.

2. Because the numbers of samples in different malicious code families differ greatly, a classifier trained directly on the data set is prone to overfitting, so an appropriate sample proportion is important for the training set. In practice it is difficult to find optimal initial weights for dozens of malicious code families. Swarm intelligence algorithms are an effective way to solve complex combinatorial optimization problems, and using one makes it possible to optimize the initial weights of the different malicious code families.

3. The invention uses a swarm intelligence algorithm to obtain the acceptable optimal sample proportion of each malicious code family, and introduces a cGAN to learn the data distribution of the different malicious code families and generate samples. The imbalanced data set is thereby processed into a malicious code data set in which the classes are relatively balanced, so that the different classes of malicious code reach the desired proportion when samples are selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.

Drawings

FIG. 1 is a flow chart of the construction of the balanced malicious code data set.

FIG. 2 is a flow chart of the swarm intelligence algorithm.

FIG. 3 is the cGAN-based malicious code data enhancement model.

Detailed Description

The invention is explained and illustrated below with reference to the accompanying drawings.

To make the objects, technical solutions and features of the invention clearer, the invention is further elaborated below with reference to specific embodiments and the accompanying drawings.

The flow chart for constructing the balanced malicious code data set is shown in FIG. 1 and comprises the following steps:

Step S10, constructing a malicious code generation model;

Step S20, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm;

Step S30, generating malicious code for each family and constructing a relatively balanced malicious code data set.

Step S10 of the embodiment, constructing the malicious code generation model, further comprises the following steps:

Step S100, in the generator network, a random input z is sampled from the prior random distribution $p_z(z)$ and then concatenated with the malicious code family label y to form a new latent representation;

Step S110, in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;

Step S120, through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate. The two are continuously optimized against each other during the iterations. When D can finally no longer distinguish real data from generated data, i.e., D takes the generated data G(z) to be real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.

Step S30 of the embodiment, generating malicious code for each family and constructing a relatively balanced malicious code data set, further comprises the following steps:

Step S300, according to the optimal initial sample proportion calculated in step S20, data are generated using the malicious code generation model trained in step S120;

Step S310, the generated data set and the original data set together constitute a relatively balanced malicious code data set.

The practical effect of the proposed method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN was verified by experiment. The test environment was an Ubuntu 14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data came from the Malware Images data set and the cGAN-generated data set. Two experiments were set up:

Experiment one: a comparative experiment was performed on two families, Swizzor.gen!I (132 samples) and Ramnit (1541 samples), whose class imbalance ratio exceeds 1:10. A cGAN network was used to generate samples for the Swizzor.gen!I family. Table 1 shows the classification accuracy obtained by training an AlexNet model with different ratios of Swizzor.gen!I to Ramnit samples.

TABLE 1 Experimental results of training models at different ratios

As can be seen from Table 1, generating data for the Swizzor.gen!I family improves the classification accuracy of the model, and the AlexNet model trained with a Swizzor.gen!I to Ramnit sample ratio of 8:10 achieved the best classification accuracy.

Experiment two: in practice it is difficult to find optimal initial weights for dozens of malicious code families, so the PSO algorithm was used to calculate the optimal sample proportion of the malicious code. The data set obtained after generating data according to this proportion (D3) was compared with the original sample data set (D1) and a data set in which all family samples have the same proportion (D2). Table 2 shows the classification results of the models trained on the three data sets.

TABLE 2 Experimental results of models trained on the three data sets

Data set	D1	D2	D3
ROC area	0.905	0.956	0.974

The comparison data show that the method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN substantially improves classification on imbalanced malicious code data.

Claims

1. A method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN, characterized by comprising the following steps:

step 1, constructing a malicious code generation model;

the generative adversarial network GAN consists of a generator network G and a discriminator network D, which play a dynamic game: G tries to fool D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two reach a Nash equilibrium, at which point the distribution $P_g$ of the data generated by G theoretically equals the real data distribution $P_{data}$; the conditional generative adversarial network cGAN steers data generation through a controlling parameter, i.e., on top of the original network structure, additional auxiliary information y is added to the inputs of the discriminator and the generator, where y is the class label of each data item, and the auxiliary information y is the family label of the malicious code; after the generator network and the discriminator network have competed and been iteratively optimized, the generator is used as the generation model for malicious code;

step 1.3, through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate; the two are continuously optimized against each other during the iterations, and when D can finally no longer distinguish real data from generated data, i.e., D takes the generated data G(z) to be real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data; the generator network is then the generation model for malicious code, and the data it generates are regarded as new malicious code sample data;

step 2, calculating the acceptable optimal initial sample proportion of malicious codes by adopting a swarm intelligence algorithm;

a typical swarm intelligence algorithm, the PSO algorithm, is used to find acceptable optimal initial weights for the different malicious code families; assuming that the number of malicious code families is M and the data enhancement weight is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:

$position = (W_1, W_2, \ldots, W_i, \ldots, W_n)$

the accuracy of the trained model is taken as the objective function;

step 3, generating malicious codes of each family, and constructing a relatively balanced malicious code data set;

according to the optimal sample proportions of the malicious code families calculated by the PSO algorithm, the samples of each family are enhanced to different degrees through the cGAN model, i.e., the generation model is used to generate samples of each class, thereby constructing a data-balanced malicious code sample set;

step 3.2, the generated data set and the original data set together constitute a relatively balanced malicious code data set;

in step 3, assuming that the malicious code data set to be classified comprises M categories, let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ is the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportions of the malicious code families obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$, i.e., m is a positive integer;

the number of samples that must be generated for a given class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data enhancement weight $W_i$ of that class, and the number of samples $X_i$ of that class; the specific calculation formula is:

$Y_i = X_{max} W_i - X_i$

the data enhancement weight $W_i$ is calculated as follows:

where $Y_i$ is the number of samples to be generated for family class i, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the class with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes;

this gives $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced family samples is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.