CN112800426B - Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN - Google Patents

Info

Publication number
CN112800426B
CN112800426B CN202110182166.XA CN202110182166A CN112800426B CN 112800426 B CN112800426 B CN 112800426B CN 202110182166 A CN202110182166 A CN 202110182166A CN 112800426 B CN112800426 B CN 112800426B
Authority
CN
China
Prior art keywords
data
malicious code
sample
family
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110182166.XA
Other languages
Chinese (zh)
Other versions
CN112800426A (en)
Inventor
梁军淼
宁振虎
曹东芝
公备
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110182166.XA
Publication of CN112800426A
Application granted
Publication of CN112800426B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN. The method constructs a malicious code generation model, calculates an acceptable optimal initial sample proportion of the malicious code with a swarm intelligence algorithm, then generates malicious code for each family and constructs a relatively balanced malicious code data set. The swarm intelligence algorithm yields an acceptable optimal sample proportion for each malicious code family, while the cGAN learns the data distributions of the different families and generates samples. The imbalanced data set is thereby processed into a malicious code data set in which the classes are relatively balanced, so that the different types of malicious code reach the desired proportion when samples are selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.

Description

Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN
Technical Field
The invention belongs to the field of information security and relates in particular to a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN. It is a data balancing strategy for the malicious code classification problem.
Background
With the rapid development of information technology, the Internet has become an important part of daily life. It brings many benefits to life, study, and work, but it also harbors many security problems such as Trojan horses, phishing websites, and malicious software, and malicious code is one of the main security threats. Driven by economic interests, new malware samples are growing explosively, and anti-malware vendors face millions of potential malware samples each year. To keep pace with this growth, research must rely on large numbers of high-quality samples to build efficient malware detection models.
In classification applications, data imbalance has a significant adverse effect on the training of classification models, both on convergence during training and on generalization during testing. High-quality data is critical to machine learning and deep learning: scarce data hinders model development, whereas models trained on high-quality data tend to be more robust (less prone to overfitting), and a good data set can even make training simpler and faster. In malicious code detection, the data are severely imbalanced across malicious code families, so overfitting readily occurs during training and the trained model classifies poorly. Current research strategies for the data imbalance problem fall broadly into three categories.
1) Data-level approaches
These mainly use resampling, commonly upsampling and downsampling, or apply data augmentation to classes with few samples, for which generative adversarial networks (GANs) work well. Either way, the distribution of the training set is changed so that it becomes balanced.
2) Algorithm-level approaches
The optimal sampling weights for the data set are obtained with an optimization algorithm, for which swarm intelligence optimization algorithms work well; or the classification algorithm is improved to reduce the bias toward the negative class and raise the recognition rate of the positive class, the most popular example being cost-sensitive classification.
3) Combined data-level and algorithm-level approaches
These integrate the two strategies above, drawing on the strengths of each and reducing their weaknesses, to obtain a data set with a balanced distribution and thereby improve the model's classification performance.
Disclosure of Invention
To solve the problem that models trained on imbalanced sample data sets perform poorly in malicious code detection, the invention provides a new method for handling data imbalance. First, a conditional generative adversarial network (cGAN) is constructed to perform data augmentation on the sample data of each family. Then, because swarm intelligence algorithms are good at combinatorial optimization problems, a typical swarm intelligence algorithm, particle swarm optimization (PSO), is selected to calculate the sample proportion of each malicious code family, and data augmentation is carried out according to that proportion. Finally, a malicious code data set with relatively balanced sample data is constructed from the original data set and the data generated according to the proportion.
The technical solution adopted by the invention is a method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN, comprising the following steps:
Step 1: construct a malicious code generation model.
A generative adversarial network (GAN) consists mainly of two parts, a generator network G and a discriminator network D, which play a dynamic game: G tries to deceive D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two finally reach a Nash equilibrium, at which point the distribution of the data generated by G, $P_g$, theoretically equals the real data distribution $P_{data}$. A conditional generative adversarial network (cGAN) can guide data generation by controlling a parameter: on top of the original network structure, an additional piece of auxiliary information y is added to the inputs of both the discriminator and the generator. y can be, for example, the class label of each data item; in the invention, y is the family label of the malicious code. After the generator and the discriminator have been optimized through repeated adversarial iterations, the generator can serve as the generation model for malicious code.
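For reference, the conditional objective usually given for a cGAN in the literature can be written as below. This document does not state the loss explicitly, so the form shown is an assumption about the training objective rather than the patent's own formula; x denotes a real malicious code sample and y its family label.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y) \mid y)\right)\right]
$$

At the Nash equilibrium described above, the best D can do is output 1/2 for every input, which corresponds to $P_g = P_{data}$.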
Step 1.1: in the generator network, a random input z is drawn from the prior random distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation;
Step 1.2: in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;
Step 1.3: through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate. The two are optimized continuously through this dynamic contest. When D can no longer distinguish real data from generated data, that is, when D takes the generated data G(z) for real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
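The conditioning in steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration of how the inputs are assembled, not the patent's network: the standard normal prior, the one-hot label encoding, and the names `one_hot`, `generator_input`, and `discriminator_input` are all assumptions made for the example.

```python
import random

def one_hot(label, num_families):
    """Encode a family index y as a one-hot vector (assumed encoding)."""
    vec = [0.0] * num_families
    vec[label] = 1.0
    return vec

def generator_input(label, num_families, noise_dim, rng):
    """Step 1.1: draw z from the prior p_z(z) and concatenate it with label y."""
    # Assumption: the prior is a standard normal distribution.
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_families)

def discriminator_input(sample, label, num_families):
    """Step 1.2: pair a real or generated sample with its family label y."""
    return list(sample) + one_hot(label, num_families)

rng = random.Random(0)
g_in = generator_input(label=2, num_families=5, noise_dim=8, rng=rng)
d_in = discriminator_input(sample=[0.1, 0.9, 0.4], label=2, num_families=5)
```

In a real model the two vectors would be fed to the generator and discriminator networks; the point here is only that both networks see the same family label y alongside their usual input.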
Step 2: calculate the acceptable optimal initial sample proportion of the malicious code with a swarm intelligence algorithm.
A typical swarm intelligence algorithm, PSO, is used to find the acceptable optimal initial weights for the different malicious code families. Assuming that the number of malicious code families is M and the resampling weight is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:
$position = (W_1, W_2, \ldots, W_n)$
The accuracy of the trained model is taken as the objective function. Algorithm 1 is the procedure for calculating the optimal initial weights of the malicious code based on the swarm intelligence algorithm.
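A basic particle swarm search over weight vectors might look like the sketch below. It is not the patent's Algorithm 1, which is not reproduced in this text: the inertia and acceleration constants, the [0, 1] bounds, and the function names are assumptions, and `toy_accuracy` is a hypothetical stand-in for the real objective (the accuracy of a classifier trained with the given weights).

```python
import random

def pso(objective, dim, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5,
        lo=0.0, hi=1.0, seed=0):
    """Maximize objective over [lo, hi]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    # Each position is one candidate weight vector (W_1, ..., W_dim).
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                pos[k][d] = min(hi, max(lo, pos[k][d] + vel[k][d]))
            fit = objective(pos[k])
            if fit > pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[k][:], fit
    return gbest, gbest_fit

# Hypothetical objective that peaks at weights (0.8, 1.0, 0.5).
TARGET = [0.8, 1.0, 0.5]

def toy_accuracy(weights):
    return 1.0 - sum((a - b) ** 2 for a, b in zip(weights, TARGET))

best_weights, best_fit = pso(toy_accuracy, dim=3)
```

In practice each objective evaluation would mean resampling the data set with the candidate weights and training a classifier, so the swarm size and iteration count are a trade-off against training cost.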
Step 3: generate malicious code for each family and construct a relatively balanced malicious code data set.
According to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, the samples of each family are augmented to different degrees with the cGAN model, that is, the generation model is used to generate samples of each class, thereby constructing a data-balanced malicious code sample set.
Assume that the malicious code data set used for classification contains M classes. Let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ is the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportion of the malicious code families obtained by the swarm intelligence method, with $m \in \mathbb{N}^+$, i.e., m is a positive integer.
From the sample count of the largest family in the original data set, $X_{max}$, the data enhancement weight $W_i$ of a given class, and the number of samples of that class, $X_i$, the number of samples that must be generated for that class is calculated. The formula is:
$Y_i = X_{max} W_i - X_i$
The data enhancement weight $W_i$ is calculated as follows:
where $Y_i$ is the number of samples to be generated for the i-th family, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the class with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes.
This gives $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced family samples is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
Step 3.1: according to the optimal initial sample proportion calculated in step 2, generate data with the malicious code generation model trained in step 1.3;
Step 3.2: the generated data set and the original data set together form a relatively balanced malicious code data set.
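The sample counts in step 3 can be worked through with a short sketch. The formula for $W_i$ does not appear in this text, so the code assumes $W_i = C_i / C_{max}$, which fits the definitions of $C_i$ and $C_{max}$ and leaves the largest family unchanged ($Y_{max} = 0$). Rounding to whole samples and flooring at zero are also assumptions, as is the function name `samples_to_generate`.

```python
def samples_to_generate(counts, proportions):
    """Return Y_i = X_max * W_i - X_i per family, rounded and floored at zero.

    Assumption: W_i = C_i / C_max, where C_max is the proportion value of
    the family with the most original samples.
    """
    i_max = max(range(len(counts)), key=lambda i: counts[i])
    x_max, c_max = counts[i_max], proportions[i_max]
    result = []
    for x_i, c_i in zip(counts, proportions):
        w_i = c_i / c_max  # assumed form of the data enhancement weight
        result.append(max(0, round(x_max * w_i - x_i)))
    return result

counts = [132, 1541]   # the two family sample counts used in Experiment 1
proportions = [8, 10]  # the 8:10 ratio reported as best in Experiment 1
to_generate = samples_to_generate(counts, proportions)
balanced = [x + y for x, y in zip(counts, to_generate)]
```

Under these assumptions the smaller family needs 1101 generated samples, giving a final data set of 1233 and 1541 samples, which is close to the 8:10 ratio.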
Compared with the prior art, the invention has the following advantages:
1. Scarce data hinders model development, whereas a model trained on high-quality data tends to be more robust, and a good data set can even make training simpler and faster. The invention augments malicious code data with the generation model obtained from cGAN training, generating malicious code samples that preserve the real characteristics of malicious code as far as possible and thereby expanding the malicious code data set.
2. Because the number of samples differs greatly across malicious code families, a classifier trained directly on the data set is prone to overfitting, so a suitable sample proportion is important for the training set. In practice it is difficult to find the optimal initial weights for dozens of malicious code families. Swarm intelligence algorithms are an effective way to solve complex combinatorial optimization problems, and using one makes it possible to optimize the initial weights of the different malicious code families.
3. The invention obtains an acceptable optimal sample proportion for each malicious code family with a swarm intelligence algorithm and introduces a cGAN to learn the data distributions of the different families and generate samples. The imbalanced data set is thereby processed into a malicious code data set in which the classes are relatively balanced, so that the different types of malicious code reach the desired proportion when samples are selected, positive and negative samples carry equal weight during training, and the data imbalance problem is addressed more effectively.
Drawings
FIG. 1 is a flow chart of the construction of the balanced malicious code data set.
FIG. 2 is a flow chart of the swarm intelligence algorithm.
FIG. 3 shows the cGAN-based malicious code data enhancement model.
Detailed Description
The invention is explained below with reference to the accompanying drawings. To make its objects, technical solutions, and features clearer, it is further elaborated through the following specific embodiment.
The flow of constructing the balanced malicious code data set is shown in FIG. 1 and comprises the following steps:
Step S10: construct a malicious code generation model;
Step S20: calculate the acceptable optimal initial sample proportion of the malicious code with a swarm intelligence algorithm;
Step S30: generate malicious code for each family and construct a relatively balanced malicious code data set.
In this embodiment, step S10 of constructing the malicious code generation model further comprises the following steps:
Step S100: in the generator network, a random input z is drawn from the prior random distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation;
Step S110: in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;
Step S120: through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate. The two are optimized continuously through this dynamic contest. When D can no longer distinguish real data from generated data, that is, when D takes the generated data G(z) for real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
In this embodiment, step S30 of generating malicious code for each family and constructing a relatively balanced malicious code data set further comprises the following steps:
Step S300: according to the optimal initial sample proportion calculated in step S20, generate data with the malicious code generation model trained in step S120;
Step S310: the generated data set and the original data set together form a relatively balanced malicious code data set.
The practical effect of the proposed method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN was verified experimentally. The test environment was an Ubuntu 14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data came from the Malware Images data set and the cGAN-generated data set. Two experiments were set up:
Experiment 1: a comparative experiment was performed on two families, Swizzor.gen!I (132 samples) and Ramnit (1541 samples), whose class imbalance ratio exceeds 1:10. A cGAN was used to generate samples for the Swizzor.gen!I family. Table 1 shows the classification accuracy of AlexNet models trained with Swizzor.gen!I and Ramnit at different ratios.
Table 1. Experimental results of models trained at different ratios
As Table 1 shows, generating Swizzor.gen!I family data improves the classification accuracy of the model, and the AlexNet model trained with Swizzor.gen!I and Ramnit samples at a ratio of 8:10 achieved the best classification accuracy.
Experiment 2: in practice it is difficult to find the optimal initial weights for dozens of malicious code families, so the optimal sample proportion of the malicious code was calculated with the PSO algorithm. The data set obtained after generating data according to this proportion (D3) was compared with the original sample data set and with a data set in which all family samples have the same proportion (D1 and D2). Table 2 shows the classification performance of the models trained on the three data sets.
Table 2. Experimental results of models trained on the three data sets
Data set    D1       D2       D3
ROC area    0.905    0.956    0.974
The comparison shows that the proposed method based on a swarm intelligence algorithm and a cGAN markedly improves the classification of imbalanced malicious code data: the model trained on D3 reaches an ROC area of 0.974, compared with 0.905 for D1 and 0.956 for D2.

Claims (1)

1. A method for processing imbalanced malicious code data based on a swarm intelligence algorithm and a cGAN, characterized by comprising the following steps:
Step 1: construct a malicious code generation model;
a generative adversarial network (GAN) consists of a generator network G and a discriminator network D, which play a dynamic game: G tries to deceive D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two finally reach a Nash equilibrium, at which point the distribution of the data generated by G, $P_g$, theoretically equals the real data distribution $P_{data}$; a conditional generative adversarial network (cGAN) guides data generation by controlling a parameter, that is, on top of the original network structure an additional piece of auxiliary information y is added to the inputs of both the discriminator and the generator, where y is the class label of each data item, and here the auxiliary information y is the family label of the malicious code; after the generator and the discriminator have been optimized through repeated adversarial iterations, the generator serves as the generation model for malicious code;
Step 1.1: in the generator network, a random input z is drawn from the prior random distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation;
Step 1.2: in the discriminator network, a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination;
Step 1.3: through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to imitate; the two are optimized continuously through this dynamic contest, and when D can no longer distinguish real data from generated data, that is, when D takes the generated data G(z) for real data, the model is considered optimal and G is considered to have fully learned the distribution of the real sample data; the generator network is then the generation model for malicious code, and the data it generates are regarded as new malicious code sample data;
Step 2: calculate the acceptable optimal initial sample proportion of the malicious code with a swarm intelligence algorithm;
a typical swarm intelligence algorithm, PSO, is used to find the acceptable optimal initial weights for the different malicious code families; assuming that the number of malicious code families is M and the data enhancement weight is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:
$position = (W_1, W_2, \ldots, W_i, \ldots, W_n)$
the accuracy of the trained model is taken as the objective function;
Step 3: generate malicious code for each family and construct a relatively balanced malicious code data set;
according to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, the samples of each family are augmented to different degrees with the cGAN model, that is, the generation model is used to generate samples of each class, thereby constructing a data-balanced malicious code sample set;
Step 3.1: according to the optimal initial sample proportion calculated in step 2, generate data with the malicious code generation model trained in step 1.3;
Step 3.2: the generated data set and the original data set together form a relatively balanced malicious code data set;
in step 3, assume that the malicious code data set used for classification contains M classes; let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of each family, where $X_{max}$ is the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportion of the malicious code families obtained by the swarm intelligence method, with $m \in \mathbb{N}^+$, i.e., m is a positive integer;
from the sample count of the largest family in the original data set, $X_{max}$, the data enhancement weight $W_i$ of a given class, and the number of samples of that class, $X_i$, the number of samples that must be generated for that class is calculated; the formula is:
$Y_i = X_{max} W_i - X_i$
the data enhancement weight $W_i$ is calculated as follows:
where $Y_i$ is the number of samples to be generated for the i-th family, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the class with the most samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes;
this gives $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced family samples is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
CN202110182166.XA 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN Active CN112800426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110182166.XA CN112800426B (en) 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Publications (2)

Publication Number Publication Date
CN112800426A CN112800426A (en) 2021-05-14
CN112800426B true CN112800426B (en) 2024-03-22

Family

ID=75815048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182166.XA Active CN112800426B (en) 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Country Status (1)

Country Link
CN (1) CN112800426B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704842A (en) * 2019-09-27 2020-01-17 山东理工大学 Malicious code family classification detection method
CN110795732A (en) * 2019-10-10 2020-02-14 南京航空航天大学 SVM-based dynamic and static combination detection method for malicious codes of Android mobile network terminal
CN111259393A (en) * 2020-01-14 2020-06-09 河南信息安全研究院有限公司 Anti-concept drift method of malicious software detector based on generation countermeasure network
CN111753299A (en) * 2020-06-22 2020-10-09 重庆文理学院 Unbalanced malicious software detection method based on packet integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant