CN112800426A - Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN - Google Patents

Info

Publication number
CN112800426A
Authority
CN
China
Prior art keywords
data
malicious code
sample
family
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110182166.XA
Other languages
Chinese (zh)
Other versions
CN112800426B (en)
Inventor
梁军淼
宁振虎
曹东芝
公备
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110182166.XA
Publication of CN112800426A
Application granted
Publication of CN112800426B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a method for handling imbalanced malicious code data based on a swarm intelligence algorithm and a conditional generative adversarial network (cGAN). The method constructs a malicious code generation model, uses a swarm intelligence algorithm to calculate an acceptable optimal initial sample proportion for the malicious code families, and then generates malicious code for each family to build a relatively balanced malicious code data set. The swarm intelligence algorithm yields the acceptable optimal sample proportion of each malicious code family, and the cGAN learns the data distribution of the different families and generates samples. The imbalanced data set is thereby turned into a malicious code data set with relatively balanced samples, so that the different classes of malicious code reach an ideal proportion when selected and positive and negative samples carry equal weight during training, which addresses the data imbalance problem more effectively.

Description

Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN
Technical Field
The invention belongs to the field of information security. It relates in particular to a method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN, that is, a data balancing strategy for the malicious code classification problem.
Background
With the rapid development of information technology, the Internet has become an important part of daily life. It brings many benefits to life, study and work, but it also harbors many security problems, such as Trojan horse viruses, phishing websites and malicious software, among which malicious code is one of the main security threats. Driven by economic gain, the number of new malware samples is growing explosively, and anti-malware vendors face millions of potential malware samples each year. To keep pace with this growth, research relies on large numbers of high-quality samples to build efficient malware detection models.
In classification applications, data imbalance has a significant adverse effect on the training of classification models, both on the convergence of training and on the generalization of the model in the testing phase. High-quality data is the key to machine learning and deep learning: scarce data hinders model development, whereas a model trained on high-quality data tends to be more robust (less prone to over-fitting), and a good data set can even make training simpler and faster. In malicious code detection, the data is severely imbalanced across malicious code families, so over-fitting occurs easily during training and the trained model classifies poorly. Current strategies for solving the data imbalance problem fall roughly into three groups.
1) Data-level research
This mainly involves resampling, of which upsampling and downsampling are the most common. Alternatively, data augmentation is applied to the classes with little data, for which generative adversarial networks (GANs) work well. Either way, the distribution of the training set is changed so that it tends toward balance.
2) Algorithm-level research
The optimal sampling weights of the data set are obtained by an optimization algorithm, for which swarm intelligence optimization algorithms work well. Alternatively, the classification algorithm is improved to reduce the bias toward the negative class and raise the recognition rate of the positive class; the most popular approach is cost-sensitive classification.
3) Combined data-level and algorithm-level research
This mainly integrates the two strategies above (data-level and algorithm-level research), drawing on the strengths of each while reducing their weaknesses, to obtain a data set with a balanced distribution and thereby improve the classification performance of the model.
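As a point of comparison for the data-level strategy, the following is a minimal sketch of plain random oversampling, the simplest form of upsampling. It is not the method of the invention; the function name random_oversample and the toy labels "A" and "B" are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: class "A" has 10 samples, class "B" has only 2.
samples = list(range(12))
labels = ["A"] * 10 + ["B"] * 2
_, balanced_labels = random_oversample(samples, labels)
counts = Counter(balanced_labels)
```

Random oversampling only repeats existing samples, which is one reason generative approaches are used when more varied minority-class data is wanted.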
Disclosure of Invention
To solve the problem that an imbalanced sample data set leads to a poorly performing trained model in malicious code detection, the invention provides a new method for handling data imbalance. First, a conditional generative adversarial network (cGAN) is constructed to perform data augmentation on the sample data of each family. Then, because swarm intelligence algorithms are well suited to combinatorial optimization problems, a typical swarm intelligence algorithm, particle swarm optimization (PSO), is selected to calculate the sample proportion of each malicious code family, and data augmentation is performed according to that proportion. Finally, a malicious code data set with relatively balanced sample data is constructed from the original data set and the data generated according to the proportion.
The technical solution adopted by the invention is a method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN, comprising the following steps:
Step 1: construct a malicious code generation model.
A generative adversarial network (GAN) consists mainly of two parts, a generator network G and a discriminator network D, which play a dynamic game: G tries to deceive D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two finally reach a Nash equilibrium, at which point the data distribution generated by G ($P_g$) theoretically equals the real data distribution ($P_{data}$). A conditional generative adversarial network (cGAN) can steer data generation through a controlling parameter: on top of the original network structure, additional auxiliary information y is added to the inputs of both the discriminator and the generator. y can be, for example, the class label of each sample; in the invention, y is the family label of the malicious code. After continuous adversarial training and iterative optimization between the generator network and the discriminator network, the generator can be used as the generation model for malicious code.
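For orientation, the adversarial game described above is commonly written as the minimax objective below. This is the standard conditional GAN formulation from the literature rather than a formula stated in this patent; $D(x \mid y)$ denotes the discriminator's estimate that x is real given the label y, and $G(z \mid y)$ the sample generated from noise z under label y.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x \mid y)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y) \mid y)\right)\right]
```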
Step 1.1: in the generator network, a random input z is drawn from the prior random distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation.
Step 1.2: in the discriminator network, either a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination.
Step 1.3: through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to produce convincing fakes. The two networks compete dynamically and are optimized continuously during iteration. When D can finally no longer distinguish real data from generated data, that is, when D takes the generated data G(z) as real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
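The conditioning in steps 1.1 and 1.2 can be sketched as follows. This only illustrates how the network inputs are assembled; it contains no networks and no training. The one-hot encoding of y, the Gaussian prior for z, and the names one_hot, generator_input and discriminator_input are all assumptions made for illustration, since the patent does not specify them.

```python
import random


def one_hot(label, num_families):
    """Encode an integer family label y as a one-hot vector (assumed encoding)."""
    vec = [0.0] * num_families
    vec[label] = 1.0
    return vec


def generator_input(label, num_families, noise_dim, rng):
    """Step 1.1: draw z from the prior p_z (assumed standard normal here)
    and concatenate it with the family label y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_families)


def discriminator_input(sample, label, num_families):
    """Step 1.2: a real or generated sample is presented to the
    discriminator together with the same family label y."""
    return list(sample) + one_hot(label, num_families)


rng = random.Random(0)
g_in = generator_input(label=2, num_families=5, noise_dim=8, rng=rng)
d_in = discriminator_input(sample=[0.1, 0.9, 0.3], label=2, num_families=5)
```

With 8 noise dimensions and 5 families, the generator input has 13 entries, the last 5 of which carry the label; in practice a deep learning framework would consume these vectors as tensors.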
Step 2: calculate the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm.
A typical swarm intelligence algorithm, PSO, is used to find acceptable optimal initial weights for the different malicious code families. Assuming that the number of malicious code families is M and the resampling weight of family i is $W_i$, a combination of sampling weights can be regarded as the position of an individual in the swarm intelligence algorithm, given by:
$position = (W_1, W_2, \ldots, W_M)$
The accuracy of the trained model is used as the objective function. Algorithm 1 is the swarm-intelligence-based procedure for calculating the optimal initial weights of the malicious code.
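[Algorithm 1: swarm-intelligence-based calculation of the optimal initial weights of the malicious code. The algorithm listing is an image in the original and is not reproduced here.]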
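Since the listing of Algorithm 1 is not available here, the following is a generic PSO sketch under stated assumptions, not a reconstruction of Algorithm 1. Each particle position is a vector of family weights in [0, 1]. The inertia and acceleration coefficients, the swarm size, and in particular the function toy_accuracy are hypothetical: in the method described above the fitness would be the accuracy of a model trained with the given weights, which is far too expensive to include in a short example.

```python
import random


def pso_maximize(fitness, dim, n_particles=20, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic particle swarm optimization that maximizes `fitness`
    over positions in [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_fit = [fitness(p) for p in pos]        # personal best fitness
    g = max(range(n_particles), key=lambda k: best_fit[k])
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best

    for _ in range(n_iters):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (inertia * vel[k][d]
                             + c1 * r1 * (best_pos[k][d] - pos[k][d])
                             + c2 * r2 * (g_pos[d] - pos[k][d]))
                # Keep every weight inside [0, 1].
                pos[k][d] = min(1.0, max(0.0, pos[k][d] + vel[k][d]))
            f = fitness(pos[k])
            if f > best_fit[k]:
                best_fit[k], best_pos[k] = f, pos[k][:]
                if f > g_fit:
                    g_fit, g_pos = f, pos[k][:]
    return g_pos, g_fit


def toy_accuracy(weights):
    """Hypothetical stand-in for 'accuracy of the trained model':
    peaks at 1.0 when the weights equal an arbitrary target vector."""
    target = [1.0, 0.8, 0.6]
    return 1.0 - sum((w - t) ** 2 for w, t in zip(weights, target)) / len(target)


best_weights, best_score = pso_maximize(toy_accuracy, dim=3)
```

The returned best_weights plays the role of the position $(W_1, W_2, \ldots)$ in the text; swapping toy_accuracy for a real train-and-evaluate routine would be the expensive part of any actual use.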
Step 3: generate malicious code for each family and construct a relatively balanced malicious code data set.
According to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, each family is augmented to a different degree through the cGAN model, that is, samples of each class are generated by the generation model, so as to construct a data-balanced malicious code sample set.
Assume the malicious code data set used for classification has M classes. Let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of the families (with $X_i$ also denoting the number of samples in family i), where $X_{max}$ belongs to the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportions of the malicious code families obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$ is a positive integer.
The number of samples to generate for a class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data augmentation weight $W_i$ of the class, and the sample count $X_i$ of the class. The formula is:
$Y_i = X_{max} W_i - X_i$
The data augmentation weight $W_i$ is calculated as follows:
[Formula for $W_i$ in terms of $C_i$ and $C_{max}$. It is an image in the original and is not reproduced here.]
where $Y_i$ is the number of samples to be generated for the family of class i, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the family with the largest number of samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes.
This yields $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced sample data across families is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
Step 3.1: generate data with the malicious code generation model trained in step 1.3, according to the optimal initial sample proportion calculated in step 2.
Step 3.2: combine the generated data set with the original data set to construct a relatively balanced malicious code data set.
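The sample-count calculation of step 3 can be sketched as below. The formula for the weight is not legible in this text, so the code assumes $W_i = C_i / C_{max}$, which is only a guess consistent with the definitions of $C_i$ and $C_{max}$. Rounding to whole samples, clamping at zero, and the function name samples_to_generate are likewise assumptions for illustration.

```python
def samples_to_generate(counts, proportions):
    """Return Y_i = X_max * W_i - X_i for every family.

    ASSUMPTION: W_i = C_i / C_max, where C_max is the proportion value
    of the family with the most samples. The patent gives the formula
    for W_i only as an image, so this form is a guess.
    """
    x_max = max(counts)
    c_max = proportions[counts.index(x_max)]
    needed = []
    for x_i, c_i in zip(counts, proportions):
        w_i = c_i / c_max                    # assumed augmentation weight
        y_i = x_max * w_i - x_i              # formula from the text
        needed.append(max(0, round(y_i)))    # whole, non-negative counts
    return needed


# Two families with 132 and 1541 samples and a target ratio of 8:10,
# the sizes and ratio reported for the first experiment.
generated = samples_to_generate([132, 1541], [8, 10])
```

Under these assumptions the smaller family needs 1101 generated samples (1541 × 0.8 − 132 = 1100.8, rounded) and the largest family needs none, so the balanced set would hold 1233 and 1541 samples respectively.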
Compared with the prior art, the invention has the following advantages:
1. Scarce data hinders model development, whereas a model trained on high-quality data tends to be more robust, and a good data set can even make training simpler and faster. The method augments malicious code data with the generation model obtained from cGAN training, generating malicious code samples that retain the real characteristics of malicious code as far as possible and thereby expanding the malicious code data set.
2. Because the number of samples differs greatly between malicious code families, a classifier trained directly on the data set is prone to over-fitting, so a suitable sample proportion is very important for the training set. In practice it is difficult to find optimal initial weights for dozens of malicious code families. Swarm intelligence algorithms are an effective way to solve complex combinatorial optimization problems, and using one makes it possible to optimize the initial weights of the different malicious code families.
3. The swarm intelligence algorithm yields the acceptable optimal sample proportion of each malicious code family, and the cGAN learns the data distribution of the different families and generates samples. The imbalanced data set is thereby turned into a malicious code data set with relatively balanced samples, so that the different classes of malicious code reach an ideal proportion when selected and positive and negative samples carry equal weight during training, which addresses the data imbalance problem more effectively.
Drawings
FIG. 1 is a flow chart of the construction of a balanced malicious code data set.
FIG. 2 is a flow chart of the swarm intelligence algorithm.
FIG. 3 shows the cGAN-based malicious code data augmentation model.
Detailed Description
To make the objects, technical solutions and features of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
The flow of constructing the balanced malicious code data set is shown in FIG. 1 and comprises the following steps:
Step S10: construct a malicious code generation model.
Step S20: calculate the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm.
Step S30: generate malicious code for each family and construct a relatively balanced malicious code data set.
In an embodiment, step S10 of constructing the malicious code generation model further comprises the following steps:
Step S100: in the generator network, a random input z is drawn from the prior random distribution $p_z(z)$ and concatenated with the malicious code family label y to form a new latent representation.
Step S110: in the discriminator network, either a real malicious code sample or a generated malicious code sample is input together with the family label y for discrimination.
Step S120: through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to produce convincing fakes. The two networks compete dynamically and are optimized continuously during iteration. When D can finally no longer distinguish real data from generated data, that is, when D takes the generated data G(z) as real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data. The generator network is then the generation model for malicious code, and the data it generates can be regarded as new malicious code sample data.
In an embodiment, step S30 of generating malicious code for each family and constructing a relatively balanced malicious code data set further comprises the following steps:
Step S300: generate data with the malicious code generation model trained in step S120, according to the optimal initial sample proportion calculated in step S20.
Step S310: combine the generated data set with the original data set to construct a relatively balanced malicious code data set.
The practical effect of the proposed method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN was verified experimentally. The test environment was an Ubuntu 14.04 host with 8 GB of memory and a 1 TB hard disk. The experimental data came from the Malware Images data set and the data set generated by the cGAN. Two experiments were set up:
Experiment 1: two families, Swizzor.gen!I (132 samples) and Ramnit (1541 samples), were selected from the Malware Images data set for a comparative experiment; the class imbalance ratio between them exceeds 1:10. The cGAN network was used to generate samples for the Swizzor.gen!I family. Table 1 shows the classification accuracy of AlexNet models trained with different proportions of Swizzor.gen!I and Ramnit samples.
Table 1. Experimental results of models trained with different proportions
[Table 1 is an image in the original and is not reproduced here.]
Table 1 shows that generating Swizzor.gen!I family data with the cGAN improves the classification accuracy of the model, and that the AlexNet model trained with a Swizzor.gen!I to Ramnit sample ratio of 8:10 has the best classification accuracy.
Experiment 2: in practice it is difficult to find optimal initial weights for dozens of malicious code families, so the PSO algorithm was used to calculate the optimal sample proportion of the malicious code. The data set obtained after generating data according to this proportion (D3) was compared with the original sample data set (D1) and a data set in which all families have the same sample proportion (D2). Table 2 shows the classification performance of the models trained on the three data sets.
Table 2. Experimental results of models trained on the three data sets
Data set    D1     D2     D3
ROC area    0.905  0.956  0.974
The comparison shows that the method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN substantially improves classification on imbalanced malicious code data.
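Table 2 reports the area under the ROC curve. As a reminder of what that metric measures, the sketch below computes it for a binary problem as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The labels and scores are made-up toy values, and the patent does not say how its multi-family ROC area was computed, so this illustrates the metric only and does not reproduce the experiment.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive),
    computed by pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (score 0.4) ranks below one negative (0.7),
# so 8 of the 9 positive/negative pairs are ordered correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Because it depends only on the ranking of scores, this metric is less sensitive to class imbalance than raw accuracy, which makes it a common choice for imbalanced data.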

Claims (2)

1. A method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN, characterized in that it comprises the following steps:
step 1, constructing a malicious code generation model;
a generative adversarial network (GAN) consists of a generator network G and a discriminator network D, which play a dynamic game: G tries to deceive D by passing off generated samples as real, while D keeps improving its ability to distinguish real data from data synthesized by G, until the two finally reach a Nash equilibrium, at which point the data distribution $P_g$ generated by G theoretically equals the real data distribution $P_{data}$; a conditional generative adversarial network (cGAN) steers data generation through a controlling parameter, namely, on top of the original network structure, additional auxiliary information y is added to the inputs of the discriminator and the generator, y being the class label of each sample, and here the family label of the malicious code; after continuous adversarial training and iterative optimization between the generator network and the discriminator network, the generator is used as the generation model for malicious code;
step 1.1, in the generator network, drawing a random input z from the prior random distribution $p_z(z)$ and concatenating it with the malicious code family label y to form a new latent representation;
step 1.2, in the discriminator network, inputting either a real malicious code sample or a generated malicious code sample together with the family label y for discrimination;
step 1.3, through repeated iterative learning, the discriminator network D improves its ability to tell real samples from fake ones, and the generator network G improves its ability to produce convincing fakes; the two networks compete dynamically and are optimized continuously during iteration; when D can finally no longer distinguish real data from generated data, that is, when D takes the generated data G(z) as real, the model is considered optimal and G is considered to have learned the full distribution of the real sample data; the generator network is then the generation model for malicious code, and the data it generates is regarded as new malicious code sample data;
step 2, calculating the acceptable optimal initial sample proportion of the malicious code using a swarm intelligence algorithm;
a typical swarm intelligence algorithm, PSO, is used to find acceptable optimal initial weights for the different malicious code families; assuming that the number of malicious code families is M and the resampling weight of family i is $W_i$, a combination of sampling weights is regarded as the position of an individual in the swarm intelligence algorithm, given by:
$position = (W_1, W_2, \ldots, W_M)$
the accuracy of the trained model is used as the objective function;
step 3, generating malicious code for each family and constructing a relatively balanced malicious code data set;
according to the optimal sample proportion of the malicious code families calculated by the PSO algorithm, each family is augmented to a different degree through the cGAN model, that is, samples of each class are generated by the generation model, so as to construct a data-balanced malicious code sample set;
step 3.1, generating data with the malicious code generation model trained in step 1.3, according to the optimal initial sample proportion calculated in step 2;
step 3.2, combining the generated data set with the original data set to construct a relatively balanced malicious code data set.
2. The method for handling imbalanced malicious code data based on a swarm intelligence algorithm and cGAN as claimed in claim 1, characterized in that: in step 3, assuming the malicious code data set used for classification has M classes, let $X = (X_1, X_2, \ldots, X_{max}, \ldots, X_m)$ be the training samples of the families (with $X_i$ also denoting the number of samples in family i), where $X_{max}$ belongs to the malicious code family with the largest number of samples, and let $C = (C_1, C_2, \ldots, C_m)$ be the optimal sample proportions of the malicious code families obtained by the swarm intelligence method, where $m \in \mathbb{N}^+$ is a positive integer;
the number of samples to generate for a class is calculated from the sample count $X_{max}$ of the largest family in the original data set, the data augmentation weight $W_i$ of the class, and the sample count $X_i$ of the class; the formula is:
$Y_i = X_{max} W_i - X_i$
the data augmentation weight $W_i$ is calculated as follows:
[Formula for $W_i$ in terms of $C_i$ and $C_{max}$. It is an image in the original and is not reproduced here.]
where $Y_i$ is the number of samples to be generated for the family of class i, $C_i$ is the i-th value in the optimal sample proportion, $C_{max}$ is the value in the optimal sample proportion for the family with the largest number of samples, $i \in [1, M]$ is the class of the sample, and M is the number of classes;
this yields $Y = (Y_1, Y_2, \ldots, Y_m)$, the amount of data to be generated for each family, from which a malicious code data set with relatively balanced sample data across families is constructed: $X = (X_1 + Y_1, X_2 + Y_2, \ldots, X_{max}, \ldots, X_m + Y_m)$.
CN202110182166.XA 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN Active CN112800426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110182166.XA CN112800426B (en) 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110182166.XA CN112800426B (en) 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Publications (2)

Publication Number Publication Date
CN112800426A (en) 2021-05-14
CN112800426B (en) 2024-03-22

Family

ID=75815048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182166.XA Active CN112800426B (en) 2021-02-09 2021-02-09 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Country Status (1)

Country Link
CN (1) CN112800426B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704842A (en) * 2019-09-27 2020-01-17 山东理工大学 Malicious code family classification detection method
CN110795732A (en) * 2019-10-10 2020-02-14 南京航空航天大学 SVM-based dynamic and static combination detection method for malicious codes of Android mobile network terminal
CN111259393A (en) * 2020-01-14 2020-06-09 河南信息安全研究院有限公司 Anti-concept drift method of malicious software detector based on generation countermeasure network
CN111753299A (en) * 2020-06-22 2020-10-09 重庆文理学院 Unbalanced malicious software detection method based on packet integration

Also Published As

Publication number Publication date
CN112800426B (en) 2024-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant