CN115442087A - Intelligent attack detection method and system based on DNA calculation - Google Patents

Publication number
CN115442087A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210999367.3A
Other languages
Chinese (zh)
Other versions
CN115442087B (en)
Inventor
曾增日
赵宝康
彭伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210999367.3A (patent CN115442087B)
Publication of CN115442087A
Application granted
Publication of CN115442087B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/002 Biomolecular computers, i.e. using biomolecules, proteins, cells

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Organic Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an intelligent attack detection method and system based on DNA calculation. The method comprises: encoding flow characteristics into bases to form DNA sequences, randomly initializing a plurality of DNA sequences to form a parent population P, and randomly selecting N DNA sequences from the parent population P as an offspring population S; performing DNA biochemical reactions on the offspring population S and then merging it with the parent population P to obtain a population Q; calculating a classification performance index for each DNA sequence in the population Q as the fitness of that DNA sequence; sorting all DNA sequences in the population Q by fitness with a non-dominated sorting algorithm and selecting a new parent population P; and, if the iteration count Gen equals the set maximum iteration count GenMax or the fitness of each DNA sequence no longer changes, selecting the optimal DNA sequence from the parent population P according to fitness and decoding its bases into flow characteristics to obtain the optimal feature subset. The invention can improve classification accuracy and thereby address the imbalance problem, and can also reduce the feature dimension and screen out the optimal subset to improve the detection rate.

Description

Intelligent attack detection method and system based on DNA calculation
Technical Field
The invention relates to network security technology, and in particular to an intelligent attack detection method and system based on DNA calculation.
Background
In recent years, intrusion techniques such as DDoS attacks, APT attacks and data theft have emerged one after another and have paralyzed networks and interconnected devices. Network security incidents are frequent worldwide and have become a nightmare for personal, organizational and even national security. Establishing an attack detection system that effectively identifies network attack behaviors has therefore become a priority in network security. However, with the explosive growth of network traffic, abnormal traffic has become complex, concealed and diverse, which causes traditional attack detection systems to fail. To cope with this trend, researchers have developed many attack detection methods based on rough set theory, machine learning and deep learning. Traditional detection methods based on rough set theory and the like can still detect network attacks when the data are incomplete. Methods based on machine learning can optimize classifier performance by continuously learning and accumulating experience, and are a better solution than traditional detection methods. Deep learning can learn the intrinsic rules and representation levels of the training data; the nonlinear network structure built from hidden layers is more efficient, builds the model automatically according to the problem, and is not limited to one fixed problem.
At present, the latest detection methods focus only on overall detection accuracy and feature dimensionality reduction, and ignore two problems: feature subset optimization and imbalanced detection. As network traffic grows, the feature information it contains also grows, but it includes more noise and redundant information. This noise and redundant information is intricately entangled with the useful information, which increases the computational complexity and time cost of attack detection. During feature dimensionality reduction, useful information may be deleted while part of the noise and redundant information is retained, which poses a serious challenge to attack detection technology. In addition, existing detection data sets are imbalanced: samples of some attack types are difficult to collect, so those types have few instances. Conventional feature selection methods tend to favor the interaction between features and labels with many instances and to ignore the association between features and labels with few instances, which aggravates the detection imbalance.
In a data set with imbalanced classes, the majority classes consist of normal samples and frequent attack samples, and the minority classes consist of infrequent, rare attack samples. A detection system trained on such a data set usually produces results with large errors for minority attack classes, which seriously affects the robustness of the detection and classification system.
To address data imbalance, many studies have tried to change the distribution of class samples at the data level. The first data-level approach achieves class balance by artificially synthesizing minority-class samples. Typically, data augmentation methods such as interpolation, oversampling and encoders are used to balance the data set and obtain better experimental results. The second approach achieves class balance by manually deleting majority-class samples. Typically, Liu et al. propose a new difficult-set sampling technique that compresses the majority samples in the difficult set using the K-means algorithm, reducing the majority classes to achieve class balance. However, artificially synthesizing samples easily causes model overfitting, and artificially deleting samples easily loses important information in the majority-class samples. To solve these data-level problems, Bedi et al. change the classification algorithm so that the detection system focuses on the minority classes at the algorithm level to achieve detection balance, and propose Siam-IDS, a novel detection system based on a Siamese neural network. The system can detect the R2L and U2R attacks, which have few instances in the NSL-KDD data set, without using conventional class balancing techniques. Although focusing on minority-class samples effectively avoids the overfitting and information loss of data-level methods, some classes still cannot be detected in problems with many minority classes or in multi-class problems, and the accuracy on the majority classes cannot be guaranteed.
DNA calculation is a brand-new research field. In 1994, Professor Adleman successfully solved a 7-vertex directed Hamiltonian path problem using DNA molecules as the computing medium, which became a milestone in the development of DNA calculation. DNA calculation is a bionic optimization algorithm based on biological DNA coding and evolution mechanisms. It is very effective for complex combinatorial optimization problems, and its greatest advantage is that it makes full use of the mass storage capacity of DNA molecules and the huge parallelism of biochemical reactions. Zang et al. (Zang W, Ren L, Zhang W, et al. A cloud model based DNA genetic algorithm for numerical optimization problems. Future Generation Computer Systems, 2018, 81: 465-477.) replace the 0 and 1 codes of the traditional evolutionary algorithm by simulating the DNA coding mode, and their experiments show that DNA calculation offers rich population diversity and a high convergence rate. Jatoth et al. (Jatoth C, Gangadharan G R, Buyya R. Optimal fitness aware cloud service composition using an adaptive genotypes evolution based genetic algorithm. Future Generation Computer Systems, 2019, 94: 185-198.) propose using various biochemical reactions of biomolecules to complete the DNA calculation process, representing the information carried by the system with DNA codes and simulating various manipulations of DNA molecules to discover and process the information, while constantly acquiring and updating the information during evolution. Shukla et al. (Shukla A, Pandey H M, Mehrotra D. Comparative review of selection techniques in genetic algorithm. 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE). IEEE, 2015: 515-519.) show that the DNA calculation method not only gives full play to the innovative ideas of DNA calculation, but also solves various complex optimization problems that exist widely in engineering fields such as automatic control, pattern recognition, decision making and machine learning.
The optimization algorithm based on DNA calculation is a further development of the genetic algorithm. It retains the characteristics of conventional genetic algorithms: it is intelligent, self-organizing and self-learning, it evolves a coded population formed from parameters, and it uses random operations to guide the search toward the optimum. DNA calculation has several advantages over common genetic algorithms: (1) DNA coding allows the population to contain more information and to have more diverse expression capabilities, so that more complex knowledge can be expressed with shorter DNA strands. (2) By introducing molecular-level operations, DNA calculation greatly extends the optimization method, so that the optimization algorithm converges faster and premature convergence can be avoided. (3) The variable length of the chromosome makes it easier for DNA calculation to insert and delete base sequences, which is better suited to optimizing complex knowledge.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides an intelligent attack detection method and system based on DNA calculation. By expressing flow characteristics as DNA sequences and generating new flow characteristics through DNA biochemical reactions, the invention enriches the diversity of the flow characteristic population. By using classification performance indexes as the fitness of each DNA sequence for sorting with a non-dominated sorting algorithm, the invention effectively addresses the premature convergence that occurs when non-dominated sorting and similar algorithms are used for feature optimization, and accelerates convergence. The invention can therefore improve classification accuracy to address the imbalance problem, and can reduce the feature dimension and screen out the optimal subset to improve the detection rate.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
An intelligent attack detection method based on DNA calculation comprises the following steps:
S1, encoding the flow characteristics into DNA sequences consisting of bases, and randomly initializing a plurality of DNA sequences to form a parent population P, wherein an individual in the parent population P is a DNA sequence, the DNA sequence consists of a fixed number of bases, and each base represents the state of one flow characteristic and takes a value in {0,1,2,3}; initializing the iteration count Gen;
S2, randomly selecting N DNA sequences from the parent population P as an offspring population S;
S3, performing DNA biochemical reactions on the DNA sequences in the offspring population S;
S4, merging the parent population P and the offspring population S to obtain a population Q;
S5, for each DNA sequence in the population Q, decoding the bases into flow characteristics to obtain a feature subset, and performing attack classification detection based on the feature subset to obtain a corresponding classification performance index as the fitness of that DNA sequence;
S6, sorting all DNA sequences in the population Q by fitness with a non-dominated sorting algorithm, and selecting a specified number of DNA sequences according to the sorting result to form a new parent population P;
S7, judging whether the iteration count Gen equals the set maximum iteration count GenMax or whether the fitness of each DNA sequence no longer changes; if neither holds, adding 1 to the iteration count Gen and jumping to step S2; otherwise, selecting an optimal DNA sequence from the new parent population P according to fitness, and decoding the optimal DNA sequence into flow characteristics as the optimal feature subset.
Optionally, step S3 includes: first, for each DNA sequence in the offspring population S, decoding the bases into flow characteristics to obtain a feature subset, performing attack classification detection based on the feature subset, and calculating the classification performance index of the feature subset; then performing DNA biochemical reactions on the DNA sequences in the offspring population S based on the classification performance indexes, wherein the DNA biochemical reactions include some or all of: crossover between different DNA sequences, variation (mutation) within the same DNA sequence, and reverse order (inversion) within the same DNA sequence.
Optionally, the crossover between different DNA sequences comprises:
Step 1, randomly selecting a plurality of pairs of DNA sequences;
Step 2, traversing the plurality of pairs and selecting one pair of DNA sequences as the current DNA sequence pair;
Step 3, judging whether the distance d of the current DNA sequence pair is larger than a set value D; if so, jumping to step 1; otherwise, continuing with the next step;
Step 4, calculating the cross probability Pm according to the classification performance indexes and generating a random number Pr; if Pr is less than Pm, performing base transposition on the current DNA sequence pair; otherwise, performing base inversion on the current DNA sequence pair. The function expression for calculating the cross probability Pm according to the classification performance indexes is:
Figure BDA0003806922120000041
In the above formula, $K_1$ and $K_2$ are constant parameters with a value range of [0,p], where p is the dimension of the flow characteristic; $f_{max}$ is the optimal classification performance index over all DNA sequences in the offspring population S; $f'$ is the average classification performance index of the current DNA sequence pair after crossover; and $f_{avg}$ is the average classification performance index over all DNA sequences in the offspring population S. Here $f'$ is obtained by decoding the bases of the two DNA sequences produced by crossing the current DNA sequence pair into flow characteristics to obtain two feature subsets, performing attack classification detection based on each feature subset, and averaging the classification performance indexes of the two feature subsets;
Step 5, judging whether all pairs of DNA sequences have been traversed; if not, jumping to step 2; otherwise, determining that the crossover between different DNA sequences is finished.
Optionally, the variation (mutation) within the same DNA sequence includes: first selecting a DNA sequence according to the variation probability Pc, and then randomly selecting bases in the selected DNA sequence for base variation, wherein base variation comprises base replacement, deletion and insertion, and base replacement comprises conversion between bases of the same type and mutual replacement of bases of different types. The calculation function expression of the variation probability Pc is:
Figure BDA0003806922120000042
In the above formula, $K_3$ and $K_4$ are constant parameters with a value range of [0,p], where p is the dimension of the flow characteristic; $f_{max}$ is the optimal classification performance index over all DNA sequences in the offspring population S; $f$ is the classification performance index of the selected DNA sequence; and $f_{avg}$ is the average classification performance index over all DNA sequences in the offspring population S.
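The base replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the integer coding A=0, G=1, C=2, T=3 is assumed here (the text only says bases take values in {0,1,2,3}), the variation probability is passed in as a plain parameter `p_mut` because its formula is only available as an image, and `transition_rate`, `substitute_base` and `mutate` are hypothetical names. Deletion and insertion are left out because they change the sequence length.

```python
import random
from typing import List

# Assumed coding (not given in the text): A=0, G=1 are purines; C=2, T=3 are pyrimidines.
PURINES = (0, 1)
PYRIMIDINES = (2, 3)


def substitute_base(base: int, rng: random.Random, transition_rate: float = 0.5) -> int:
    """Replace one base, either within its own type or with a base of the other type."""
    if transition_rate > rng.random():
        # Same-type replacement: A and G swap, C and T swap (flip the lowest bit).
        return base ^ 1
    # Different-type replacement: pick any base from the other group.
    other_group = PYRIMIDINES if base in PURINES else PURINES
    return rng.choice(other_group)


def mutate(dna: List[int], p_mut: float, rng: random.Random) -> List[int]:
    """With probability p_mut, replace one randomly chosen base of the sequence."""
    child = list(dna)
    if rng.random() >= p_mut:
        return child  # sequence not selected for variation
    position = rng.randrange(len(child))
    child[position] = substitute_base(child[position], rng)
    return child
```

Because both replacement branches always return a base different from the input, a selected sequence differs from its parent in exactly one position.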
Optionally, the reverse order (inversion) within the same DNA sequence comprises: first selecting a DNA sequence according to a reverse order probability PI, and then randomly selecting two positions in the selected DNA sequence and reversing the order of the bases between them. The calculation function expression of the reverse order probability PI is:
Figure BDA0003806922120000043
In the above formula, $K_5$ is a constant parameter with a value range of [0,p], where p is the dimension of the flow characteristic; $f_{max}$ is the optimal classification performance index in the offspring population S; and $f_i$ is the classification performance index of the selected DNA sequence.
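The reverse order operation can be illustrated with a short sketch. It assumes that the segment between the two chosen positions is reversed inclusively and takes the reverse order probability as a plain parameter `p_inv`, since its formula is only available as an image; `invert_segment` and `maybe_invert` are hypothetical names.

```python
import random
from typing import List


def invert_segment(dna: List[int], rng: random.Random) -> List[int]:
    """Pick two distinct positions and reverse the bases between them (inclusive)."""
    i, j = sorted(rng.sample(range(len(dna)), 2))
    return dna[:i] + dna[i:j + 1][::-1] + dna[j + 1:]


def maybe_invert(dna: List[int], p_inv: float, rng: random.Random) -> List[int]:
    """Apply the inversion with probability p_inv, otherwise return a copy."""
    if rng.random() >= p_inv:
        return list(dna)
    return invert_segment(dna, rng)
```

Inversion only reorders bases, so the multiset of bases in the sequence is preserved; only the assignment of states to flow characteristics changes.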
Optionally, the classification performance index in step S4 includes part or all of the actual detection accuracy, the imbalance index σ, and the optimal feature index Γ.
Optionally, when attack classification detection is performed based on the feature subset, the function expression of the classifier used is:
Figure BDA0003806922120000051
In the above formula, the symbol shown in Figure BDA0003806922120000052 represents the classification prediction result label obtained by the classifier, and f represents the mapping established by the classifier from the flow characteristic set X to the classification prediction result label (Figure BDA0003806922120000053). The calculation function expression of the actual detection accuracy is:
Figure BDA0003806922120000054
In the above formula, $ACC_i$ is the actual detection accuracy of the i-th attack type in attack classification detection, Y represents the true classification corresponding to the flow characteristic set X, and n is the number of samples;
The calculation function expression of the imbalance index σ is:
Figure BDA0003806922120000055
In the above formula, n is the number of attack categories and E(i) is the detection accuracy expected for the i-th attack category during attack classification detection. The calculation function expression of the optimal characteristic index Γ is:
Figure BDA0003806922120000056
In the above formula, $ACC_{total}$ is the actual detection accuracy over all attack types during attack classification detection, σ is the imbalance index,
Figure BDA0003806922120000057
corresponds to the feature subset
Figure BDA0003806922120000058
and p is the dimension of the flow characteristic. The screening function expression of the feature subset
Figure BDA0003806922120000059
is:
Figure BDA00038069221200000510
In the above formula, k(i) indicates whether the flow characteristic i is selected: k(i) = 0 indicates that the flow characteristic i is not selected, and any other value indicates that it is selected.
Optionally, step S6 includes: performing constrained-domination sorting on all DNA sequences based on fitness using the NSGA-III algorithm; and selecting a specified number of DNA sequences according to the result of the constrained-domination sorting using the Niche-Preservation operation to form a new parent population P.
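The sorting part of step S6 can be illustrated with a plain Pareto-front sort. This is a sketch only: it assumes every objective has been oriented so that larger is better (for example by negating an index that should be minimized), and it fills the new parent population front by front, truncating the last front by index. The reference-point Niche-Preservation operation of NSGA-III is not implemented here, and `dominates`, `non_dominated_fronts` and `select_parents` are hypothetical names.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse on every objective and better on at least one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    better = any(x > y for x, y in zip(a, b))
    return no_worse and better


def non_dominated_fronts(fitness: List[Sequence[float]]) -> List[List[int]]:
    """Split individual indices into successive Pareto fronts (best front first)."""
    remaining = set(range(len(fitness)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(fitness[j], fitness[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def select_parents(fitness: List[Sequence[float]], size: int) -> List[int]:
    """Fill the new parent population front by front (simplified truncation)."""
    chosen: List[int] = []
    for front in non_dominated_fronts(fitness):
        chosen.extend(front[: size - len(chosen)])
        if len(chosen) == size:
            break
    return chosen


# Two illustrative objectives per DNA sequence, both to be maximized.
scores = [(0.9, 0.8), (0.7, 0.9), (0.6, 0.6), (0.9, 0.7)]
print(non_dominated_fronts(scores))  # [[0, 1], [3], [2]]
print(select_parents(scores, 3))     # [0, 1, 3]
```

In a full NSGA-III selection, the truncation of the last admitted front would be decided by association with reference points on the normalized hyperplane rather than by index order.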
In addition, the invention also provides an intelligent attack detection system based on DNA calculation, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the intelligent attack detection method based on DNA calculation.
Furthermore, the present invention also provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the steps of the intelligent attack detection method based on DNA calculation.
Compared with the prior art, the invention mainly has the following advantages:
1. By expressing the flow characteristic set as DNA sequences, the invention generates new characteristic sets through DNA biochemical reactions and can select the corresponding feature subset according to the characteristic states, thereby enriching the diversity of the population of flow feature subsets.
2. By using the classification performance indexes as the fitness of each DNA sequence for sorting with a non-dominated sorting algorithm, the invention effectively addresses the premature convergence that occurs when non-dominated sorting and similar algorithms are used for feature optimization, and accelerates convergence. The invention can therefore improve the accuracy of each class to address the imbalance problem, and can reduce the feature dimension and screen out the optimal subset to improve the detection rate.
3. Massive network data bring a large amount of redundant information, and ever-changing network attack types make samples difficult to collect, so the sample sizes of the attack types in a data set are seriously imbalanced. These two problems seriously reduce the robustness of existing detection methods. The method uses DNA biochemical reactions to generate new flow feature subsets, can process massive flow characteristic data, and has the advantage of strong robustness.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of DNA calculation in the embodiment of the present invention.
FIG. 3 is a flow chart of DNA biochemical reactions in an embodiment of the present invention.
FIG. 4 is a flow chart of the crossover between DNA sequences in an embodiment of the present invention.
FIG. 5 is a flow chart of performing classification to obtain a classification performance index according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of a population selection process in an embodiment of the present invention.
FIG. 7 is a diagram illustrating a distribution of reference points on a normalized hyperplane in an embodiment of the present invention.
Detailed Description
As shown in fig. 1, the intelligent attack detection method based on DNA calculation of the present embodiment includes:
S1, encoding the flow characteristics into DNA sequences consisting of bases, and randomly initializing a plurality of DNA sequences to form a parent population P, wherein an individual in the parent population P is a DNA sequence, the DNA sequence consists of a fixed number of bases, and each base represents the state of one flow characteristic and takes a value in {0,1,2,3}; initializing the iteration count Gen;
S2, randomly selecting N DNA sequences from the parent population P as an offspring population S;
S3, performing DNA biochemical reactions on the DNA sequences in the offspring population S;
S4, merging the parent population P and the offspring population S to obtain a population Q;
S5, for each DNA sequence in the population Q, decoding the bases into flow characteristics to obtain a feature subset, and performing attack classification detection based on the feature subset to obtain a corresponding classification performance index as the fitness of that DNA sequence;
S6, sorting all DNA sequences in the population Q by fitness with a non-dominated sorting algorithm, and selecting a specified number of DNA sequences according to the sorting result to form a new parent population P;
S7, judging whether the iteration count Gen equals the set maximum iteration count GenMax or whether the fitness of each DNA sequence no longer changes; if neither holds, adding 1 to the iteration count Gen and jumping to step S2; otherwise, selecting an optimal DNA sequence from the new parent population P according to fitness, and decoding the optimal DNA sequence into flow characteristics as the optimal feature subset.
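The control flow of steps S1 to S7 can be summarized in a short sketch. It is an illustration under loud assumptions rather than the patented method: the classifier-based fitness is replaced by a caller-supplied scalar function `evaluate`, the DNA biochemical reaction by a caller-supplied operator `react`, and the non-dominated sorting of S6 by an ordinary sort on that scalar. All function and parameter names are hypothetical.

```python
import random
from typing import Callable, List

Dna = List[int]  # one base per flow characteristic, each in {0, 1, 2, 3}


def evolve(p: int, pop_size: int, n_offspring: int, gen_max: int,
           evaluate: Callable[[Dna], float],
           react: Callable[[Dna, random.Random], Dna],
           rng: random.Random) -> Dna:
    # S1: random parent population P of fixed-length DNA sequences.
    parents = [[rng.randrange(4) for _ in range(p)] for _ in range(pop_size)]
    previous = None
    for _gen in range(gen_max):
        # S2: draw N sequences from P as the offspring population S.
        offspring = [list(d) for d in rng.sample(parents, n_offspring)]
        # S3: apply the stand-in reaction operator to each offspring.
        offspring = [react(d, rng) for d in offspring]
        # S4: merge P and S into Q.
        merged = parents + offspring
        # S5 + S6: score every sequence and keep the best pop_size of them.
        # Assumption: a scalar sort stands in for non-dominated sorting.
        merged.sort(key=evaluate, reverse=True)
        parents = merged[:pop_size]
        # S7: stop early once the fitness values no longer change.
        current = [evaluate(d) for d in parents]
        if current == previous:
            break
        previous = current
    return parents[0]


def toy_fitness(dna: Dna) -> float:
    """Toy objective for the demo only: prefer sequences with small base values."""
    return -float(sum(dna))


def toy_react(dna: Dna, rng: random.Random) -> Dna:
    """Toy reaction for the demo only: overwrite one random base."""
    child = list(dna)
    child[rng.randrange(len(child))] = rng.randrange(4)
    return child


best = evolve(8, 10, 4, 50, toy_fitness, toy_react, random.Random(0))
print(best, toy_fitness(best))
```

Because Q always contains the previous parents, the best fitness in P never decreases from one generation to the next under this simplified selection.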
To solve the problem that the optimal feature subset cannot be evolved because the non-dominated sorting algorithm converges prematurely, the intelligent attack detection method based on DNA calculation of this embodiment introduces DNA calculation to accelerate convergence, so that the non-dominated sorting selection can escape local optima and reach the global optimum.
When the flow characteristics are encoded as bases in step S1, the DNA encoding principle is as follows: a DNA sequence is composed of nucleotides built from 4 types of bases, namely adenine (A), guanine (G), cytosine (C) and thymine (T) [37]. The different arrangements of these four types of bases in a DNA sequence make the genetic information it can express very rich. If the bases are described mathematically, a single nucleotide X can be regarded as an integer between 0 and 3, i.e., $X \in \{0,1,2,3\}$, and the DNA sequence can be expressed as
Figure BDA0003806922120000071
Figure BDA0003806922120000072
DNA coding is a key link of the whole algorithm: all subsequent calculation is carried out on the basis of the initial coding, and the length of the DNA sequence and the size of the population also determine the convergence speed and accuracy of the final solution.
Step S3 performs DNA biochemical reactions on the DNA sequences in the offspring population S. A DNA biochemical reaction is a biochemical reaction that occurs between DNA sequences or at the molecular level within an individual in a biochemical environment; in this embodiment it refers specifically to base exchange between DNA sequences, exchange of bases or subsequences within an individual DNA sequence, inversion of a subsequence, mutation of a base, and the like. Based on these principles, the following assumptions are made for a population composed of flow characteristics: (1) Each nucleotide represents one flow characteristic; the nucleotides formed from A, T, C and G represent different states of a characteristic and carry the information of whether the characteristic is selected. The four kinds of bases form a DNA sequence of fixed length p, where p is the dimension of the flow characteristic. (2) Every population consists of DNA sequences, and different biochemical operations can be performed on any DNA sequence. (3) Each DNA sequence has its own classification performance index value, which represents the survival and replication capacity of that DNA sequence.
As shown in fig. 2, the principle of the DNA biochemical reaction performed on the DNA sequences in the offspring population S in step S3 is as follows. First, the characteristics are encoded as bases to form a population of DNA sequences. Each base carries the information of whether a characteristic is selected, so the DNA population can be viewed as a population of feature subsets. Second, biochemical reactions such as selection, crossover, mutation and inversion are carried out in sequence on the DNA population in a specific biochemical environment until a stable population state is reached. The DNA sequences are then decoded to obtain the solution of the problem, namely the optimal stable feature subset population.
For example, in the design stage of the DNA calculation algorithm, the problem to be solved is to select an appropriate subset of flow characteristics, so the characteristics are encoded according to the problem. The core of the encoding is to preprocess every flow characteristic $f_i$ into a base that takes a value in {0,1,2,3}. Next, a population of DNA sequences is formed: the p-dimensional characteristics are encoded into a p-dimensional DNA sequence according to the nucleotide coding rule, for example $F_1$ = {0,2,3,1,0,3,1,0}. The individuals $F_1$ = {0,2,3,1,0,3,1,0}, ..., $F_n$ = {1,0,3,1,2,0,1,3} are then combined into a DNA population consisting of n individuals. In the biochemical reaction stage, bases are transformed between and within individuals of the population through biochemical reactions such as crossover, mutation and inversion, giving post-reaction individuals such as $F_1$ = {0,1,0,3,3,0,2,3}, ..., $F_n$ = {1,0,1,1,2,1,1,3}. In the problem solving stage, with the screening rule that a characteristic is selected when mod($f_i$, 2) = 0, the characteristics finally selected for $F_1$ are $f_1$, $f_3$, $f_6$ and $f_7$, which form a feature subset. The feature subset is then put into the classification performance calculation model, and the performance set composed of the accuracy of each category, the imbalance index and the like is calculated as the fitness of $F_1$. Finally, N individuals (N less than n) are selected from the n individuals according to fitness to form the next-generation parent population.
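The decoding rule in the example above (a characteristic is selected when its base value is even) can be checked with a few lines of code. This is a minimal sketch; `random_dna` and `decode_selected` are hypothetical names, and characteristics are numbered from 1 to match $f_1$, ..., $f_p$ in the text.

```python
import random
from typing import List


def random_dna(p: int, rng: random.Random) -> List[int]:
    """Randomly initialize a DNA sequence of p bases, each in {0, 1, 2, 3}."""
    return [rng.randrange(4) for _ in range(p)]


def decode_selected(dna: List[int]) -> List[int]:
    """Return the 1-based indices of the characteristics whose base value is even."""
    return [i for i, base in enumerate(dna, start=1) if base % 2 == 0]


# F_1 after the biochemical reaction stage, as given in the example.
f1_after_reaction = [0, 1, 0, 3, 3, 0, 2, 3]
print(decode_selected(f1_after_reaction))  # [1, 3, 6, 7]
```

The printed indices correspond to $f_1$, $f_3$, $f_6$ and $f_7$, the feature subset named in the text.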
Specifically, step S3 in this embodiment includes: first, for each DNA sequence in the progeny population S, decoding the bases into flow features to obtain a feature subset, performing attack classification detection based on the feature subset, and calculating the classification performance index of the feature subset; then performing DNA biochemical reactions on the DNA sequences in the progeny population S based on the classification performance indexes, where the DNA biochemical reactions include some or all of: crossover between different DNA sequences, mutation (also called variation) within the same DNA sequence, and reverse order (also called inversion) within the same DNA sequence.
As an optional implementation, as shown in fig. 3, the operations in this embodiment are performed in the order crossover, mutation, reverse order. If the new parent population has not reached the desired target value, or the number of iterations is insufficient, the simulated DNA biochemical reactions of crossover, mutation and reverse order are repeated until the requirement is met. Through this loop, the DNA population comes to contain a wide variety of individuals, the target value of each individual moves closer to the optimum, and the average target value keeps improving. The iteration ends once the optimal solution has been found and the fitness of the DNA sequences no longer changes, or once a set limit is reached.
Crossover is a biochemical reaction between DNA sequences. First, several pairs of DNA sequences are randomly selected and a crossover position is randomly generated for each pair; the contents of each pair are then exchanged at the crossover position to produce a new pair of DNA sequences. In this way, the genes of the DNA population are substantially altered. Crossover can be performed as single-point crossover, sub-path crossover, standard crossover, and so on.
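The exchange at a crossover position described above can be sketched in Python as a plain single-point crossover. This is an illustrative sketch only: the function name `single_point_crossover` is hypothetical, the crossover position is passed in explicitly rather than drawn at random, and the distance check and probability test of the preferred embodiment are not modelled.

```python
from typing import List, Sequence, Tuple


def single_point_crossover(
    a: Sequence[int], b: Sequence[int], position: int
) -> Tuple[List[int], List[int]]:
    """Swap the tails of two equal-length DNA sequences after `position` bases."""
    if len(a) != len(b):
        raise ValueError("DNA sequences must have the same length p")
    if not 0 < position < len(a):
        raise ValueError("crossover position must lie strictly inside the sequence")
    child_a = list(a[:position]) + list(b[position:])
    child_b = list(b[:position]) + list(a[position:])
    return child_a, child_b


F1 = [0, 2, 3, 1, 0, 3, 1, 0]
Fn = [1, 0, 3, 1, 2, 0, 1, 3]
child_1, child_n = single_point_crossover(F1, Fn, position=3)
print(child_1)  # [0, 2, 3, 1, 2, 0, 1, 3]
print(child_n)  # [1, 0, 3, 1, 0, 3, 1, 0]
```

In practice the position would be drawn with a random number generator, once per selected pair.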
As a preferred embodiment, as shown in fig. 4, the crossover between different DNA sequences in this example includes:
step 1, randomly selecting a plurality of pairs of DNA sequences, for example, selecting N pairs of DNA sequences in the embodiment;
step 2, traversing and selecting a pair of DNA sequences from the plurality of pairs of DNA sequences as a current DNA sequence pair;
step 3, judging whether the distance D of the current DNA sequence pair is greater than a set value d; if so, jumping to step 1; otherwise, continuing with the next step; where the set value d is a preset inter-sequence distance threshold with a range of 0 to 1.7. The function expression for calculating the distance D of the current DNA sequence pair is:
Figure BDA0003806922120000081
in the above formula, X_i and X_j are the two DNA sequences of the current DNA sequence pair, p is the length of the DNA sequence, x_ik is the k-th base of DNA sequence X_i, and x_jk is the k-th base of DNA sequence X_j.
step 4, calculating the crossover probability Pm from the classification performance indexes and generating a random number Pr; if Pr < Pm, performing base transposition on the current DNA sequence pair; otherwise, performing base inversion on the current DNA sequence pair; where the function expression for calculating the crossover probability Pm from the classification performance indexes is:
Figure BDA0003806922120000082
in the above formula, K_1 and K_2 are constant parameters with a value range of [0, p], where p is the dimension of the flow features; f_max is the optimal classification performance index over all DNA sequences in the progeny population S, f' is the average classification performance index of the current DNA sequence pair after crossover, and f_avg is the average classification performance index over all DNA sequences in the progeny population S; f' is obtained by decoding the bases of the two DNA sequences produced by crossing the current DNA sequence pair into flow features to obtain two feature subsets, performing attack classification detection based on those feature subsets, and averaging the classification performance indexes of the two feature subsets;
step 5, judging whether the DNA sequence is completely traversed or not, and jumping to the step 2 if the DNA sequence is not completely traversed; otherwise, judging that the cross among different DNA sequences is finished.
With the function expression that calculates the crossover probability Pm from the classification performance indexes, Pm automatically increases when the performance indexes of the DNA population are more concentrated or at a local optimum, and automatically decreases otherwise. Because Pm is calculated from the classification performance indexes, it changes automatically as the performance indexes change, and the resulting variation is more effective. If Pm = 3, the crossover of DNA sequence F_1 with DNA sequence F_n can be represented as:
Figure BDA0003806922120000091
Mutation is a biochemical reaction inside an individual DNA sequence. First, DNA sequences in the DNA population are selected according to the corresponding mutation probability Pc, and then bases in the selected DNA sequences are randomly chosen for mutation. Base mutations in a DNA sequence are substitutions, deletions and insertions of bases. There are two main types of base substitution: one is transition between bases of the same type, such as A replaced by G or T replaced by C; the other is substitution between bases of different types, such as A replaced by T.
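The two kinds of base substitution can be sketched in Python as follows. The sketch assumes the numeric coding A = 0, T = 1, C = 2, G = 3 used in the feature selection step of this description; the names `TRANSITION` and `substitute_base` are hypothetical, and deletion and insertion are left out because they would change the fixed sequence length p.

```python
import random
from typing import List, Sequence

# Assumed coding: A = 0, T = 1, C = 2, G = 3.
# Transition between bases of the same type: A <-> G and T <-> C.
TRANSITION = {0: 3, 3: 0, 1: 2, 2: 1}


def substitute_base(
    dna: Sequence[int], position: int, rng: random.Random, transition: bool = True
) -> List[int]:
    """Return a copy of `dna` with the base at `position` substituted."""
    old = dna[position]
    if transition:
        new = TRANSITION[old]
    else:
        # Substitution by a base of the other type (for example A replaced by T).
        new = rng.choice([b for b in range(4) if b not in (old, TRANSITION[old])])
    mutated = list(dna)
    mutated[position] = new
    return mutated


rng = random.Random(0)
print(substitute_base([0, 1, 2, 3], position=0, rng=rng))  # [3, 1, 2, 3]
```

Returning a copy keeps the parent sequence intact, which is convenient when parent and progeny populations are later merged.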
In this embodiment, mutation within the same DNA sequence includes: first selecting a DNA sequence according to the mutation probability Pc, and then randomly selecting bases in the selected DNA sequence for base mutation, where base mutation comprises base substitution, deletion and insertion, and base substitution comprises transition between bases of the same type and substitution between bases of different types; the function expression for calculating the mutation probability Pc is:
Figure BDA0003806922120000092
in the above formula, K_3 and K_4 are constant parameters with a value range of [0, p], where p is the dimension of the flow features; f_max is the optimal classification performance index over all DNA sequences in the progeny population S, f is the classification performance index of the selected DNA sequence, and f_avg is the average classification performance index over all DNA sequences in the progeny population S. In this embodiment, the classification performance index is introduced into the calculation of the mutation probability Pc, so that Pc changes automatically as the classification performance index changes. When the performance indexes of the DNA population are more concentrated or at a local optimum, the mutation probability Pc automatically increases; otherwise it automatically decreases.
Reverse order is also a biochemical reaction inside an individual DNA sequence; its aim is to search for a better evolutionary order of the bases. First, several DNA sequences are randomly selected from the DNA population with probability PI, and then two positions are randomly selected in each chosen DNA sequence and the base order is reversed. In this embodiment, reverse order within the same DNA sequence includes: first selecting a DNA sequence according to the reverse order probability PI, and then randomly selecting two positions in the selected DNA sequence and reversing the base order, where the function expression for calculating the reverse order probability PI is:
Figure BDA0003806922120000093
in the above formula, K_5 is a constant parameter with a value range of [0, p], where p is the dimension of the flow features; f_max is the optimal classification performance index in the progeny population S, and f_i is the classification performance index of the selected DNA sequence. In this embodiment, the classification performance index is introduced into the calculation of the reverse order probability PI, so that PI changes automatically as the classification performance index changes. When the performance indexes of the DNA population are more concentrated or at a local optimum, the reverse order probability PI automatically increases; otherwise it automatically decreases.
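As an illustration of the reverse order operation, the following Python sketch reverses the bases between two chosen positions. It assumes that "reversing the base order" applies to the segment bounded by the two positions (both ends inclusive), which is the usual reading of an inversion operator; the function name `invert_segment` is hypothetical, and the positions are given explicitly rather than drawn at random.

```python
from typing import List, Sequence


def invert_segment(dna: Sequence[int], i: int, j: int) -> List[int]:
    """Return a copy of `dna` with the bases between positions i and j (0-based, inclusive) reversed."""
    lo, hi = sorted((i, j))
    if lo < 0 or hi >= len(dna):
        raise IndexError("positions must lie inside the DNA sequence")
    return list(dna[:lo]) + list(reversed(dna[lo:hi + 1])) + list(dna[hi + 1:])


F1 = [0, 2, 3, 1, 0, 3, 1, 0]
print(invert_segment(F1, 1, 4))  # [0, 0, 1, 3, 2, 3, 1, 0]
```

The operation keeps the multiset of bases unchanged and only alters their order, so the number of selected features is preserved while their assignment to feature positions changes.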
The fitness of a DNA sequence is a measure of whether the DNA sequence has a survival advantage in the DNA population. This embodiment uses the classification performance index as the fitness of the DNA sequence, and the classification performance index in step S4 includes the actual detection accuracy, the imbalance index σ and the optimal feature index Γ (alternatively, only some of these may be adopted as required). On top of the actual detection accuracy, the imbalance index σ and the optimal feature index Γ make the strengths and weaknesses of the method in terms of overall accuracy, feature subset optimization and detection balance directly visible. Moreover, with three classification performance indexes, this embodiment forms a multi-objective feature selection algorithm. Compared with conventional multi-objective search schemes, the scheme of this embodiment can not only distinguish normal traffic from abnormal traffic but also distinguish different types of abnormal traffic among network anomalies.
Since the evolutionary search over DNA makes essentially no use of external information and is guided only by fitness, the classification performance index plays a decisive role in the evolution of the DNA population. In order to obtain the most accurate fitness quickly and effectively, the attack classification detection based on the feature subset in this embodiment uses the K-nearest-neighbor (KNN) classification method as the primary method, assisted by hyper-parameter optimization, and the classification performance index of the feature subset is calculated as the fitness of each DNA sequence.
As shown in fig. 5, the flow dataset is first preprocessed so that it is suitable for training the KNN-based fitness calculation model. Second, the corresponding feature subset is selected according to the base values of the DNA sequence and fed into the fitness calculation model for training. The fitness calculation model is then optimized with a hyper-parameter optimization method such as random search to obtain the fitness of the DNA sequence. The specific process comprises the following steps:
step 1, data preprocessing.
Raw flow data is largely dirty data that is incomplete, inconsistent and heavily contaminated by noise, and it must be preprocessed before it can be used for training and detection. To improve data quality, this step prepares the training and detection data with the following two operations:
(1) Max-min normalization: even after cleaning, the features in the dataset have different value ranges and often different distributions, so the features cannot be compared with one another directly. To make the features comparable, normalization converts all features to values in a consistent range, so that every feature has a consistent weight in the detection model, which improves the quality of the dataset and the convergence speed of model training. Normalization is in essence a linear transformation; it does not substantially change the features and it retains the original information of the DNA sequence. In this embodiment, the specific normalization function expression is:
Figure BDA0003806922120000101
in the above formula, Y'_ij is the normalized value of Y_ij, an integer in the range 0 to N; Y_ij is the j-th flow feature of the i-th sample; min(Y_j) is the minimum value of the j-th flow feature Y_j; max(Y_j) is the maximum value of the j-th flow feature Y_j; and round denotes rounding.
(2) Removing fuzzy values: in the normalized detection dataset, the correspondence between some sets of feature values and their labels is erroneous or fuzzy. For example, in S_1 the values of some features are illegal numbers, and in S_2 the attack label corresponding to the set of flow feature values is null; datasets containing either of these two kinds of correspondence are error sets. If, as with S_3 and S_4, the same set of feature values corresponds to two or more attack labels, the corresponding dataset is called a fuzzy set. Both kinds of dataset reduce the robustness of the detection system. Therefore, after the data are normalized, the error sets and fuzzy sets are deleted, leaving only the deterministic dataset in which feature value sets and labels correspond one to one, which improves the robustness of the attack detection system.
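The two preprocessing operations above can be sketched in Python as follows. The normalization formula itself is only available as a figure, so the form used here, round(N * (Y - min) / (max - min)), is an assumption chosen to agree with the symbol definitions given above (integer output in the range 0 to N). The names `min_max_normalize` and `drop_error_and_fuzzy` are hypothetical, and "illegal number" is interpreted here as a non-finite value.

```python
import math
from typing import Dict, Hashable, List, Optional, Sequence, Set, Tuple

Sample = Tuple[Tuple[float, ...], Optional[Hashable]]


def min_max_normalize(column: Sequence[float], levels: int) -> List[int]:
    """Map one feature column to integers in 0..levels (assumed form of the formula)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)  # constant feature: no spread to normalize
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [round(levels * (y - lo) / (hi - lo)) for y in column]


def drop_error_and_fuzzy(samples: Sequence[Sample]) -> List[Sample]:
    """Remove error sets (non-finite feature or null label) and fuzzy sets."""
    valid = [
        (features, label)
        for features, label in samples
        if label is not None and all(math.isfinite(v) for v in features)
    ]
    labels_seen: Dict[Tuple[float, ...], Set[Hashable]] = {}
    for features, label in valid:
        labels_seen.setdefault(features, set()).add(label)
    # Keep only feature value sets that map to exactly one label.
    return [(f, lab) for f, lab in valid if len(labels_seen[f]) == 1]


print(min_max_normalize([2.0, 4.0, 10.0], levels=4))  # [0, 1, 4]
rows = [
    ((0, 1), "DoS"),
    ((0, 1), "Normal"),          # same features, different label: fuzzy set
    ((2, 3), "DoS"),
    ((1, float("nan")), "DoS"),  # illegal number: error set
    ((4, 4), None),              # null label: error set
]
print(drop_error_and_fuzzy(rows))  # [((2, 3), 'DoS')]
```

Rows whose features repeat with the same label are kept, since their correspondence to the label is still unambiguous.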
step 2, feature selection.
In order to select a suitable feature subset for training, feature screening is performed based on the following rules.
Assume that the DNA population contains n DNA sequences and that the number of flow features to be detected is p, and let
Figure BDA0003806922120000111
denote the i-th DNA sequence of generation t, i ∈ {1, 2, 3, ...}. Each DNA sequence is represented by a p-digit quaternary number (A = 0, T = 1, C = 2, G = 3). Thus each individual
Figure BDA0003806922120000112
with B ∈ {0, 1, 2, 3}. Then, for
Figure BDA0003806922120000113
the selected features are given by:
γ_i = ∑{k | mod(B_k, 2) = 1},
in the above formula, γ_i denotes, for
Figure BDA0003806922120000114
the feature subset consisting of the selected features, mod is the remainder function, and k ∈ {1, 2, ...}.
step 3, optimizing the hyper-parameters of the classifier.
Parameters that can be adjusted during the training of algorithms such as machine learning are generally called hyper-parameters, and the choice of their values has a large influence on the detection performance of the model. Hyper-parameter tuning can be viewed as a "learning" problem, namely learning a model G such that loss = G(λ), where loss is the optimal value of the loss function obtainable after training the model with a given hyper-parameter λ. With such a model, the optimal loss value corresponding to any hyper-parameter λ can be predicted, and the value of λ that yields the best loss can then be found directly. The formula for solving for the optimal hyper-parameter λ is:
λ^(*) = argmin_{λ∈Λ} E_{x~G_x}[L(x; A_λ(X_train))],
in the above formula, λ^(*) is the solution for the hyper-parameter λ, Λ is the feasible range of the hyper-parameter λ, E_{x~G_x}[L(x; A_λ(X_train))] is the generalization error, E_x denotes the expected value of the generalization error, G_x is the distribution function, L is the loss function, x is the training data, A is the function to be optimized, A_λ is the optimized function with the hyper-parameter λ selected, and X_train is the training dataset.
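Random search, one of the hyper-parameter optimization methods mentioned in this description, approximates the argmin above by sampling candidate values of λ and keeping the one with the lowest observed loss. The Python sketch below is illustrative only: `random_search` and `toy_loss` are hypothetical names, and the toy loss merely stands in for the validation error of a classifier (for example KNN with k neighbours), which is not implemented here.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_search(
    loss: Callable[[T], float], candidates: Sequence[T], n_trials: int, seed: int = 0
) -> Tuple[T, float]:
    """Sample hyper-parameter values at random and return the best value and its loss."""
    if n_trials < 1 or not candidates:
        raise ValueError("need at least one trial and one candidate")
    rng = random.Random(seed)
    best_value = rng.choice(candidates)
    best_loss = loss(best_value)
    for _ in range(n_trials - 1):
        value = rng.choice(candidates)
        current = loss(value)
        if current < best_loss:
            best_value, best_loss = value, current
    return best_value, best_loss


def toy_loss(k: int) -> float:
    """Stand-in for a validation loss; assumed to be smallest at k = 5."""
    return float((k - 5) ** 2)


best_k, best = random_search(toy_loss, candidates=list(range(1, 11)), n_trials=50)
print(best_k, best)
```

With a real classifier, `loss` would train on X_train with the given hyper-parameter and return the error measured on held-out data.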
In this embodiment, when performing attack classification detection based on the feature subset, the function expression of the adopted classifier is as follows:
Figure BDA0003806922120000115
in the above formula,
Figure BDA0003806922120000116
is the classification prediction label produced by the classifier, and f is the mapping, established by the classifier, from the flow feature set X to the classification prediction label
Figure BDA0003806922120000117
The function expression for calculating the actual detection accuracy is:
Figure BDA0003806922120000118
in the above formula, ACC_i is the actual detection accuracy of the i-th attack type in attack classification detection, Y is the true class corresponding to the flow feature set X, and n is the number of samples;
the computational function expression of the imbalance index σ is:
Figure BDA0003806922120000121
in the above formula, n is the number of attack categories and E(i) is the detection accuracy expected for the i-th attack category in attack classification detection. The imbalance index σ measures how far the detection accuracy of each class deviates from its expected accuracy: the closer σ is to 1, the more serious the classification imbalance in the detection system; conversely, the less serious it is. The function expression for calculating the optimal feature index Γ is:
Figure BDA0003806922120000122
in the above formula, ACC_total is the actual detection accuracy over all attack types in attack classification detection, σ is the imbalance index,
Figure BDA0003806922120000123
as a subset of features
Figure BDA0003806922120000124
p is the dimension of the flow features. The optimal feature index Γ serves to reduce, as far as possible, the number of features that must be trained on and detected without degrading the performance of the detection system; the closer Γ is to 1, the worse the feature screening capability of the intrusion detection system, and conversely the better the feature screening. The screening function expression of the feature subset
Figure BDA0003806922120000128
is:
Figure BDA0003806922120000125
in the above formula, k(i) indicates whether flow feature i is selected: k(i) = 0 means that flow feature i is not selected; otherwise it is selected.
Step S6 in this embodiment includes: performing constraint-domination sorting of all DNA sequences based on fitness using the NSGA3 algorithm; and selecting a specified number of DNA sequences according to the sorting result, using the Niche-Preservation operation, to form a new parent population P.
Steps S4 to S6 are the population selection steps, and the population selection process is shown in fig. 6. First, the parent and progeny populations are merged into a new population Q, and it is judged whether all DNA sequences in the population Q have been sorted. If they have not been sorted, the DNA sequences are decoded to calculate fitness; next, reference points are generated according to the detection objectives, reference lines are drawn, and the associated reference points are found; finally, constraint-domination sorting of all DNA sequences is performed based on the fitness and the reference points, with the same sorting and selection scheme as the NSGA3 algorithm. If they have been sorted, the new-generation parent population is selected by the Niche-Preservation operation (see: Aguilar-Rivera A. A GPU fully vectorized approach to accelerate performance of NSGA-2 based on stochastic non-domination sorting and grid-crowding. Applied Soft Computing, 2020, 88, 106047.).
The NSGA3 algorithm uses the well-known systematic approach of Das and Dennis to determine the set of reference points used in each generation. This approach places reference points on a normalized hyperplane that is equally inclined to all objective axes and has an intercept on each axis. The reference points lie on an (M-1)-dimensional hyperplane, where M is the dimension of the objective space, that is, the number of detection classification performance indexes to be optimized. If each objective is divided into H parts, the number of reference points Cd is:
Figure BDA0003806922120000126
in the above formula,
Figure BDA0003806922120000127
denotes the number of combinations of H items chosen from H+M-1 items. For example, for a problem with M = 3 and H = 5, the reference points form a triangle, and the formula yields 21 reference points. The reference points created by this method are therefore widely and uniformly distributed over the normalized hyperplane, as shown in FIG. 7. However, the method has the drawback that computation and storage overhead grow rapidly when the number of objectives is large. For example, when H and M are both 10, the formula yields 92378 reference points. To solve this problem, the reference point generation method of Deb and Jain can be adopted, whose main idea is to generate two layers of reference points on the hyperplane, an inner layer and an outer layer, thereby reducing the number of reference points while keeping them widely distributed.
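The two counts quoted above can be checked with a short Python sketch that also enumerates the Das and Dennis points by a stars-and-bars construction. The function name `das_dennis_points` is hypothetical; the two-layer method of Deb and Jain is not implemented here.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple


def das_dennis_points(n_objectives: int, divisions: int) -> List[Tuple[float, ...]]:
    """Enumerate all points whose coordinates are multiples of 1/divisions and sum to 1."""
    slots = divisions + n_objectives - 1
    points = []
    # Choose where the M-1 "bars" go; the gaps between bars give the coordinates.
    for bars in combinations(range(slots), n_objectives - 1):
        coords = []
        previous = -1
        for bar in bars:
            coords.append(bar - previous - 1)
            previous = bar
        coords.append(slots - previous - 1)
        points.append(tuple(c / divisions for c in coords))
    return points


print(len(das_dennis_points(3, 5)), comb(3 + 5 - 1, 5))  # 21 21
print(comb(10 + 10 - 1, 10))                             # 92378
```

The rapid growth of comb(H + M - 1, H) with M is what motivates the layered reference point generation mentioned above.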
The purpose of the reference points is to obtain a mapping relationship (that is, a vertical distance) between the population individuals and the corresponding reference points, so that the population evolves in the direction of the reference points and its distribution becomes more uniform. In FIG. 7, an "ideal" hyperplane has been created, and reference points are then created on that hyperplane. The hyperplane is the Pareto front to be found, so the population must evolve continuously toward the Pareto front, and this requires constructing the mapping relationship between the population individuals and the reference points.
To find the associated reference point, the vertical distance from each DNA sequence to each reference line is calculated first. The horizontal distance d_{j,1}(x) and the vertical distance d_{j,2}(x) from DNA sequence x to the j-th reference line L are calculated as:
Figure BDA0003806922120000131
Figure BDA0003806922120000132
in the above formula, Z_j is the coordinate vector of the reference line L, and (F(x))^T Z_j is the inner product of the performance index F(x) and the reference line vector Z_j. Association uses the vertical distance d_{j,2}(x) to find the reference point closest to DNA sequence x. The horizontal distance d_{j,1}(x) from the individual to the associated reference point then characterizes convergence, and the smaller its value, the better the convergence; the vertical distance d_{j,2}(x) from the DNA sequence to the associated reference point characterizes diversity, and the smaller its value, the better the diversity.
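Because the two distance formulas are only available as figures, the following Python sketch assumes the usual projection form used with reference lines: d_1 is the length of the projection of F(x) onto the direction Z_j, and d_2 is the distance from F(x) to that projection. The function name `reference_line_distances` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def reference_line_distances(f: Sequence[float], z: Sequence[float]) -> Tuple[float, float]:
    """Return (d1, d2): distance along the reference line and perpendicular distance to it."""
    norm_z = math.sqrt(sum(c * c for c in z))
    if norm_z == 0:
        raise ValueError("reference direction must be non-zero")
    # d1 = |F(x)^T Z_j| / ||Z_j||  (assumed form)
    d1 = abs(sum(fi * zi for fi, zi in zip(f, z))) / norm_z
    # Foot of the perpendicular from F(x) onto the reference line.
    foot = [d1 * zi / norm_z for zi in z]
    # d2 = || F(x) - d1 * Z_j / ||Z_j|| ||  (assumed form)
    d2 = math.sqrt(sum((fi - pi) ** 2 for fi, pi in zip(f, foot)))
    return d1, d2


print(reference_line_distances((3.0, 4.0), (1.0, 0.0)))  # (3.0, 4.0)
```

Under this assumption, associating a DNA sequence with a reference point amounts to taking the reference line with the smallest d_2.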
The constraint-domination relationship is defined as follows: for J reference lines, j = 1, 2, ..., J, and any two given DNA sequences X_a and X_b, X_a is considered to dominate X_b if the following two conditions are both satisfied:
(I) For
Figure BDA0003806922120000133
d_{j,1}(X_a) ≤ d_{j,1}(X_b) and d_{j,2}(X_a) ≤ d_{j,2}(X_b) both hold.
(II)
Figure BDA0003806922120000134
such that d_{j,1}(X_a) < d_{j,1}(X_b) or d_{j,2}(X_a) < d_{j,2}(X_b) holds.
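The two conditions can be sketched as a Python predicate. The sketch assumes that the quantifiers hidden in the figures are "for every reference line j" in condition (I) and "there exists a reference line j" in condition (II); the function name `dominates` and the representation of each DNA sequence as a list of (d1, d2) pairs, one per reference line, are hypothetical.

```python
from typing import Sequence, Tuple

Distances = Sequence[Tuple[float, float]]  # one (d_j1, d_j2) pair per reference line j


def dominates(a: Distances, b: Distances) -> bool:
    """True if sequence a dominates sequence b under conditions (I) and (II)."""
    if len(a) != len(b):
        raise ValueError("both sequences need distances to the same J reference lines")
    # (I): a is no worse than b on every reference line.
    no_worse = all(a1 <= b1 and a2 <= b2 for (a1, a2), (b1, b2) in zip(a, b))
    # (II): a is strictly better than b on at least one distance.
    strictly_better = any(a1 < b1 or a2 < b2 for (a1, a2), (b1, b2) in zip(a, b))
    return no_worse and strictly_better


x_a = [(1.0, 0.5), (2.0, 1.0)]
x_b = [(1.0, 0.7), (2.0, 1.0)]
print(dominates(x_a, x_b), dominates(x_b, x_a))  # True False
```

Two sequences with identical distances do not dominate each other, which keeps them in the same non-dominated layer.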
The reference-point-based constraint-domination relationship uses the reference points to divide the objective space into several sub-regions, locates the sub-region in which an infeasible solution lies from its objective function values, and judges the sparsity of that sub-region with a niche mechanism. The objective function values of an infeasible solution indicate its position in the objective space, and the distribution of positions plays an important role in population diversity, so infeasible solutions in sparse regions need to be selected to improve the diversity of the population.
Constraint-domination sorting of all DNA sequences based on fitness using the NSGA3 algorithm includes: first, according to the constraint-domination relationship, DNA sequences from the top-ranked layers are placed into the next-generation population P_{t+1} until the number of individuals in P_{t+1} is greater than or equal to n. Suppose F_l is the last non-dominated layer, the one at which the number of individuals in P_{t+1} first reaches or exceeds n; {F_1, F_2, ..., F_{l-1}} are the preceding non-dominated layers, containing n_1 individuals in total, and layer F_l contains n_2 individuals. If n_1 + n_2 = n, selection stops and the population P_{t+1} = {F_1, F_2, ..., F_l}; otherwise n - n_1 DNA sequences are selected from layer F_l so that the size of P_{t+1} equals n.
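The layer-by-layer filling described above can be sketched in Python as follows. This is a simplified illustration: the function name `fill_next_generation` is hypothetical, the layers are assumed to be already sorted, and the choice of individuals from the last layer is a plain truncation that only stands in for the Niche-Preservation operation, which is not implemented here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fill_next_generation(layers: Sequence[Sequence[T]], n: int) -> List[T]:
    """Fill the next generation layer by layer until it holds exactly n individuals."""
    selected: List[T] = []
    for layer in layers:
        if len(selected) + len(layer) <= n:
            selected.extend(layer)  # the whole layer fits
            if len(selected) == n:
                break
        else:
            remaining = n - len(selected)
            # Placeholder for Niche-Preservation: take the first `remaining` individuals.
            selected.extend(layer[:remaining])
            break
    return selected


layers = [["a", "b"], ["c", "d", "e"], ["f"]]
print(fill_next_generation(layers, 4))  # ['a', 'b', 'c', 'd']
print(fill_next_generation(layers, 5))  # ['a', 'b', 'c', 'd', 'e']
```

In the first call the preceding layers contribute 2 individuals and the last layer contributes the remaining 2; in the second call the last layer fits exactly and selection stops.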
The computational complexity of the intelligent attack detection method based on DNA calculation of this embodiment is as follows. Consider a population of size 2N with an M-dimensional objective vector, and assume that the number of flow samples is Z and the feature dimension is P. Each of the DNA biochemical reactions (crossover, mutation, reverse order and the like) costs at most O(N), so the DNA biochemical reactions require O(N) operations. For the classification performance index, data preprocessing requires O(Z) operations and feature selection requires O(NP) operations, so calculating the classification performance index requires O(NZP) operations. Furthermore, non-dominated sorting requires O(N log^(M-2) N) operations, and reference point generation and association require O(MNH) operations, so the cost of population selection is the larger of O(N^2 log^(M-2) N) and O(N^2 M). The computational complexity of the method is therefore the largest of O(NZP), O(N^2 log^(M-2) N) and O(N^2 M). In general, however, Z > N, so the computational complexity of ADDC is essentially determined by the O(NZP) cost of the classification performance index, and applying the reference-point-based non-dominated sorting multi-objective optimization algorithm adds little overhead to detection algorithms, such as machine learning, that rely on hyper-parameter optimization.
Network intrusion detection datasets can be divided into seven categories, including datasets based on network flow, datasets based on power grids and datasets based on internet traffic. The datasets used to verify the performance of the algorithm are mainly based on network traffic, and the commonly used ones are CICDDoS2019, DoHBrw-2020, NSL-KDD and UNSW-NB15. The software and hardware specifications of the experiment in this embodiment are listed in Table 1.
Table 1: software and hardware specification table of experiment.
Figure BDA0003806922120000141
Since the amount of data in each dataset is large enough, about 10% of the data is used as the test set: in this experiment, each of the 4 datasets is randomly divided into a training dataset and a test dataset at a ratio of 90:10.
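A random 90:10 split of sample indices can be sketched in Python as follows. The function name `split_indices` and the fixed seed are hypothetical choices made for reproducibility of the illustration; the description does not state how the random division was implemented.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int, test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into (training, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 90 10
```

For strongly imbalanced datasets, a split performed separately within each class would keep rare attack types represented in both parts; that variant is not shown here.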
The preset parameter values used for each dataset during training are shown in Table 2, where the iteration count is the number of iterations of the algorithm and the population size is the number of DNA sequences in the initialized parent population.
Table 2: a parameter value setting table.
Figure BDA0003806922120000142
Figure BDA0003806922120000151
In this experiment, the method of this embodiment (ADDC) is compared with the following existing methods: RS-KNN-CFS, TPE-KNN-CFS, anneal-KNN-CFS, RS-KNN-IGFS, anneal-KNN-IGFS, TPE-KNN-IGFS, RS-RF-CFS, anneal-RF-CFS, TPE-RF-CFS, RS-RF-IGFS, anneal-RF-IGFS, TPE-RF-IGFS, GMGWO-ECAE, and BRS. The performance of each detection method is evaluated by comparing two-class and multi-class accuracy on the CICDDoS2019, DoHBrw-2020, NSL-KDD and UNSW-NB15 datasets, as shown in tables 3-111. The two-class task distinguishes a normal type and an abnormal type, where the abnormal type treats all attack types as one class of data. The multi-class task includes the normal type and the individual attack types; for example, the 4-class task on the DoHBrw-2020 dataset includes 3 attack types, with classes such as normal, Non-DoH and Malicious-DoH.
Experiments prove that: (1) In two-class detection, the method of this embodiment (ADDC) has better overall performance and maintains a high detection accuracy for normal data. For example, for normal data, ADDC reaches a detection accuracy of 100% on the CICDDoS2019 dataset, where detection performance is best, and on the UNSW-NB15 dataset, where detection performance is worst, it improves on the best of the other ADSs (intrusion detection systems) by more than 10%. On the other datasets the accuracy stays above 99.5%, an improvement over all the other algorithms. For abnormal data, because the DoHBrw-2020 dataset contains little abnormal data (normal data: abnormal data is 141696. ADDC essentially maintains a detection accuracy above 50%, which shows that its detection balance is stronger than that of the other ADSs and that it can detect attacks the other ADSs cannot. In terms of overall performance, the detection accuracy of ADDC exceeds 99.4%, better than the other ADSs. (2) In multi-class detection, the detection performance of some IGFS-based IDSs drops markedly; in particular, when the CICDDoS2019 dataset contains 11 attacks, their accuracy even falls below 30%. ADDC remains essentially stable: its overall accuracy is above 90% everywhere except on the DoHBrw-2020 dataset, where it is below 90% (still more than 3% higher than the best of the other ADSs). In the multi-class task on the UNSW-NB15 dataset, the class ratio Analysis : Backdoor : DoS : Exploits : Fuzzers : Generic : Normal : Reconnaissance : Shellcode : Worms is 397 : 319 : 2458 : 6724 : 3572 : 8195 : 11268 : 2177 : 219 : 30. The detection accuracy of the other IDSs on the 5 classes Analysis, Backdoor, Reconnaissance, Shellcode and Worms, with fewer than 400 instances (a share below 1%), is very low or even 0, whereas ADDC keeps a high detection accuracy on 3 of them (Analysis, Reconnaissance and Shellcode), and its overall detection rate is more than 10% higher than the best of the other IDSs, which shows that its detection performance and balance are superior to those of the other IDSs. In summary, in both two-class and multi-class classification, an ADS based on ADDC improves the overall detection accuracy of the system and keeps a degree of balance on imbalanced datasets in which some types have few instances, so that most attack types with few instances are detected effectively. This is because DNA computing effectively enriches the population while avoiding premature convergence and escaping local optima, which improves the detection capability for attack types with few instances. The other algorithms, by contrast, overfit the classes with many instances, which leads to imbalanced detection performance.
Although the ADS proposed herein performs best when assessed by indicators such as overall accuracy and the detection rates of normal and abnormal data, these indicators are not reliable for classification problems with unbalanced training and test datasets. Therefore, in this experiment, this embodiment evaluates the balance performance of each ADS on the four large datasets by the imbalance index σ and the optimal characteristic index Γ. Where γ is a subset of features
Figure BDA0003806922120000161
is the imbalance index, and Γ is the optimal characteristic index. Because Γ is small on some datasets, Γ is multiplied by 100. Experiments show that, on the DoHBrw-2020 dataset, the imbalance index of the method (ADDC) of this embodiment is reduced by more than 50% compared with the best other ADS. The method (ADDC) of this embodiment therefore maintains a strong balance of performance in both binary and multi-class classification. In binary classification, except in the NSL_KDD dataset and the GMGWO algorithm, where the number of features required by the method (ADDC) of this embodiment is greater than that of some ADSs, it requires fewer features than the other ADSs in the remaining datasets. The optimal characteristic index of the method (ADDC) of this embodiment, Γ < 0.001, is lower than that of the other IDSs on all datasets; in particular, Γ = 0 on the CICDDoS2019 and DoHBrw-2020 datasets. In multi-class classification, the optimal characteristic index Γ of the method (ADDC) of this embodiment stays at a low level and is a large improvement over the best other ADS. This shows that, although the method (ADDC) of this embodiment selects only a few features, the association between these features and the labels is better than for the other ADSs, which ensures higher balance and accuracy.
Experimental results show that the method (ADDC) of this embodiment not only makes full use of DNA calculation to enrich the population and jump out of local optima in search of the global optimum, but also inherits the multi-objective optimization capability of the NSGA3 algorithm, so that attack types with few instances are taken into account during detection, and uses hyper-parameter optimization of the detection system to balance the detection of the various attack types at reasonable accuracy. The binary classification results show that the method (ADDC) of this embodiment maintains a high level of overall, normal and abnormal accuracy on all datasets. Although on some datasets its feature subset is larger than those screened by GMGWO and BRS, it is significantly stronger in accuracy, imbalance index and optimal feature subset index, so the method (ADDC) of this embodiment can provide a more reasonable feature set than the other detection methods. Analysis of the multi-class classification results reveals that existing detection methods cannot provide effective detection accuracy when the numbers of instances of the attack classes are highly uneven, whereas the method (ADDC) of this embodiment handles this case better and provides better balance and feature subsets. These results show that the model of the method (ADDC) of this embodiment is balanced and stable regardless of the size of the dataset. Compared with existing detection methods, the intelligent attack detection method based on DNA calculation improves the overall multi-class accuracy by up to about 10%, reduces the imbalance index by up to 0.5, detects 1.5 more attack types on average, and improves the optimal index of the feature subset by more than 83.8% at most.
In summary, the intelligent attack detection method based on DNA calculation in this embodiment mainly includes three steps: DNA calculation, classification performance index calculation, and population selection. DNA calculation consists of DNA coding and biochemical reactions: first, feature sets are randomly initialized as a parent population of DNA sequences, all of which are coded with the 4 types of DNA bases; new individuals and an offspring population are then formed through DNA biochemical reactions such as crossover, mutation and inversion. Classification performance index calculation first decodes each DNA sequence and selects the corresponding base set as a feature subset, then optimizes the detection model with a hyper-parameter optimization method, and finally calculates the accuracy of each category, the imbalance index and the optimal feature subset index, which together serve as the fitness of the DNA sequence. Population selection first merges the parent population and the offspring population, then sorts all DNA sequences by fitness using a reference-point-based non-dominated sorting method, and finally selects the parent population of the next generation according to the sorting result. The three steps are repeated until the preset number of iterations is reached or the fitness of the DNA sequences no longer changes.
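The three-step loop summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: `toy_fitness`, `react` and `evolve` are hypothetical names, the rule "bases 2 and 3 mean the feature is selected" is assumed, the fitness is a single made-up score instead of a trained detector, and selection sorts by that score instead of using reference-point non-dominated sorting.

```python
import random

def toy_fitness(seq):
    # Hypothetical stand-in for "decode, train a detector, score it":
    # features 0, 3 and 5 are pretended to be useful, every selected feature costs 0.1.
    selected = {i for i, base in enumerate(seq) if base >= 2}  # assumed decoding rule
    return len(selected & {0, 3, 5}) - 0.1 * len(selected)

def react(seq, rng):
    # Stand-in for the biochemical reactions: one point mutation, then one segment reversal.
    child = list(seq)
    child[rng.randrange(len(child))] = rng.randrange(4)
    i, j = sorted(rng.sample(range(len(child)), 2))
    child[i:j + 1] = child[i:j + 1][::-1]
    return child

def evolve(n_features, pop_size, gen_max, fitness, seed=0):
    rng = random.Random(seed)
    # DNA coding: each individual is a fixed-length list of base codes 0..3.
    parents = [[rng.randrange(4) for _ in range(n_features)] for _ in range(pop_size)]
    previous = None
    for _ in range(gen_max):
        offspring = [react(rng.choice(parents), rng) for _ in range(pop_size)]
        # Population selection: merge, rank by fitness, keep the best pop_size.
        merged = sorted(parents + offspring, key=fitness, reverse=True)
        parents = merged[:pop_size]
        scores = [fitness(s) for s in parents]
        if scores == previous:  # fitness no longer changes
            break
        previous = scores
    return parents[0]

best = evolve(n_features=8, pop_size=6, gen_max=30, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because the merged population always contains the current parents, the best fitness in this sketch never decreases from one generation to the next.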
In addition, the present embodiment also provides an intelligent attack detection system based on DNA calculation, which includes a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned intelligent attack detection method based on DNA calculation. Furthermore, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured by a microprocessor to perform the steps of the aforementioned intelligent attack detection method based on DNA computation.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment, and all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and refinements made by those skilled in the art without departing from the principles of the present invention should also be considered as within the protection scope of the present invention.

Claims (10)

1. An intelligent attack detection method based on DNA calculation is characterized by comprising the following steps:
S1, encoding the flow characteristics into DNA sequences consisting of bases, and randomly initializing a plurality of DNA sequences to form a parent population P, wherein each individual in the parent population P is a DNA sequence, the DNA sequence consists of a fixed-length string of bases, and each base represents the state of one flow characteristic and takes a value in {0,1,2,3}; initializing the iteration count Gen;
S2, randomly selecting N DNA sequences from the parent population P as an offspring population S;
S3, carrying out DNA biochemical reaction on the DNA sequences in the offspring population S;
S4, combining the parent population P and the offspring population S to obtain a population Q;
S5, for each DNA sequence in the population Q, decoding the bases into flow characteristics to obtain a characteristic subset, and carrying out attack classification detection based on the characteristic subset to obtain a corresponding classification performance index as the fitness of the DNA sequence;
S6, sorting all DNA sequences in the population Q according to the fitness based on a non-dominated sorting algorithm, and selecting a specified number of DNA sequences to form a new parent population P according to the sorting result;
S7, judging whether the iteration count Gen is equal to the set maximum iteration count GenMax or whether the fitness of each DNA sequence no longer changes; if neither holds, adding 1 to the iteration count Gen and jumping to step S2; otherwise, selecting an optimal DNA sequence from the new parent population P according to the fitness, and decoding the optimal DNA sequence into flow characteristics serving as the optimal characteristic subset.
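As an illustration of the encoding in step S1 and the decoding in steps S5 and S7, the following minimal sketch represents a DNA sequence as a list of base codes in {0,1,2,3}, one per flow characteristic. The feature names, the function names, and the rule that states 2 and 3 mean "selected" are all assumptions made for this example; the claim itself does not fix them.

```python
import random

# Hypothetical flow characteristic names, used only for illustration.
FEATURES = ["duration", "src_bytes", "dst_bytes", "flag", "count"]

def random_dna(n_features, rng):
    """Randomly initialize one fixed-length DNA sequence of base codes 0..3."""
    return [rng.randrange(4) for _ in range(n_features)]

def decode(seq, names, selected_states=(2, 3)):
    """Decode a DNA sequence into a characteristic subset.

    Assumption: a base whose state is in selected_states marks its feature as selected.
    """
    if len(seq) != len(names):
        raise ValueError("sequence length must equal the number of flow characteristics")
    return [name for name, base in zip(names, seq) if base in selected_states]

rng = random.Random(42)
population = [random_dna(len(FEATURES), rng) for _ in range(4)]  # parent population P
for individual in population:
    print(individual, decode(individual, FEATURES))
```

Keeping the sequence length equal to the feature dimension makes decoding a position-by-position lookup.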
2. The intelligent attack detection method based on DNA calculation according to claim 1, wherein the step S3 comprises: first, for each DNA sequence in the offspring population S, decoding the bases into flow characteristics to obtain a characteristic subset, carrying out attack classification detection based on the characteristic subset, and calculating the classification performance index of the characteristic subset; and then performing the DNA biochemical reaction on the DNA sequences in the offspring population S based on the classification performance indexes, wherein the DNA biochemical reaction comprises part or all of: crossover between different DNA sequences, mutation within the same DNA sequence, and reverse ordering within the same DNA sequence.
3. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the crossover between different DNA sequences comprises:
step 1, randomly selecting a plurality of pairs of DNA sequences;
step 2, traversing the plurality of pairs of DNA sequences and selecting one pair as the current DNA sequence pair;
step 3, judging whether the distance d of the current DNA sequence pair is larger than a set value D, and if so, jumping to step 1; otherwise, continuing to step 4;
step 4, calculating the crossover probability Pm according to the classification performance indexes and generating a random number Pr; if Pr < Pm, performing base transposition on the current DNA sequence pair; otherwise, performing base inversion on the current DNA sequence pair; wherein the function expression for calculating the crossover probability Pm according to the classification performance index is as follows:
Figure FDA0003806922110000011
in the above formula, K_1 and K_2 are constant parameters with a value range of [0, P], where P is the dimension of the flow characteristics; f_max is the optimal classification performance index over all DNA sequences in the offspring population S, f' is the average classification performance index after the current DNA sequence pair is crossed, and f_avg is the average classification performance index over all DNA sequences in the offspring population S; wherein f' is obtained by decoding the bases of the two DNA sequences produced by crossing the current DNA sequence pair into flow characteristics to obtain two characteristic subsets, carrying out attack classification detection based on these characteristic subsets, and calculating the average of the classification performance indexes of the two characteristic subsets;
step 5, judging whether all pairs of DNA sequences have been traversed, and jumping to step 2 if not; otherwise, determining that the crossover between different DNA sequences is finished.
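The control flow of steps 2 to 5 can be sketched as below. Several points are assumptions made only for this example: the distance is taken to be the Hamming distance, "base transposition" is modelled as swapping the tails of the two sequences after a random cut, "base inversion" is modelled as reversing the same segment in both sequences, a pair that is too far apart is simply left unchanged instead of triggering a new random selection, and the crossover probability is passed in as a callable `pm_of_pair` because its expression is given only as a figure. All function names are hypothetical.

```python
import random

def hamming(a, b):
    """Assumed distance measure: number of positions where the bases differ."""
    return sum(x != y for x, y in zip(a, b))

def transpose(a, b, rng):
    # Assumed meaning of base transposition: exchange the tails after a random cut point.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def invert(a, b, rng):
    # Assumed meaning of base inversion: reverse the same random segment in both sequences.
    i, j = sorted(rng.sample(range(len(a)), 2))
    def flip(s):
        return s[:i] + s[i:j + 1][::-1] + s[j + 1:]
    return flip(a), flip(b)

def cross_pairs(pairs, max_distance, pm_of_pair, rng):
    """Apply steps 2 to 5 to already selected pairs of DNA sequences."""
    result = []
    for a, b in pairs:                      # step 2: traverse the pairs
        if hamming(a, b) > max_distance:    # step 3: distance check (simplified: skip pair)
            result.append((a, b))
            continue
        pm = pm_of_pair(a, b)               # step 4: crossover probability for this pair
        if rng.random() < pm:
            result.append(transpose(a, b, rng))
        else:
            result.append(invert(a, b, rng))
    return result                           # step 5: all pairs traversed

rng = random.Random(1)
pairs = [([0, 1, 2, 3, 0, 1], [0, 1, 2, 3, 1, 0]), ([0] * 6, [3] * 6)]
print(cross_pairs(pairs, max_distance=3, pm_of_pair=lambda a, b: 0.5, rng=rng))
```

Passing the probability as a callable keeps the sketch independent of the exact adaptive formula.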
4. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the mutation within the same DNA sequence comprises: first selecting a DNA sequence according to the mutation probability Pc, and then randomly selecting bases in the selected DNA sequence for base mutation, wherein the base mutation comprises base replacement, deletion and insertion, and the base replacement comprises transition between bases of the same type and mutual replacement between bases of different types; wherein the calculation function expression of the mutation probability Pc is:
Figure FDA0003806922110000021
in the above formula, K_3 and K_4 are constant parameters with a value range of [0, P], where P is the dimension of the flow characteristics; f_max is the optimal classification performance index over all DNA sequences in the offspring population S, f is the classification performance index of the selected DNA sequence, and f_avg is the average classification performance index over all DNA sequences in the offspring population S.
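The base mutations named in this claim (replacement within the same base type, replacement across base types, deletion and insertion) can be illustrated with the sketch below. The mapping of codes to base types (0 and 1 as one type, 2 and 3 as the other) and the way the fixed length is restored after a deletion or insertion are assumptions for this example only; the function names are hypothetical.

```python
import random

# Assumed coding: 0 and 1 form one base type (e.g. purines), 2 and 3 the other (pyrimidines).
TYPE_A, TYPE_B = (0, 1), (2, 3)

def transition(base):
    """Replace a base by the other base of the same type."""
    group = TYPE_A if base in TYPE_A else TYPE_B
    return group[1 - group.index(base)]

def transversion(base, rng):
    """Replace a base by a randomly chosen base of the other type."""
    return rng.choice(TYPE_B if base in TYPE_A else TYPE_A)

def mutate(seq, kind, rng):
    """Apply one base mutation at a random position and keep the sequence length fixed."""
    s = list(seq)
    k = rng.randrange(len(s))
    if kind == "transition":
        s[k] = transition(s[k])
    elif kind == "transversion":
        s[k] = transversion(s[k], rng)
    elif kind == "deletion":
        del s[k]
        s.append(rng.randrange(4))   # assumption: pad with a random base to restore length
    elif kind == "insertion":
        s.insert(k, rng.randrange(4))
        s.pop()                      # assumption: drop the last base to restore length
    else:
        raise ValueError(kind)
    return s

rng = random.Random(7)
seq = [0, 1, 2, 3, 0]
for kind in ("transition", "transversion", "deletion", "insertion"):
    print(kind, mutate(seq, kind, rng))
```

In a full loop, a sequence would be passed to `mutate` only when a random draw falls below its mutation probability Pc.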
5. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the reverse order inside the same DNA sequence comprises: firstly, selecting a DNA sequence according to a reverse sequence probability PI, and then randomly selecting two positions in the selected DNA sequence for reversing the base sequence, wherein a calculation function expression of the reverse sequence probability PI is as follows:
Figure FDA0003806922110000022
in the above formula, K_5 is a constant parameter with a value range of [0, P], where P is the dimension of the flow characteristics; f_max is the optimal classification performance index in the offspring population S, and f_i is the classification performance index of the selected DNA sequence.
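The reverse-order operation can be sketched as follows. The helper names are hypothetical, the segment is assumed to include both chosen positions, and the reverse-order probability is supplied by the caller as `pi_of` because its expression is given only as a figure.

```python
import random

def reverse_segment(seq, i, j):
    """Reverse the bases between positions i and j (both inclusive, by assumption)."""
    lo, hi = min(i, j), max(i, j)
    return seq[:lo] + seq[lo:hi + 1][::-1] + seq[hi + 1:]

def apply_reverse_order(population, pi_of, rng):
    """Reverse a random segment in each sequence selected with probability pi_of(seq)."""
    result = []
    for seq in population:
        if rng.random() < pi_of(seq):
            i, j = rng.sample(range(len(seq)), 2)  # two distinct random positions
            result.append(reverse_segment(seq, i, j))
        else:
            result.append(list(seq))
    return result

print(reverse_segment([0, 1, 2, 3, 0], 1, 3))
```

The operation only permutes bases, so the number of bases of each state in a sequence is unchanged.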
6. The intelligent attack detection method based on DNA calculation of claim 1, wherein the classification performance index in step S4 comprises part or all of actual detection accuracy, imbalance index σ and optimal characteristic index Γ.
7. The intelligent attack detection method based on DNA computing according to claim 6, characterized in that, when the attack classification detection is performed based on the feature subset, the function expression of the adopted classifier is as follows:
Figure FDA0003806922110000023
in the above formula,
Figure FDA0003806922110000024
represents the classification prediction result label obtained by the classifier, and f represents the mapping relationship established by the classifier from the flow characteristic set X to the classification prediction result label
Figure FDA0003806922110000025
the calculation function expression of the actual detection accuracy is as follows:
Figure FDA0003806922110000026
in the above formula, ACC_i is the actual detection accuracy of the i-th attack category in attack classification detection, Y represents the real classification corresponding to the flow characteristic set X, and n is the number of samples;
the calculation function expression of the imbalance index sigma is as follows:
Figure FDA0003806922110000031
in the above formula, n is the number of attack categories, and E(i) is the detection accuracy expected to be achieved for the i-th attack category in attack classification detection; the calculation function expression of the optimal characteristic index Γ is as follows:
Figure FDA0003806922110000032
in the above formula, ACC_total is the actual detection accuracy over all attack categories in attack classification detection, σ represents the imbalance index,
Figure FDA0003806922110000033
as a subset of features
Figure FDA0003806922110000034
P is the dimension of the flow characteristics; the feature subset
Figure FDA0003806922110000035
has the following screening function expression:
Figure FDA0003806922110000036
in the above formula, k(i) indicates whether flow characteristic i is selected: k(i) = 0 indicates that flow characteristic i is not selected, and any other value indicates that flow characteristic i is selected.
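Two of the quantities in this claim can be illustrated with a short sketch. Since the expressions themselves appear only as figures, the forms used here are assumptions: the per-category accuracy is taken to be the fraction of samples of a category that are predicted correctly, and the size of the feature subset is taken to be the number of flow characteristics with k(i) ≠ 0. The expressions for σ and Γ are not reproduced. Function names and labels are hypothetical.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Assumed form of ACC_i: correct predictions for category i / samples of category i."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: correct[category] / total[category] for category in total}

def subset_size(k):
    """Number of selected flow characteristics, i.e. entries with k(i) != 0."""
    return sum(1 for flag in k if flag != 0)

y_true = ["Normal", "Normal", "DoS", "Worms"]
y_pred = ["Normal", "DoS", "DoS", "Normal"]
print(per_class_accuracy(y_true, y_pred))
print(subset_size([0, 1, 0, 3]))
```

Reporting accuracy per category, not only overall, is what exposes categories with few instances that a detector never recognizes.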
8. The intelligent attack detection method based on DNA calculation according to claim 1, wherein the step S6 comprises: performing constrained-domination sorting of all DNA sequences based on fitness by adopting the NSGA3 algorithm; and selecting a specified number of DNA sequences to form a new parent population P according to the result of the constrained-domination sorting by adopting a Niche-Preservation operation.
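The sorting idea behind step S6 can be illustrated with a plain non-dominated sort. This sketch is a simplification and not NSGA3 itself: it assumes every objective is oriented so that larger is better (for example, an imbalance index would be negated), it ignores constraints, and it fills the last admitted front in index order where NSGA3 would use reference points and the Niche-Preservation operation. Function names are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_fronts(scores):
    """Partition indices of score vectors into successive Pareto fronts."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

def select(scores, k):
    """Pick k indices front by front (last front truncated in index order)."""
    chosen = []
    for front in non_dominated_fronts(scores):
        chosen.extend(front[: k - len(chosen)])
        if len(chosen) == k:
            break
    return chosen

scores = [(0.9, 0.1), (0.8, 0.8), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
print(non_dominated_fronts(scores))
print(select(scores, 4))
```

Ranking by fronts keeps sequences that trade one objective against another, which a single weighted score would tend to discard.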
9. A DNA computation based smart attack detection system comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor is programmed or configured to perform the steps of the DNA computation based smart attack detection method according to any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is adapted to be programmed or configured by a microprocessor to perform the steps of the DNA computation based smart attack detection method according to any one of claims 1 to 8.
CN202210999367.3A 2022-08-19 2022-08-19 Intelligent attack detection method and system based on DNA calculation Active CN115442087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210999367.3A CN115442087B (en) 2022-08-19 2022-08-19 Intelligent attack detection method and system based on DNA calculation

Publications (2)

Publication Number Publication Date
CN115442087A (en) 2022-12-06
CN115442087B (en) 2024-05-24

Family

ID=84243370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210999367.3A Active CN115442087B (en) 2022-08-19 2022-08-19 Intelligent attack detection method and system based on DNA calculation

Country Status (1)

Country Link
CN (1) CN115442087B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant