CN115442087A

CN115442087A - Intelligent attack detection method and system based on DNA calculation

Info

Publication number: CN115442087A
Application number: CN202210999367.3A
Authority: CN
Inventors: 曾增日; 赵宝康; 彭伟
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2022-08-19
Filing date: 2022-08-19
Publication date: 2022-12-06
Anticipated expiration: 2042-08-19
Also published as: CN115442087B

Abstract

The invention discloses an intelligent attack detection method and system based on DNA calculation. The method comprises: encoding flow characteristics into bases to form DNA sequences, randomly initializing a plurality of DNA sequences to form a parent population P, and randomly selecting N DNA sequences from the parent population P as a child population S; performing DNA biochemical reactions on the child population S and then merging it with the parent population P to obtain a population Q; calculating a classification performance index for each DNA sequence in the population Q as the fitness of that DNA sequence; sorting all DNA sequences in the population Q by fitness using a non-dominated sorting algorithm and selecting a new parent population P; and, if the iteration count Gen equals the set maximum iteration count GenMax or the fitness of each DNA sequence no longer changes, selecting the optimal DNA sequence from the parent population P according to fitness and decoding its bases into flow characteristics to obtain the optimal feature subset. The invention can improve classification accuracy to address the class imbalance problem, and can also reduce the feature dimension and screen out the optimal feature subset to improve the detection rate.

Description

Intelligent attack detection method and system based on DNA calculation

Technical Field

The invention relates to network security technology, and in particular to an intelligent attack detection method and system based on DNA calculation.

Background

In recent years, intrusion techniques such as DDoS attacks, APT attacks and data theft have emerged one after another, paralyzing networks and connected devices. Network security incidents occur frequently worldwide and have become a nightmare for personal, organizational and even national security. Establishing attack detection systems that effectively identify network attack behaviors has therefore become a priority in network security. However, with the explosive growth of network traffic, abnormal traffic has become complex, concealed and diverse, which causes traditional attack detection systems to fail. To cope with this trend, researchers have developed many attack detection methods based on rough set theory, machine learning and deep learning. For example, traditional detection methods based on rough set theory can still detect network attacks when the data are incomplete. Methods based on machine learning can optimize classifier performance by continuously learning and accumulating experience, and are a better solution than the traditional detection methods. Deep learning can learn the intrinsic rules and representation levels of the training data; the nonlinear network structure built from hidden layers is more efficient, builds models automatically according to the problem, and is not limited to a fixed problem.

At present, the latest detection methods target only the overall detection accuracy and feature dimension reduction, and ignore two problems: feature subset optimization and imbalanced detection. As network traffic grows, the feature information it contains also grows, but so do noise and redundant information. This noise and redundant information is intricately entangled with the useful information, which increases the computational complexity and time cost of attack detection. During feature dimension reduction, useful information may be deleted while part of the noise and redundant information is retained, which poses a serious challenge to attack detection technology. Meanwhile, samples of some attack types are difficult to collect, so existing detection data sets contain few instances of those types and are imbalanced. Conventional feature selection methods tend to favor the associations between features and labels that have many instances and to ignore the associations between features and labels that have few instances, which aggravates the detection imbalance.

In a data set with imbalanced classes, the majority classes consist of normal samples and frequent attack samples, and the minority classes consist of rare attack samples. A detection system trained on such a data set usually produces detection results with large errors for the minority attack categories, which seriously affects the robustness of the detection and classification system.

To address data imbalance, many studies change the distribution of class samples at the data level. The first data-level approach achieves class balance by artificially synthesizing minority-class samples. Typically, data augmentation methods such as interpolation, oversampling and encoders are used to balance the data set and achieve better experimental results. The second approach achieves class balance by deleting majority-class samples. Typically, Liu et al. propose a hard-set sampling technique that uses the K-means algorithm to compress the majority-class samples in the hard set, reducing the majority classes to achieve class balance. However, artificially synthesizing samples easily causes model overfitting, and deleting samples easily loses important information from the majority classes. To avoid these data-level problems, Bedi et al. change the classification algorithm instead, making the detection system focus on the minority classes at the algorithm level to achieve balanced detection, and propose Siam-IDS, a detection system based on a Siamese neural network. The system can detect R2L and U2R attacks, which have few instances in the NSL-KDD data set, without using conventional class balancing techniques. Although paying more attention to minority-class samples effectively avoids the overfitting and information loss of data-level methods, some classes still cannot be detected in multi-class problems with many minority classes, and the accuracy on the majority classes cannot be guaranteed.

DNA calculation is a new research field. In 1994, Professor Adleman successfully solved a 7-vertex directed Hamiltonian path problem using DNA molecules as the computing medium, which became a milestone in the development of DNA calculation. DNA calculation is a bionic optimization approach based on biological DNA encoding and evolution mechanisms. It is very effective for complex combinatorial optimization problems, and its greatest advantage is that it fully exploits the massive storage capacity of DNA molecules and the massive parallelism of biochemical reactions. Zang et al. (Zang W, Ren L, Zhang W, et al. A cloud model based DNA genetic algorithm for numerical optimization problems. Future Generation Computer Systems, 2018, 81: 465-477.) use DNA calculation to replace the 0/1 encoding of traditional evolutionary algorithms with a simulated DNA encoding, and their experiments show that DNA calculation offers rich population diversity and a high convergence rate. Jatoth et al. (Jatoth C, Gangadharan G R, Buyya R. Optimal fitness aware cloud service composition using an adaptive genotypes evolution based genetic algorithm. Future Generation Computer Systems, 2019, 94: 185-198.) propose completing the DNA calculation process with various biochemical reactions of biomolecules, representing the information carried by the system with DNA codes, and simulating various manipulations of DNA molecules to discover and process information while continuously acquiring and updating it during evolution. Shukla et al. (Shukla A, Pandey H M, Mehrotra D. Comparative review of selection techniques in genetic algorithm. 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE). IEEE, 2015: 515-519.) show that the DNA calculation method can not only fully exploit the ideas of DNA calculation, but also solve complex optimization problems that are widespread in engineering fields such as automatic control, pattern recognition, decision making and machine learning.

Optimization algorithms based on DNA calculation are a further development of genetic algorithms. They retain the characteristics of conventional genetic algorithms: they are intelligent, self-organizing and self-learning, they evolve an encoded population formed from the parameters, and they use random operations to guide the search toward the optimum. DNA calculation has several advantages over common genetic algorithms: (1) DNA encoding lets the population contain more information and express it more diversely, so that more complex knowledge can be expressed with shorter DNA strands. (2) By introducing molecular-level operations, DNA calculation greatly extends the set of optimization operators, so that the optimization algorithm converges faster and premature convergence can be avoided. (3) The variable chromosome length makes insertion and deletion of base sequences easier, which makes DNA calculation better suited to optimizing complex knowledge.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides an intelligent attack detection method and system based on DNA calculation. By expressing flow characteristics as DNA sequences and using DNA biochemical reactions to generate new flow characteristics, the invention enriches the diversity of the flow characteristic population. By using classification performance indexes as the fitness of each DNA sequence and sorting with a non-dominated sorting algorithm, it effectively addresses the premature convergence that non-dominated sorting and similar algorithms suffer during feature optimization, and accelerates convergence. The invention can therefore improve classification accuracy to address the imbalance problem, and can reduce the feature dimension and screen out the optimal subset to improve the detection rate.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

An intelligent attack detection method based on DNA calculation comprises the following steps:

S1, encoding flow characteristics into DNA sequences composed of bases, and randomly initializing a plurality of DNA sequences to form a parent population P, wherein each individual in the parent population P is a DNA sequence, each DNA sequence consists of a fixed number of bases, and each base represents the state of one flow characteristic and takes a value in {0,1,2,3}; initializing the iteration count Gen;

S2, randomly selecting N DNA sequences from the parent population P as a child population S;

S3, performing DNA biochemical reactions on the DNA sequences in the child population S;

S4, merging the parent population P and the child population S to obtain a population Q;

S5, for each DNA sequence in the population Q, decoding the bases into flow characteristics to obtain a feature subset, and performing attack classification detection based on the feature subset to obtain a corresponding classification performance index as the fitness of that DNA sequence;

S6, sorting all DNA sequences in the population Q by fitness using a non-dominated sorting algorithm, and selecting a specified number of DNA sequences according to the sorting result to form a new parent population P;

S7, judging whether the iteration count Gen equals the set maximum iteration count GenMax or whether the fitness of each DNA sequence no longer changes; if neither holds, adding 1 to the iteration count Gen and returning to step S2; otherwise, selecting the optimal DNA sequence from the new parent population P according to fitness, and decoding it into flow characteristics as the optimal feature subset.
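The loop formed by steps S1 to S7 can be sketched roughly as follows. This is a minimal illustration only, not the claimed implementation: `evolve`, `toy_fitness` and `toy_react` are hypothetical names, the fitness is a single scalar supplied by the caller, and a plain sort stands in for the non-dominated sorting of step S6.

```python
import random

def evolve(p, pop_size, n_children, gen_max, fitness, react, seed=0):
    """Sketch of S1-S7; `fitness` and `react` are caller-supplied stand-ins."""
    rng = random.Random(seed)
    # S1: random parent population P of length-p sequences over {0,1,2,3}
    parents = [[rng.randrange(4) for _ in range(p)] for _ in range(pop_size)]
    for _gen in range(gen_max):
        # S2: copy N randomly chosen parents into the child population S
        children = [list(s) for s in rng.sample(parents, n_children)]
        # S3: apply the (here user-defined) DNA biochemical reactions
        children = react(children, rng)
        # S4: merge parents and children into population Q
        merged = parents + children
        # S5 + S6: rank Q by fitness and keep the best pop_size sequences
        # (assumption: scalar sort instead of non-dominated sorting)
        merged.sort(key=fitness, reverse=True)
        new_parents = merged[:pop_size]
        # S7: stop early when the fitness values no longer change
        unchanged = sorted(map(fitness, new_parents)) == sorted(map(fitness, parents))
        parents = new_parents
        if unchanged:
            break
    return max(parents, key=fitness)

def toy_fitness(seq):
    # Hypothetical placeholder objective: number of even-valued bases.
    return sum(1 for b in seq if b % 2 == 0)

def toy_react(children, rng):
    # Hypothetical placeholder reaction: one random base substitution per child.
    for seq in children:
        seq[rng.randrange(len(seq))] = rng.randrange(4)
    return children

best = evolve(p=8, pop_size=6, n_children=3, gen_max=20,
              fitness=toy_fitness, react=toy_react)
```

In a real system, `fitness` would train and score a classifier on the decoded feature subset, which is by far the most expensive part of each generation.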

Optionally, step S3 includes: first, for each DNA sequence in the child population S, decoding the bases into flow characteristics to obtain a feature subset, performing attack classification detection based on the feature subset, and calculating the classification performance index of the feature subset; then performing DNA biochemical reactions on the DNA sequences in the child population S based on the classification performance indexes, wherein the DNA biochemical reactions include some or all of: crossover between different DNA sequences, variation within the same DNA sequence, and reverse order within the same DNA sequence.

Optionally, the crossover between different DNA sequences comprises:

Step 1, randomly selecting a plurality of pairs of DNA sequences;

Step 2, traversing the plurality of pairs of DNA sequences and selecting one pair as the current DNA sequence pair;

Step 3, judging whether the distance D of the current DNA sequence pair is greater than a set value d; if so, returning to step 1; otherwise, continuing to the next step;

Step 4, calculating the crossover probability Pm according to the classification performance indexes and generating a random number Pr; if Pr is less than Pm, performing base transposition on the current DNA sequence pair; otherwise, performing base inversion on the current DNA sequence pair; the function expression for calculating the crossover probability Pm according to the classification performance indexes is as follows:

In the above formula, $K_1$ and $K_2$ are constant parameters with value range $[0, p]$, where $p$ is the dimension of the flow characteristics; $f_{max}$ is the optimal classification performance index over all DNA sequences in the child population S; $f'$ is the average classification performance index of the current DNA sequence pair after crossover; and $f_{avg}$ is the average classification performance index over all DNA sequences in the child population S. Here $f'$ is obtained by decoding the bases of the two DNA sequences produced by crossing the current DNA sequence pair into flow characteristics to obtain two feature subsets, performing attack classification detection based on each feature subset, and averaging the classification performance indexes of the two feature subsets;

Step 5, judging whether all DNA sequence pairs have been traversed; if not, returning to step 2; otherwise, determining that the crossover between different DNA sequences is complete.

Optionally, the variation within the same DNA sequence includes: first selecting a DNA sequence according to the variation probability Pc, and then randomly selecting bases in the selected DNA sequence for base variation, wherein base variation includes base substitution, deletion and insertion, and base substitution includes transition between bases of the same type and mutual substitution between bases of different types; the calculation function expression of the variation probability Pc is:

In the above formula, $K_3$ and $K_4$ are constant parameters with value range $[0, p]$, where $p$ is the dimension of the flow characteristics; $f_{max}$ is the optimal classification performance index over all DNA sequences in the child population S; $f$ is the classification performance index of the selected DNA sequence; and $f_{avg}$ is the average classification performance index over all DNA sequences in the child population S.

Optionally, the reverse order within the same DNA sequence comprises: first selecting a DNA sequence according to a reverse-order probability PI, and then randomly selecting two positions in the selected DNA sequence and reversing the base order between them; the calculation function expression of the reverse-order probability PI is as follows:

In the above formula, $K_5$ is a constant parameter with value range $[0, p]$, where $p$ is the dimension of the flow characteristics; $f_{max}$ is the optimal classification performance index in the child population S; and $f_i$ is the classification performance index of the selected DNA sequence.
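The reverse-order operation itself (pick two positions, reverse the bases between them) can be sketched as follows. The function names are hypothetical, and the selection by the probability PI is left out because its formula is not reproduced in this text.

```python
import random

def reverse_segment(seq, i, j):
    """Return a copy of seq with the bases between positions i and j (inclusive) reversed."""
    lo, hi = sorted((i, j))
    return seq[:lo] + seq[lo:hi + 1][::-1] + seq[hi + 1:]

def random_reverse(seq, rng):
    """Choose two distinct positions at random and reverse the segment between them."""
    i, j = rng.sample(range(len(seq)), 2)
    return reverse_segment(seq, i, j)

example = reverse_segment([0, 2, 3, 1, 0, 3, 1, 0], 1, 4)
print(example)  # [0, 0, 1, 3, 2, 3, 1, 0]
```

The operation only reorders bases, so the multiset of base values in the sequence is unchanged.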

Optionally, the classification performance index in step S5 includes some or all of the actual detection accuracy, the imbalance index σ, and the optimal feature index Γ.

Optionally, when performing attack classification detection based on the feature subset, the function expression of the adopted classifier is:

$$\hat{Y} = f(X)$$

In the above formula, $\hat{Y}$ represents the classification prediction result labels obtained by the classifier, and $f$ represents the mapping established by the classifier from the flow characteristic set $X$ to the classification prediction result labels $\hat{Y}$. The calculation function expression of the actual detection accuracy is as follows:

In the above formula, $ACC_i$ is the actual detection accuracy of the i-th attack category in attack classification detection, $Y$ represents the true classes corresponding to the flow characteristic set $X$, and $n$ is the number of samples;

The calculation function expression of the imbalance index $\sigma$ is as follows:

In the above formula, $n$ is the number of attack categories and $E(i)$ is the detection accuracy expected for the i-th attack category during attack classification detection. The calculation function expression of the optimal feature index $\Gamma$ is as follows:

In the above formula, $ACC_{total}$ is the actual detection accuracy over all attack categories during attack classification detection, $\sigma$ represents the imbalance index, the feature subset is the set of selected flow characteristics, and $p$ is the dimension of the flow characteristics. The screening function expression of the feature subset is:

In the above formula, $k(i)$ indicates whether flow characteristic $i$ is selected: $k(i) = 0$ indicates that flow characteristic $i$ is not selected, and any other value indicates that it is selected.
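As a rough illustration of two of the quantities described above, the sketch below computes a per-category accuracy and applies the indicator k(i) to one sample. The formulas are not reproduced in this text, so reading the per-category accuracy as the fraction of samples of category i that are labelled correctly is an assumption, and both function names are hypothetical.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that the classifier labels correctly.

    Assumption: this reading of ACC_i is illustrative; the source formula is missing.
    """
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if t == p else 0)
    return {c: hits[c] / totals[c] for c in totals}

def select_features(row, k):
    """Keep the entries of one sample whose indicator k(i) is non-zero."""
    return [x for x, flag in zip(row, k) if flag != 0]

acc = per_class_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
subset = select_features([10, 20, 30, 40], [1, 0, 0, 1])
```

Reporting accuracy per category rather than as one overall number is what exposes poor detection of rare attack types, which an overall accuracy dominated by the majority classes would hide.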

Optionally, step S6 includes: performing constrained-domination sorting of all DNA sequences based on fitness using the NSGA-III algorithm; and selecting a specified number of DNA sequences according to the sorting result, using the niche-preservation operation, to form the new parent population P.
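The ranking idea behind non-dominated sorting can be sketched as below. This is a simple Pareto-front peeling under the assumption that every objective is maximised; it does not include the reference points or the niche-preservation step of NSGA-III, and the function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_fronts(fitness):
    """Group individual indices into successive Pareto fronts (front 0 is the best)."""
    remaining = set(range(len(fitness)))
    fronts = []
    while remaining:
        # An individual belongs to the current front if nothing remaining dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(fitness[j], fitness[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Each tuple is a hypothetical (accuracy, balance score) pair for one DNA sequence.
scores = [(0.9, 0.5), (0.8, 0.6), (0.7, 0.4), (0.9, 0.6)]
fronts = non_dominated_fronts(scores)
print(fronts)  # [[3], [0, 1], [2]]
```

The new parent population is then filled front by front; NSGA-III differs from this sketch mainly in how it chooses among members of the last front that only partly fits.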

In addition, the invention also provides an intelligent attack detection system based on DNA calculation, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the intelligent attack detection method based on DNA calculation.

Furthermore, the present invention also provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the steps of the intelligent attack detection method based on DNA calculation.

Compared with the prior art, the invention mainly has the following advantages:

1. By expressing the flow characteristic set as DNA sequences, the invention uses DNA biochemical reactions to generate new feature sets and can select the corresponding feature subset according to the feature states, thereby enriching the diversity of the flow characteristic subset population.

2. By using classification performance indexes as the fitness of each DNA sequence and sorting with a non-dominated sorting algorithm, the invention effectively addresses the premature convergence that non-dominated sorting and similar algorithms suffer during feature optimization and accelerates convergence, so that it can both improve the accuracy of each class to address the imbalance problem and reduce the feature dimension to screen out the optimal subset and improve the detection rate.

3. Massive network data bring a large amount of redundant information, and constantly changing network attack types make samples difficult to collect, so that the sample sizes of the attack types in a data set are seriously imbalanced. These two problems seriously reduce the robustness of existing detection methods. The invention uses DNA biochemical reactions to generate new flow characteristic subsets, can process massive flow characteristic data, and has the advantage of strong robustness.

Drawings

FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of DNA calculation in the embodiment of the present invention.

FIG. 3 is a flow chart of DNA biochemical reactions in an embodiment of the present invention.

FIG. 4 is a flow chart of crossover between DNA sequences in an embodiment of the present invention.

Fig. 5 is a flowchart of performing classification to obtain a classification performance index according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a population selection process in an embodiment of the present invention.

FIG. 7 is a diagram illustrating a distribution of reference points on a normalized hyperplane in an embodiment of the present invention.

Detailed Description

As shown in fig. 1, the intelligent attack detection method based on DNA computation of the present embodiment includes:

S5, for each DNA sequence in the population Q, decoding the bases into flow characteristics to obtain a feature subset, and performing attack classification detection based on the feature subset to obtain a corresponding classification performance index as the fitness of that DNA sequence;

To address the problem that the non-dominated sorting algorithm cannot evolve the optimal feature subset because of premature convergence, the intelligent attack detection method based on DNA calculation of this embodiment introduces DNA calculation to accelerate convergence, so that non-dominated sorting selection can escape local optima and reach the global optimum.

When the flow characteristics are encoded as bases in step S1, the DNA encoding principle is as follows: a DNA sequence is composed of nucleotides built from four types of bases: adenine (A), guanine (G), cytosine (C) and thymine (T) [37]. The different arrangements of these four types of bases make the genetic information that a DNA sequence can express very rich. Described mathematically, a single nucleotide X can be regarded as an integer between 0 and 3, i.e., X ∈ {0,1,2,3}, and the DNA sequence can be expressed as

DNA encoding is the key step of the whole algorithm: all subsequent calculation builds on the initial encoding, and the DNA sequence length and population size also determine the convergence speed and accuracy of the final solution.

Step S3 performs DNA biochemical reactions on the DNA sequences in the child population S. A DNA biochemical reaction is a biochemical reaction that occurs between DNA sequences or at the molecular level within an individual in a biochemical environment; in this embodiment it specifically refers to base exchange between DNA sequences, exchange of bases or subsequences within an individual DNA sequence, inversion of a subsequence, mutation of a base, and the like. Based on these principles, the following assumptions are made for a population composed of flow characteristics: (1) Each nucleotide represents one flow characteristic; the nucleotides A, T, C and G represent different states of the characteristic and carry the information of whether the characteristic is selected. The four bases make up a DNA sequence of fixed length p, where p is the dimension of the flow characteristics. (2) Every population consists of DNA sequences, and different biochemical operations can be performed on any DNA sequence. (3) Each DNA sequence has its own classification performance index value, which represents the survival and replication capacity of the DNA sequence.

As shown in fig. 2, the principle of the DNA biochemical reactions performed on the DNA sequences in the child population S in step S3 is as follows. First, the features are encoded as bases to form a population of DNA sequences. Each base carries the information of whether a feature is selected, so the DNA population can be viewed as a population of feature subsets. Second, the DNA population undergoes sequential biochemical reactions such as selection, crossover, variation and reverse order in a specific biochemical environment until it reaches a stable state. The DNA sequences are then decoded to obtain the solution of the problem, namely the optimal stable feature subset population.

For example, in the design stage of the DNA calculation algorithm, the problem to be solved is selecting an appropriate subset of flow characteristics, so the characteristics are encoded according to that problem. The core of the encoding is to preprocess every flow characteristic $f_i$ into a base that follows the {0,1,2,3} distribution. A population of DNA sequences is then formed: the p-dimensional characteristics are encoded into a p-dimensional DNA sequence according to the nucleotide encoding rule, for example $F_1 = \{0,2,3,1,0,3,1,0\}$, and the individuals are assembled into a DNA population of n individuals, $F_1 = \{0,2,3,1,0,3,1,0\}, \ldots, F_n = \{1,0,3,1,2,0,1,3\}$. In the biochemical reaction stage, bases are transformed between and within individuals by crossover, variation, reverse order and similar reactions, yielding post-reaction individuals and populations such as $F_1 = \{0,1,0,3,3,0,2,3\}, \ldots, F_n = \{1,0,1,1,2,1,1,3\}$. In the problem-solving stage, a screening rule such as $\mathrm{mod}(f_i, 2) = 0$ determines which features are selected; under this rule the features finally selected for $F_1$ are $f_1$, $f_3$, $f_6$ and $f_7$, which form a feature subset. The feature subset is then fed into the classification performance calculation model to compute a performance set, composed of the accuracy of each category, the imbalance index and the like, as the fitness of $F_1$. Finally, n (n < N) individuals are selected from the N individuals according to fitness to form the next-generation parent population.
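The decoding rule in the example above, where a feature is selected when its base satisfies mod(base, 2) = 0, can be checked with a short sketch. The function name `decode` is hypothetical, and feature indices are 1-based to match f1, f3, f6 and f7 in the text.

```python
def decode(sequence):
    """Return the 1-based indices of the features whose base value is even."""
    return [i for i, base in enumerate(sequence, start=1) if base % 2 == 0]

# Post-reaction individual F1 from the example in the text.
f1_after = [0, 1, 0, 3, 3, 0, 2, 3]
print(decode(f1_after))  # [1, 3, 6, 7]
```

Under this rule the four base values collapse to a binary selected/unselected decision, so two sequences that differ only within the same parity decode to the same feature subset.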

Specifically, step S3 in this embodiment includes: first, for each DNA sequence in the child population S, decoding the bases into flow characteristics to obtain a feature subset, performing attack classification detection based on the feature subset, and calculating the classification performance index of the feature subset; then performing DNA biochemical reactions on the DNA sequences in the child population S based on the classification performance indexes, wherein the DNA biochemical reactions include some or all of: crossover between different DNA sequences, variation (also called mutation) within the same DNA sequence, and reverse order (also called inversion) within the same DNA sequence.

As an optional implementation, as shown in fig. 3, this embodiment performs crossover, variation and reverse order in sequence. If the new parent population has not reached the target value or the iteration count is insufficient, the simulated DNA biochemical reactions of crossover, variation and reverse order are repeated until the requirement is met. Through this cycle, the DNA population comes to cover diverse cases, the objective value of each individual moves closer to the optimum, and the average objective value keeps improving. The iteration ends when the fitness of the DNA sequences no longer changes after the optimal solution is found, or when a set limit is reached.

Crossover is a biochemical reaction between DNA sequences: first, several pairs of DNA sequences are randomly selected and a crossover position is randomly generated for each pair; the contents of each pair are then interchanged at the crossover position to generate a new pair of DNA sequences. In this way, the genes of the DNA population are substantially altered. Crossover can be performed as single-point crossover, sub-path crossover, standard crossover, and so on.
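A single-point crossover of the kind described in this paragraph can be sketched as follows. The function name is hypothetical, and the sketch assumes both sequences have the same length of at least two bases.

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length sequences after a random cut position."""
    cut = rng.randrange(1, len(a))  # cut in 1..len-1 so both parts are non-empty
    return a[:cut] + b[cut:], b[:cut] + a[cut:], cut

parent_a = [0, 2, 3, 1, 0, 3, 1, 0]
parent_b = [1, 0, 3, 1, 2, 0, 1, 3]
child_a, child_b, cut = single_point_crossover(parent_a, parent_b, random.Random(0))
```

Each position of the two children holds exactly the two bases the parents had at that position, so crossover recombines existing feature states without creating new ones; new states come from variation.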

As a preferred embodiment, as shown in fig. 4, the crossover between different DNA sequences in this example includes:

step 1, randomly selecting a plurality of pairs of DNA sequences, for example, selecting N pairs of DNA sequences in the embodiment;

step 3, judging whether the distance D of the current DNA sequence pair is larger than a set value D, and if so, skipping to the step 1; otherwise, skipping and continuing to execute downwards; wherein the set value d is a preset inter-sequence distance threshold value, and the range of the set value d is 0-1.7. The computational function expression for the distance D of the current DNA sequence pair is:

in the above formula, X _i And X _j Two DNA sequences representing the current DNA sequence pair, p being the length of the DNA sequence, x _ik Is a DNA sequence X _i The kth base of (1), x _jk Is a DNA sequence X _j The kth base of (1).

Step 4, calculating cross probability Pm according to the classification performance indexes, generating a random number Pr, and if the condition Pr is less than Pm, performing base transposition on the current DNA sequence pair; otherwise, performing base inversion for the current DNA sequence pair; wherein, the function expression for calculating the cross probability Pm according to the classification performance index is as follows:

in the above formula, K _i And K ₂ Is a constant parameter, K ₁ And K ₂ Has a value range of [0,p]P is the dimension of the flow characteristic; f. of _max The optimal classification performance index corresponding to all DNA sequences in the filial generation population S is shown as f', the average classification performance index after the current DNA sequence pair is crossed is shown as f _avg The average classification performance indexes corresponding to all DNA sequences in the filial generation population S; wherein f' is obtained by decoding bases into flow characteristics aiming at two DNA sequences obtained after the current DNA sequence pair is crossed, carrying out attack classification detection on the characteristic subsets based on the characteristic subsets, and calculating the average value of classification performance indexes of the two characteristic subsets;

Because the crossover probability Pm is calculated from the classification performance indexes, Pm increases automatically when the performance indexes of the DNA population are concentrated or stuck at a local optimum, and decreases automatically otherwise. Pm therefore follows the changes in the performance indexes, which gives a better variation effect. If Pm = 3, the crossover of DNA sequence $F_1$ with DNA sequence $F_n$ can be represented as:

Variation (mutation) is a biochemical reaction within an individual DNA sequence. First, DNA sequences are selected from the DNA population according to the variation probability Pc, and then bases in the selected DNA sequences are randomly chosen for variation. Base variation in a DNA sequence comprises substitution, deletion and insertion of bases. There are two main types of base substitution: one is a transition between bases of the same type, such as A replaced by G or T replaced by C; the other is a substitution between bases of different types, such as A replaced by T.

In this embodiment, variation within the same DNA sequence includes: first selecting a DNA sequence according to the variation probability Pc, and then randomly selecting bases in the selected DNA sequence for base variation, wherein base variation comprises base substitution, deletion and insertion, and base substitution comprises transitions between bases of the same type and substitutions between bases of different types; wherein the function expression for calculating the variation probability Pc is:

In the above formula, $K_3$ and $K_4$ are constant parameters with value range [0, p], where p is the dimension of the flow features; $f_{max}$ is the best classification performance index over all DNA sequences in the offspring population S, $f$ is the classification performance index of the selected DNA sequence, and $f_{avg}$ is the average classification performance index over all DNA sequences in the offspring population S. In this embodiment, the classification performance index is introduced into the calculation of the variation probability Pc, so that Pc changes automatically with the classification performance index: when the performance indexes of the DNA population are concentrated or stuck at a local optimum, Pc increases automatically; otherwise it decreases automatically.
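The variation step described above can be sketched as follows. This is only an illustrative sketch: the formula for Pc is not reproduced in this text, so the probability is passed in as a plain argument `pc`, and the function name `mutate`, the equal weighting of the four variation kinds, and the string encoding of sequences are all assumptions. Deletion and insertion change the sequence length; the text does not say how a length-p encoding is restored afterwards, so the sketch leaves the length changed.

```python
import random

BASES = "ATCG"
# Transition partners within the same base type (A<->G, T<->C), as in the text.
TRANSITION = {"A": "G", "G": "A", "T": "C", "C": "T"}


def mutate(seq, pc, rng):
    """Return a varied copy of seq with probability pc, otherwise seq itself."""
    if not seq or rng.random() >= pc:
        return seq
    pos = rng.randrange(len(seq))
    # Assumption: the four kinds of variation are equally likely.
    kind = rng.choice(["transition", "transversion", "deletion", "insertion"])
    if kind == "transition":
        return seq[:pos] + TRANSITION[seq[pos]] + seq[pos + 1:]
    if kind == "transversion":
        # Substitute a base of the other type (e.g. A -> T or A -> C).
        other = [b for b in BASES if b not in (seq[pos], TRANSITION[seq[pos]])]
        return seq[:pos] + rng.choice(other) + seq[pos + 1:]
    if kind == "deletion":
        return seq[:pos] + seq[pos + 1:]
    # Insertion of a random base before position pos.
    return seq[:pos] + rng.choice(BASES) + seq[pos:]
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when comparing fitness values across generations.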

Inversion (reverse order) is likewise a biochemical reaction within an individual DNA sequence; its aim is to find a better ordering of the bases. First, several DNA sequences are randomly selected from the DNA population according to the probability PI, and then two positions in each selected DNA sequence are randomly chosen for base-order inversion. In this embodiment, inversion within the same DNA sequence includes: first selecting a DNA sequence according to the inversion probability PI, and then randomly selecting two positions in the selected DNA sequence for base-order inversion, wherein the function expression for calculating the inversion probability PI is:

In the above formula, $K_5$ is a constant parameter with value range [0, p], where p is the dimension of the flow features; $f_{max}$ is the best classification performance index in the offspring population S, and $f_i$ is the classification performance index of the selected DNA sequence. In this embodiment, the classification performance index is introduced into the calculation of the inversion probability PI, so that PI changes automatically with the classification performance index: when the performance indexes of the DNA population are concentrated or stuck at a local optimum, PI increases automatically; otherwise it decreases automatically.
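A minimal sketch of the inversion step, under stated assumptions: the formula for PI is not reproduced in this text, so the probability is passed in as `pi`; the name `invert` is hypothetical; and "inversion at two positions" is interpreted here as reversing the segment between the two chosen positions (inclusive), which is one plausible reading rather than the confirmed one.

```python
import random


def invert(seq, pi, rng):
    """With probability pi, reverse the bases between two random positions."""
    if len(seq) < 2 or rng.random() >= pi:
        return seq
    # Two distinct positions, ordered so that i < j.
    i, j = sorted(rng.sample(range(len(seq)), 2))
    return seq[:i] + seq[i:j + 1][::-1] + seq[j + 1:]
```

Unlike deletion or insertion, inversion preserves both the length and the base composition of the sequence; only the order of the bases, and therefore which features they select, changes.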

The fitness of a DNA sequence measures whether the sequence has a survival advantage in the DNA population. In this embodiment, the classification performance index is used as the fitness of the DNA sequence, and the classification performance index in step S4 includes the actual detection accuracy, the imbalance index σ and the optimal feature index Γ (only some of these may also be adopted as required). In addition to the actual detection accuracy, the imbalance index σ and the optimal feature index Γ directly reflect the strengths and weaknesses of the method in overall accuracy, feature subset optimization and detection balance. Moreover, with three classification performance indexes, this embodiment forms a multi-objective feature selection algorithm. Compared with traditional multi-objective search schemes, the scheme of this embodiment can not only distinguish normal traffic from abnormal traffic but also distinguish different types of abnormal traffic.

Since the evolutionary search of DNA essentially uses no external information and searches only on the basis of fitness, the classification performance index plays a decisive role in the evolution of the DNA population. To obtain an accurate fitness quickly and effectively, in this embodiment the attack classification detection based on the feature subset uses the K-nearest-neighbor (KNN) classification method as the main method, assisted by hyper-parameter optimization, and the classification performance index of the feature subset is calculated as the fitness of each DNA sequence.

As shown in Fig. 5, the flow data set is first preprocessed so that it is suitable for training the KNN-based fitness calculation model. Next, the feature subset is selected according to the base values of the DNA sequence and fed into the fitness calculation model for training. The fitness calculation model is then optimized using a hyper-parameter optimization method such as random search to obtain the fitness of the DNA sequence. The specific process comprises the following steps:

Step 1, data preprocessing.

Raw flow data is largely dirty data that is incomplete, inconsistent and heavily contaminated by noise, and it must be preprocessed before training and detection. To improve data quality, this step applies two operations to prepare the training and detection data:

(1) Min-max normalization: in the cleaned data set, different features still have different value ranges and often different distributions, so they cannot be compared with each other directly. To make the features comparable, all features are converted into a consistent value range by normalization, so that every feature has a consistent weight in the detection model, which improves the quality of the data set and the convergence speed of model training. Normalization is essentially a linear transformation; it does not substantially change the features and retains the original information of the DNA sequence. In this embodiment, the normalization function expression is:

In the above formula, $Y'_{ij}$ is the normalized value of $Y_{ij}$, an integer ranging from 0 to N; $Y_{ij}$ is the j-th flow feature of the i-th sample; $\min(Y_j)$ is the minimum value of the j-th flow feature $Y_j$; $\max(Y_j)$ is the maximum value of the j-th flow feature $Y_j$; and round denotes rounding.
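The normalization formula itself is not reproduced in this text. Judging from the symbol definitions (an integer result in the range 0 to N, the column minimum and maximum, and a rounding function), a plausible form is $Y'_{ij} = \mathrm{round}\left(N \cdot \frac{Y_{ij} - \min(Y_j)}{\max(Y_j) - \min(Y_j)}\right)$. The sketch below implements that assumed form for one feature column; the function name and the handling of a constant column are assumptions, not part of the source.

```python
def min_max_normalize(column, n_levels):
    """Scale one feature column to integers in [0, n_levels] (assumed form)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0 for _ in column]
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [round(n_levels * (y - lo) / (hi - lo)) for y in column]
```

For example, `min_max_normalize([2.0, 4.0, 6.0], 10)` maps the smallest value to 0, the largest to 10 and the midpoint to 5.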

(2) Removing fuzzy values: in the normalized detection data set, the correspondence between some feature value sets and their labels is wrong or ambiguous. For example, in $S_1$ some feature values are illegal numbers, and in $S_2$ the attack label corresponding to the flow feature value set is null; data sets containing either of these correspondences are called error sets. If, as with $S_3$ and $S_4$, the same feature value set corresponds to two or more attack labels, the corresponding data set is called a fuzzy set. Both kinds of data set reduce the robustness of the detection system. Therefore, after normalization the error sets and fuzzy sets are deleted, leaving only the definite data sets in which feature value sets and labels correspond one to one, which improves the robustness of the attack detection system.
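The removal of error sets and fuzzy sets can be sketched as below. The data layout (a list of `(features, label)` pairs with features as a tuple), the function name, and the concrete definition of an "illegal number" (None, NaN or infinity) are assumptions made for illustration; the text does not specify them.

```python
import math
from collections import defaultdict


def remove_error_and_fuzzy(samples):
    """Drop error sets (illegal values, empty labels) and fuzzy sets."""
    labels_by_features = defaultdict(set)
    valid = []
    for features, label in samples:
        illegal = any(
            v is None
            or (isinstance(v, float) and (math.isnan(v) or math.isinf(v)))
            for v in features
        )
        if illegal or label is None or label == "":
            continue  # error set: illegal feature value or missing label
        valid.append((features, label))
        labels_by_features[features].add(label)
    # Fuzzy set: the same feature values map to two or more different labels.
    return [(f, l) for f, l in valid if len(labels_by_features[f]) == 1]
```

Exact duplicates with the same label are kept, since they are consistent; only feature value sets with conflicting labels are removed.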

Step 2, feature selection.

To select a suitable feature subset for training, features are screened according to the following rules.

Assume that the size of the DNA population is n and the number of flow features to be detected is p. Let $X_i^t$ denote the i-th DNA sequence of the t-th generation, $i \in \{1, 2, 3, \ldots, n\}$. Each DNA sequence is represented as a p-digit quaternary string (A = 0, T = 1, C = 2, G = 3), so each individual is $X_i^t = (B^1, B^2, \ldots, B^p)$ with $B^k \in \{0, 1, 2, 3\}$. The features selected by $X_i^t$ are then:

$$\gamma^i = \sum \{\, k \mid \operatorname{mod}(B^k, 2) = 1 \,\},$$

In the above formula, $\gamma^i$ denotes the feature subset consisting of the features selected by $X_i^t$, mod is the remainder function, and $k \in \{1, 2, \ldots, p\}$. In other words, feature k is selected when the k-th base has an odd value (T or G).
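The decoding rule above (feature k is selected when the value of the k-th base is odd under the coding A = 0, T = 1, C = 2, G = 3) can be sketched as follows. The function name and the use of a Python string for the DNA sequence are assumptions for illustration.

```python
# Base coding given in the text: A = 0, T = 1, C = 2, G = 3.
BASE_VALUE = {"A": 0, "T": 1, "C": 2, "G": 3}


def decode_feature_subset(seq):
    """Return the 1-based indices k of features whose base value is odd."""
    return [k for k, base in enumerate(seq, start=1)
            if BASE_VALUE[base] % 2 == 1]
```

With this coding, the bases T and G select a feature and the bases A and C leave it out, so two different DNA sequences can decode to the same feature subset.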

Step 3, optimizing the hyper-parameters of the classifier.

Parameters that can be adjusted in the training of machine learning and similar algorithms are generally called hyper-parameters, and the values chosen for them have a great influence on the detection performance of the model. Hyper-parameter tuning can be regarded as a "learning" problem, i.e. learning a model G such that loss = G(λ), where loss is the best value of the loss function that the model can reach after training with a given hyper-parameter λ. With such a model, the best loss for any hyper-parameter λ can be predicted, and the value of λ that yields the best loss can then be found. The formula for solving the optimal hyper-parameter λ is:

$$\lambda^{(*)} = \operatorname*{argmin}_{\lambda \in \Lambda} \; E_{x \sim G_x}\left[ L\left(x;\, A_\lambda(X^{train})\right) \right],$$

In the above formula, $\lambda^{(*)}$ is the solution for the hyper-parameter λ, Λ is the admissible range of the hyper-parameter λ, $E_{x \sim G_x}[L(x; A_\lambda(X^{train}))]$ is the generalization error, $E_x$ is the expected value of the generalization error, $G_x$ is the distribution function, L is the loss function, x is the training data, A is the function to be optimized, $A_\lambda$ is the function to be optimized with the hyper-parameter λ selected, and $X^{train}$ is the training data set.
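Random search, which the text names as one possible hyper-parameter optimization method, approximates the argmin above by sampling candidate values of λ from Λ and keeping the one with the lowest observed loss. The sketch below is a generic, library-free illustration: `random_search`, `toy_loss` and the search space are hypothetical, and `toy_loss` merely stands in for the validation loss of a KNN classifier trained with the sampled hyper-parameters.

```python
import random


def random_search(loss_fn, space, n_trials, rng):
    """Sample n_trials hyper-parameter settings; return the best and its loss."""
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyper-parameter from its admissible range.
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = loss_fn(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_loss(params):
    # Hypothetical stand-in for a measured validation loss.
    penalty = 0 if params["weights"] == "distance" else 1
    return (params["k"] - 5) ** 2 + penalty


space = {"k": list(range(1, 16)), "weights": ["uniform", "distance"]}
best, loss = random_search(toy_loss, space, 200, random.Random(0))
```

In a real pipeline `loss_fn` would train the classifier on the decoded feature subset and return its measured error, so each call is expensive and the number of trials becomes the main cost parameter.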

In this embodiment, when performing attack classification detection based on the feature subset, the function expression of the adopted classifier is as follows:

In the above formula,

The mapping relationship between the two; the calculation function expression of the actual detection accuracy is as follows:

the computational function expression of the imbalance index σ is:

In the above formula, n is the number of attack categories, and E(i) is the detection accuracy expected for the i-th attack category during attack classification detection. The imbalance index σ measures how far the detection accuracy of each category deviates from the expected accuracy: the closer σ is to 1, the more serious the classification imbalance in the detection system; conversely, the smaller σ is, the less classification imbalance there is. The function expression for calculating the optimal feature index Γ is:

In the above formula, γ is the feature subset and P is the dimension of the flow features. The optimal feature index Γ is used to reduce, as far as possible, the number of features that must be trained and detected without degrading the performance of the detection system: the closer Γ is to 1, the worse the feature screening capability of the intrusion detection system, and conversely the better the feature screening. The screening function expression of the feature subset γ is:

In the above formula, k(i) indicates whether flow feature i is selected: k(i) = 0 means that flow feature i is not selected; otherwise flow feature i is selected.

Step S6 in this embodiment includes: performing constrained-domination sorting on all DNA sequences based on fitness using the NSGA3 algorithm; and selecting a specified number of DNA sequences according to the result of the constrained-domination sorting, using the Niche-Preservation operation, to form a new parent population P.

Steps S4 to S6 constitute population selection, whose process is shown in Fig. 6. First, the parent and offspring populations are merged into a new population Q, and it is judged whether all DNA sequences in the population Q have been sorted. If they have not, the DNA sequences are decoded to calculate their fitness; reference points are then generated according to the detection objectives, reference lines are drawn and the associated reference points are found; finally, constrained-domination sorting is performed on all DNA sequences based on the fitness and the reference points, in the same way as the sorting and selection of the NSGA3 algorithm. If they have been sorted, the parent population of the new generation is selected by the Niche-Preservation operation (see: Aguilar-Rivera A. A GPU fully vectorized approach to accelerate performance of NSGA-2 based on stochastic non-domination sorting and grid-crowding. Applied Soft Computing 2020, 88, 106047.).

The NSGA3 algorithm uses the well-known systematic approach of Das and Dennis to determine the set of reference points used in each generation. This approach places the reference points on a normalized hyperplane that is equally inclined to all objective axes and has an intercept on each objective axis. The reference points lie on an (M-1)-dimensional hyperplane, where M is the dimension of the objective space, i.e. the number of classification performance indexes being optimized. If each objective is divided into H parts, the number of reference points Cd is:

$$C_d = \binom{H + M - 1}{H}$$

In the above formula, $\binom{H+M-1}{H}$ is the number of combinations of H items taken from H + M - 1 items. For example, for a problem with M = 3 and H = 5, the reference points form a triangle, and the formula yields 21 reference points. The reference points created in this way are widely and uniformly distributed over the normalized hyperplane, as shown in Fig. 7. However, the computation and storage overhead of this approach grows rapidly when the number of objectives is large. For example, when H and M are both 10, the formula yields 92378 reference points. To solve this problem, the reference point generation method of Deb and Jain can be adopted, whose main idea is to generate two layers of reference-point hyperplanes, an inner layer and an outer layer, thereby reducing the number of reference points while keeping them widely distributed.
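The two counts quoted above (21 reference points for M = 3, H = 5 and 92378 for M = H = 10) are consistent with the binomial coefficient C(H + M - 1, H). The sketch below checks that count and enumerates the points themselves as all vectors whose coordinates are multiples of 1/H and sum to 1. The function names are hypothetical and the enumeration is a plain recursive construction, not code from the source.

```python
from math import comb


def reference_point_count(m, h):
    """Number of Das-Dennis reference points for m objectives, h divisions."""
    return comb(h + m - 1, h)


def das_dennis_points(m, h):
    """Enumerate all points with coordinates in {0, 1/h, ..., 1} summing to 1."""
    points = []

    def fill(prefix, remaining, axes_left):
        if axes_left == 1:
            # The last coordinate takes whatever is left of the h divisions.
            points.append(tuple(v / h for v in prefix + [remaining]))
            return
        for v in range(remaining + 1):
            fill(prefix + [v], remaining - v, axes_left - 1)

    fill([], h, m)
    return points
```

Calling `das_dennis_points(3, 5)` returns 21 points on the triangle with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1), while the count for M = H = 10 shows why a two-layer scheme is preferred for many objectives.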

The purpose of the reference points is to establish a mapping (i.e. a perpendicular distance) between each individual in the population and its corresponding reference point, so that the population evolves toward the reference points and its distribution becomes more uniform. In Fig. 7, an "ideal" hyperplane has been created, and reference points are then created on it. This hyperplane is the Pareto front to be found, so the population must evolve continuously toward the Pareto front, which requires constructing the mapping between the individuals of the population and the reference points.

To associate reference points, the perpendicular distance from each DNA sequence to each reference line is calculated first. The horizontal distance $d_{j,1}(x)$ and the perpendicular distance $d_{j,2}(x)$ from the DNA sequence x to the j-th reference line $L_j$ are calculated as:

In the above formula, $Z^j$ is the coordinate vector of the reference line $L_j$, and $(F(x))^T Z^j$ is the inner product of the performance index vector F(x) and $Z^j$. Associating a reference point uses the perpendicular distance $d_{j,2}(x)$ to find the reference point closest to the DNA sequence x. The horizontal distance $d_{j,1}(x)$ from an individual to its associated reference point then characterizes convergence: the smaller the value, the better the convergence. The perpendicular distance $d_{j,2}(x)$ from the DNA sequence to its associated reference point characterizes diversity: the smaller the value, the better the diversity.
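The distance formulas are not reproduced in this text. A common form in reference-line-based methods, assumed here, projects F(x) onto the reference line: $d_{j,1}(x) = (F(x))^T Z^j / \lVert Z^j \rVert$ and $d_{j,2}(x) = \lVert F(x) - d_{j,1}(x)\, Z^j / \lVert Z^j \rVert \rVert$. The sketch below implements that assumed form and the association step; the function names are hypothetical.

```python
import math


def line_distances(f, z):
    """Return (d1, d2): distance along line z and perpendicular distance to it."""
    norm_z = math.sqrt(sum(c * c for c in z))
    # d1: length of the projection of f onto the direction of z.
    d1 = sum(a * b for a, b in zip(f, z)) / norm_z
    # Foot of the perpendicular from f onto the reference line.
    foot = [d1 * c / norm_z for c in z]
    d2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(f, foot)))
    return d1, d2


def associate(f, reference_lines):
    """Index of the reference line with the smallest perpendicular distance."""
    return min(range(len(reference_lines)),
               key=lambda j: line_distances(f, reference_lines[j])[1])
```

For a performance vector (3, 4) and the reference line along (1, 0), the projection length is 3 and the perpendicular distance is 4, so a line along (1, 1) would be the closer association.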

The constrained-domination relation is defined as follows: for J reference lines, j = 1, 2, ..., J, and any two given DNA sequences $X_a$ and $X_b$, $X_a$ is considered to dominate $X_b$ if the following two conditions are both satisfied:

(I) For all $j \in \{1, 2, \ldots, J\}$, both $d_{j,1}(X_a) \le d_{j,1}(X_b)$ and $d_{j,2}(X_a) \le d_{j,2}(X_b)$ hold.

(II) There exists $j \in \{1, 2, \ldots, J\}$ such that $d_{j,1}(X_a) < d_{j,1}(X_b)$ or $d_{j,2}(X_a) < d_{j,2}(X_b)$ holds.
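The two conditions can be checked directly once the distances of both sequences to every reference line are known. In the sketch below each sequence is represented by a list of `(d1, d2)` pairs, one per reference line; this representation and the function name are assumptions for illustration.

```python
def dominates(d_a, d_b):
    """True if sequence a dominates sequence b under conditions (I) and (II).

    d_a and d_b are equal-length lists of (d1, d2) pairs, one per reference line.
    """
    pairs = list(zip(d_a, d_b))
    # (I): a is no worse than b on both distances for every reference line.
    no_worse = all(a1 <= b1 and a2 <= b2 for (a1, a2), (b1, b2) in pairs)
    # (II): a is strictly better on at least one distance for some line.
    strictly_better = any(a1 < b1 or a2 < b2 for (a1, a2), (b1, b2) in pairs)
    return no_worse and strictly_better
```

The relation is a strict partial order: a sequence never dominates itself, and two sequences that each win on a different distance are mutually non-dominated and end up in the same layer.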

The reference-point-based constrained-domination relation uses the reference points to divide the objective space into several sub-regions, uses the objective function values of an infeasible solution to locate the sub-region in which it lies, and uses a niche mechanism to judge how sparse that sub-region is. The objective function values of an infeasible solution represent its position in the objective space, and the distribution of these positions plays an important role in the diversity of the population, so infeasible solutions in sparse regions need to be selected to improve the diversity of the population.

The constrained-domination sorting of all DNA sequences based on fitness using the NSGA3 algorithm includes: first, according to the constrained-domination relation, putting the DNA sequences of the top-ranked layers into the next-generation population $P_{t+1}$ until the number of individuals in $P_{t+1}$ is greater than or equal to n. Let $F_l$ be the last non-dominated layer, i.e. the layer at which the number of individuals in $P_{t+1}$ first reaches or exceeds n; $\{F_1, F_2, \ldots, F_{l-1}\}$ are the preceding non-dominated layers, containing $n_1$ individuals in total, and layer $F_l$ contains $n_2$ individuals. If $n_1 + n_2 = n$, selection stops and the population $P_{t+1} = \{F_1, F_2, \ldots, F_l\}$; otherwise, $n - n_1$ DNA sequences are selected from layer $F_l$ so that the size of $P_{t+1}$ equals n.
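The layer-by-layer filling described above can be sketched as follows. The sketch assumes the non-dominated layers have already been computed and are passed in as a list of lists ordered from best to worst; the function name is hypothetical, and the niche-based choice inside the last layer is deliberately left out, so the function only reports how many sequences still have to be picked from that layer.

```python
def fill_next_population(fronts, n):
    """Accept whole layers while they fit into a population of size n.

    Returns (accepted, last_front, k): the sequences accepted so far, the
    layer F_l that does not fit entirely, and the number k of sequences that
    must still be chosen from it (k = n - n_1 in the notation of the text).
    """
    accepted = []
    for front in fronts:
        if len(accepted) + len(front) <= n:
            accepted.extend(front)
            if len(accepted) == n:
                return accepted, [], 0  # n_1 + n_2 = n: selection stops
        else:
            return accepted, front, n - len(accepted)
    return accepted, [], 0
```

The k remaining sequences would then be chosen from the returned layer by the niche-preservation step, which favors sequences associated with sparsely populated reference points.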

The computational complexity of the intelligent attack detection method based on DNA calculation of this embodiment is as follows. Consider a population of size 2N with an M-dimensional objective vector, and let the number of flow samples be Z and the feature dimension be P. The cost of the DNA biochemical reactions such as crossover, variation and inversion is at most O(N). When calculating the classification performance indexes, data preprocessing requires O(Z) operations and feature selection requires O(NP) operations, so the calculation of the classification performance indexes requires O(NZP) operations. Furthermore, non-dominated sorting requires $O(N \log^{M-2} N)$ operations, and reference point generation and association require O(MNH) operations, so the cost of population selection is the larger of $O(N^2 \log^{M-2} N)$ and $O(N^2 M)$. The computational complexity of the method of this embodiment is therefore the largest of O(NZP), $O(N^2 \log^{M-2} N)$ and $O(N^2 M)$. In general, however, Z > N, so the computational complexity of ADDC is essentially determined by the complexity O(NZP) of the classification performance index calculation, and applying the reference-point-based non-dominated sorting multi-objective optimization algorithm has little effect on the cost of machine-learning detection algorithms that use hyper-parameter optimization.

Network intrusion detection data sets can be divided into seven categories, including data sets based on network traffic, data sets based on power grids and data sets based on internet traffic. The data sets used to verify the performance of the algorithm are mainly based on network traffic; the commonly used ones are CICDDoS2019, DoHBrw-2020, NSL-KDD and UNSW-NB15. The software and hardware specifications of the experiment in this embodiment are listed in Table 1.

Table 1: software and hardware specification table of experiment.

Since the amount of data in the data set is large enough, we chose about 10% of the data as the test set. The 4 data sets were randomly divided into a training data set and a test data set using a 90: 10 ratio in this experiment.

The preset parameter values of each data set during training are shown in Table 2, where the iteration count is the number of iterations of the algorithm and the population size is the number of DNA sequences in the initialized parent population.

Table 2: a parameter value setting table.

In this experiment, the method of this embodiment (ADDC) is compared with the following existing methods: RS-KNN-CFS, TPE-KNN-CFS, Anneal-KNN-CFS, RS-KNN-IGFS, Anneal-KNN-IGFS, TPE-KNN-IGFS, RS-RF-CFS, Anneal-RF-CFS, TPE-RF-CFS, RS-RF-IGFS, Anneal-RF-IGFS, TPE-RF-IGFS, GMGWO-ECAE and BRS. The performance of each detection method is evaluated by comparing two-class and multi-class accuracy on the CICDDoS2019, DoHBrw-2020, NSL-KDD and UNSW-NB15 data sets, as shown in tables 3-111. Two-class classification distinguishes a normal type and an abnormal type, where the abnormal type treats all attack types as a single class. Multi-class classification includes the normal type and the individual attack types; for example, the 4 classes on the DoHBrw-2020 data set include 3 attack types, with classes such as normal, Non-DoH and Malicious-DoH.

The experiments show the following. (1) In two-class classification, the method of this embodiment (ADDC) has better overall performance and maintains high detection accuracy on normal data. For example, for normal data, ADDC reaches a detection accuracy of 100% on the CICDDoS2019 data set, where detection performance is best, and on the UNSW-NB15 data set, where detection performance is worst, it improves on the best of the other ADSs (intrusion detection systems) by more than 10%. On the other data sets its accuracy stays above 99.5%, an improvement over the other algorithms in every case. For abnormal data, because the DoHBrw-2020 data set has less abnormal data (normal data: abnormal data is 141696. ADDC basically maintains a detection accuracy above 50%, which shows that its detection balance is stronger than that of the other ADSs and that it can detect attacks the other ADSs cannot. In overall performance, the detection accuracy of ADDC is above 99.4%, better than that of the other ADSs. (2) In multi-class detection, the detection performance of some IGFS-based IDSs drops markedly; in particular, on the CICDDoS2019 data set, which contains 11 attacks, their accuracy even falls below 30%. ADDC remains basically stable: its overall accuracy is below 90% only on the DoHBrw-2020 data set (where it still improves on the best ADS by more than 3%) and is above 90% on the rest. In the multi-class UNSW-NB15 data set, the class ratio Analysis : Backdoor : DoS : Exploits : Fuzzers : Generic : Normal : Reconnaissance : Shellcode : Worms is 397 : 319 : 2458 : 6724 : 3572 : 8195 : 11268 : 2177 : 219 : 30. The detection accuracy of the other IDSs on the 5 classes with few instances (fewer than 400, a proportion below 1%), such as Analysis, Backdoor, Reconnaissance, Shellcode and Worms, is very low or even 0, while ADDC keeps a high detection accuracy on 3 classes such as Analysis, Reconnaissance and Shellcode, and its overall detection rate is more than 10% higher than that of the best of the other IDSs, which shows that its detection performance and balance are superior to those of the other IDSs. In summary, in both two-class and multi-class classification, an ADS based on ADDC improves the overall detection accuracy of the system and maintains a degree of balance on imbalanced data sets in which some types have few instances, thereby effectively detecting most attack types with few instances. This is because DNA calculation effectively enriches the population while avoiding premature convergence and escaping local optima, which improves the detection capability for attack types with few instances, whereas the other algorithms overfit to the categories with many instances, leading to imbalanced detection performance.

Although the ADS proposed here shows the best performance when assessed by indicators such as overall accuracy and the detection rates of normal and abnormal data, these indicators are not reliable for classification problems in which the training and test data sets are imbalanced. Therefore, in this experiment the balance of each ADS on the four large data sets is also evaluated by the imbalance index σ and the optimal feature index Γ, where γ is the feature subset, σ is the imbalance index and Γ is the optimal feature index. Because Γ is small on some data sets, Γ is multiplied by 100. The experiments show that, on the DoHBrw-2020 data set, the imbalance index of ADDC is reduced by more than 50% compared with the best of the other ADSs, which shows that ADDC maintains well-balanced performance in both two-class and multi-class classification. In two-class classification, ADDC requires more features than some ADSs only on the NSL-KDD data set and in comparison with the GMGWO algorithm; on the remaining data sets it requires fewer features than the other ADSs. The optimal feature index of ADDC, Γ < 0.001, is lower than that of the other IDSs on all data sets, and in particular Γ = 0 on the CICDDoS2019 and DoHBrw-2020 data sets. In multi-class classification, the optimal feature index Γ of ADDC also stays at a low level, a large improvement over the best of the other ADSs. This shows that, although ADDC selects only a few features, the association between these features and the labels is better than for the other ADSs and ensures higher balance and accuracy.

The experimental results show that ADDC not only makes full use of DNA calculation to enrich the population and escape local optima in search of the global optimum, but also inherits the multi-objective optimization capability of the NSGA3 algorithm so that attack types with few instances are taken into account, and uses hyper-parameter optimization of the detection system to balance the detection of the various attack types at reasonable accuracy. The two-class results show that ADDC maintains a high level of overall, normal and abnormal accuracy on all data sets. Although on some data sets its feature subset is larger than those selected by GMGWO and BRS, it is clearly stronger in accuracy, imbalance index and optimal feature subset, so ADDC provides a more reasonable feature set than the other detection methods. Analysis of the multi-class results shows that existing detection methods cannot provide effective detection accuracy when the numbers of instances per attack class are highly uneven, whereas ADDC handles this case better and provides better balance and feature subsets. These results show that the ADDC model is balanced and stable regardless of the size of the data set. Compared with existing detection methods, the intelligent attack detection method based on DNA calculation improves the overall multi-class accuracy by up to about 10%, reduces the imbalance index by up to 0.5 and detects 1.5 more attack types on average, and improves the optimal feature subset index by up to more than 83.8%.

In summary, the intelligent attack detection method based on DNA calculation of this embodiment comprises three main steps: DNA calculation, classification performance index calculation and population selection. DNA calculation consists of DNA encoding and biochemical reactions: feature sets are first randomly initialized as a parent population of DNA sequences, all encoded with the 4 types of DNA bases, and new individuals and the offspring population are then formed through DNA biochemical reactions such as crossover, variation and inversion. Classification performance index calculation first decodes each DNA sequence and selects the features indicated by its bases as the feature subset, then optimizes the detection model with a hyper-parameter optimization method, and finally calculates the accuracy of each category, the imbalance index and the optimal feature subset index, which together serve as the fitness of the DNA sequence. Population selection first merges the parent and offspring populations, then sorts all DNA sequences by fitness using the reference-point-based non-dominated sorting method, and finally selects the parent population of the next generation according to the sorting result. These three steps are repeated until the preset number of iterations is reached or the fitness of the DNA sequences no longer changes.

In addition, this embodiment also provides an intelligent attack detection system based on DNA calculation, which includes a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned intelligent attack detection method based on DNA calculation. Furthermore, this embodiment also provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the steps of the aforementioned intelligent attack detection method based on DNA calculation.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention should also be considered as within the protection scope of the present invention.

Claims

1. An intelligent attack detection method based on DNA calculation, characterized by comprising the following steps:

S7, judging whether the number of iterations Gen equals the set maximum number of iterations GenMax or whether the fitness of each DNA sequence no longer changes; if neither condition holds, adding 1 to Gen and jumping to step S2; otherwise, selecting an optimal DNA sequence from the new parent population P according to fitness, and decoding the optimal DNA sequence into flow characteristics serving as the optimal characteristic subset.

2. The intelligent attack detection method based on DNA calculation according to claim 1, wherein step S3 comprises: first, for each DNA sequence in the offspring population S, decoding the bases into flow characteristics to obtain a characteristic subset, performing attack classification detection based on the characteristic subset, and calculating a classification performance index of the characteristic subset; and then performing DNA biochemical reactions on the DNA sequences in the offspring population S based on the classification performance indexes, wherein the DNA biochemical reactions comprise some or all of the crossing between different DNA sequences, the variation within the same DNA sequence, and the reverse order within the same DNA sequence.

3. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the crossing between different DNA sequences comprises:

step 1, randomly selecting a plurality of pairs of DNA sequences;

in the above formula, K₁ and K₂ are constant parameters with a value range of [0, p]; P is the dimension of the flow characteristic; f_max is the optimal classification performance index among all DNA sequences in the offspring population S; f′ is the average classification performance index after the current DNA sequence pair is crossed; and f_avg is the average classification performance index of all DNA sequences in the offspring population S; wherein f′ is obtained by decoding the bases of the two DNA sequences produced by crossing the current DNA sequence pair into flow characteristics, performing attack classification detection based on each resulting characteristic subset, and averaging the classification performance indexes of the two characteristic subsets;

4. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the variation within the same DNA sequence comprises: first selecting a DNA sequence according to the variation probability Pc, and then randomly selecting bases in the selected DNA sequence for base variation, wherein the base variation comprises base replacement, deletion and insertion, and the base replacement comprises transition between bases of the same type and transversion between bases of different types; wherein the calculation function expression of the variation probability Pc is:

in the above formula, K₃ and K₄ are constant parameters with a value range of [0, p]; P is the dimension of the flow characteristic; f_max is the optimal classification performance index among all DNA sequences in the offspring population S; f is the classification performance index of the selected DNA sequence; and f_avg is the average classification performance index of all DNA sequences in the offspring population S.

5. The intelligent attack detection method based on DNA calculation according to claim 2, wherein the reverse order within the same DNA sequence comprises: first selecting a DNA sequence according to a reverse-order probability PI, and then randomly selecting two positions in the selected DNA sequence and reversing the order of the bases between them, wherein the calculation function expression of the reverse-order probability PI is as follows:

6. The intelligent attack detection method based on DNA calculation according to claim 1, wherein the classification performance index in step S4 comprises some or all of the actual detection accuracy, the imbalance index σ and the optimal characteristic index Γ.

7. The intelligent attack detection method based on DNA calculation according to claim 6, wherein, when attack classification detection is performed based on the feature subset, the function expression of the adopted classifier is as follows:

in the above formula, [symbol definitions not recoverable from the source text];

the calculation function expression of the imbalance index σ is as follows:

[formula not reproduced in the source text]

where [symbol not reproduced] is a feature subset and P is the dimension of the flow characteristic; the screening function expression of the feature subset is:

8. The intelligent attack detection method based on DNA calculation according to claim 1, wherein step S6 comprises: performing constraint-domination sorting on all DNA sequences based on fitness using the NSGA3 algorithm; and selecting a specified number of DNA sequences to form a new parent population P according to the result of the constraint-domination sorting using a niche-preservation operation.

9. An intelligent attack detection system based on DNA calculation, comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor is programmed or configured to perform the steps of the intelligent attack detection method based on DNA calculation according to any one of claims 1 to 8.

10. A computer-readable storage medium in which a computer program is stored, wherein the computer program is programmed or configured to be executed by a microprocessor to perform the steps of the intelligent attack detection method based on DNA calculation according to any one of claims 1 to 8.
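
The fitness-based ordering recited in claim 8 can be illustrated with a plain Pareto ranking. The sketch below is a simplified stand-in, not the NSGA3 procedure itself: it assumes every fitness component is to be maximized, omits reference points and constraint handling, and truncates the last front by index where a niche-preservation operation would be used. The function names and the sample fitness values are hypothetical.

```python
def dominates(u, v):
    """True if u is no worse than v on every objective and better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def non_dominated_fronts(fitness):
    """Group sequence indices into successive Pareto fronts."""
    remaining = set(range(len(fitness)))
    fronts = []
    while remaining:
        # A sequence joins the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(fitness[j], fitness[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def select_parents(fitness, n):
    """Fill the next parent population front by front.

    Assumption: the last front is cut by index order; a niche-preservation
    step would choose among its members instead.
    """
    chosen = []
    for front in non_dominated_fronts(fitness):
        chosen.extend(front[: n - len(chosen)])
        if len(chosen) == n:
            break
    return chosen


# Hypothetical two-objective fitness vectors for five DNA sequences.
fitness = [(0.9, 0.2), (0.8, 0.8), (0.7, 0.7), (0.95, 0.1), (0.5, 0.5)]
print(non_dominated_fronts(fitness))  # [[0, 1, 3], [2], [4]]
print(select_parents(fitness, 4))     # [0, 1, 3, 2]
```

If a fitness component is to be minimized instead, it can be negated before ranking so that the same dominance test applies.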