Disclosure of Invention
In view of this, the present invention provides a deep learning intrusion detection method based on a dual population genetic algorithm in an industrial control network to overcome the defects of the prior art.
To achieve the above purpose, the invention adopts the following technical solution: a deep learning intrusion detection method based on a double-population genetic algorithm in an industrial control network, comprising the following steps:
reading data;
preprocessing the data;
constructing a novel industrial control network intrusion detection model by using an improved double-population genetic algorithm;
and predicting whether the industrial control network has intrusion behavior by using the novel industrial control network intrusion detection model so as to obtain a prediction result.
Optionally, the preprocessing the data includes:
selecting data characteristics to determine a data set;
dividing a training set, a verification set and a test set for a data set;
carrying out Min-Max normalization or Z-Score normalization processing on the divided data set;
and carrying out One-Hot coding on the labels of the various processed data sets.
Optionally, constructing a novel industrial control network intrusion detection model includes:
determining an optimal solution by adopting an improved double-population genetic algorithm;
and putting the optimal solution into a deep neural network model to obtain a novel industrial control network intrusion detection model.
Optionally, the determining an optimal solution by using an improved double-population genetic algorithm includes:
randomly generating an initial population;
dividing the initial population into two sub-populations;
selecting the elite individuals of the two sub-populations respectively through elitism, and removing the elite individuals from the two sub-populations respectively;
dividing the two sub-populations with the elite individuals removed respectively to obtain communication individuals and populations for selection;
carrying out a selection operation, a crossover operation and a mutation operation on the two sub-populations;
and combining the elite individuals, the communication individuals and the mutated individuals of the two sub-populations respectively.
Optionally, the selecting operation performed on the two sub-populations includes:
applying a tournament selection strategy to the population for selection in the first sub-population, and applying a roulette selection strategy to the population for selection in the second sub-population.
Optionally, the crossover operation includes:
combining the communication individuals in the first sub-population and the communication individuals in the second sub-population into one population;
randomly crossing all individuals in the combined population, wherein the crossover rate is set to 1;
and evenly dividing the population obtained after the crossover into two parts.
Optionally, the mutation operation comprises: applying an annealing algorithm to the genetic algorithm to vary the crossover rate and the mutation rate;
specifically, the algorithm starts with a higher mutation rate and crossover rate, and then gradually decreases the mutation rate and crossover rate as the number of iterations increases.
Optionally, the determining an optimal solution by using an improved double-population genetic algorithm further includes:
calculating the fitness value of individuals in the population;
the calculating of the fitness value of the individuals in the population specifically comprises:
and putting each individual in the population into the deep neural network model, and calculating the AUC of the model to be used as the fitness value of the population individual.
Optionally, the determining an optimal solution by using an improved double-population genetic algorithm further includes:
looking up each individual in the population in a fitness hash table;
judging whether the fitness value of the population individual exists in a fitness hash table or not;
if so, judging whether the current iteration reaches the maximum iteration number;
if the current iteration reaches the maximum iteration number, the iteration is ended to obtain the highest fitness value, and the individual corresponding to the highest fitness value is the optimal solution;
if the current iteration does not reach the maximum iteration number, the combined population is used as the next-generation initial population, and the division, selection, crossover and mutation operations are carried out again until the maximum iteration number is reached;
and if the fitness value of the population individual does not exist in the fitness hash table, putting the individual into the deep neural network model, and calculating the fitness value of the individual.
The invention also provides a controller for executing the deep learning intrusion detection method based on the double-population genetic algorithm in the industrial control network.
By adopting the technical scheme, the intrusion detection method realizes the prediction of whether the intrusion behavior exists in the industrial control network by adopting a novel industrial control network intrusion detection model constructed by an improved double-population genetic algorithm. The model combines a double-population genetic algorithm, an annealing algorithm, a selection strategy based on population communication, a Hash dictionary storage strategy and an elite strategy, organically integrates the functions of various algorithms and optimization strategies, and further obtains an improved deep neural network model (a novel industrial control network intrusion detection model).
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Fig. 1 is a schematic flow chart provided by an embodiment of a deep learning intrusion detection method based on a dual population genetic algorithm in an industrial control network according to the present invention.
As shown in fig. 1, the deep learning intrusion detection method based on dual population genetic algorithms in an industrial control network according to this embodiment includes:
S11: reading data;
S12: preprocessing the data;
Further, the preprocessing the data includes:
selecting data characteristics to determine a data set;
dividing a training set, a verification set and a test set for a data set;
carrying out Min-Max normalization or Z-Score normalization processing on the divided data set;
and carrying out One-Hot coding on the labels of the various processed data sets.
S13: constructing a novel industrial control network intrusion detection model by using an improved double-population genetic algorithm;
Further, constructing a novel industrial control network intrusion detection model includes:
determining an optimal solution by adopting an improved double-population genetic algorithm;
and putting the optimal solution into a deep neural network model to obtain a novel industrial control network intrusion detection model.
S14: and predicting whether the industrial control network has intrusion behavior by using the novel industrial control network intrusion detection model so as to obtain a prediction result.
The intrusion detection method provided by this embodiment predicts whether intrusion behavior exists in the industrial control network through the constructed novel industrial control network intrusion detection model. The model combines a double-population genetic algorithm, an annealing algorithm, a selection strategy based on population communication, a hash dictionary storage strategy and an elitism strategy, organically integrates the functions of these algorithms and optimization strategies, and thereby obtains an improved deep neural network model.
Fig. 2 is a schematic flow chart provided by a second embodiment of the deep learning intrusion detection method based on the double population genetic algorithm in the industrial control network according to the present invention.
As shown in fig. 2, the method for detecting deep learning intrusion based on dual population genetic algorithms in an industrial control network according to this embodiment includes:
S201: reading data;
S202: preprocessing the data;
S203: randomly generating an initial population;
S204: putting each individual in the population into a deep neural network model, and calculating the AUC of the model to be used as the fitness value of the population individual;
S205: dividing the population into two sub-populations;
S206: selecting the elite individuals of the two sub-populations respectively through elitism, and removing the elite individuals from the two sub-populations respectively;
S207: dividing the two sub-populations with the elite individuals removed respectively to obtain communication individuals and populations for selection; selecting from the two sub-populations based on the selection strategy based on population communication;
S208: performing the population crossover operation;
S209: updating the crossover rate;
S210: performing the population mutation operation;
S211: updating the mutation rate;
S212: combining the elite individuals, the communication individuals and the mutated individuals of the two sub-populations respectively;
S213: looking up each individual in the population in a fitness hash table;
S214: judging whether the fitness value of the population individual exists in the fitness hash table or not;
S215: if so, judging whether the current iteration reaches the maximum iteration number;
S216: if the current iteration reaches the maximum iteration number, the iteration is ended to obtain the highest fitness value, and the individual corresponding to the highest fitness value is the optimal solution;
otherwise, the merged population is used as the next-generation initial population, and the method returns to step S205 to carry out the division, selection, crossover and mutation operations again until the maximum iteration number is reached;
S217: if the fitness value of the population individual does not exist in the fitness hash table, putting the individual into the deep neural network model, calculating the fitness value of the individual, judging whether the current iteration reaches the maximum iteration number, and executing step S216;
S218: putting the optimal solution into the deep neural network model to obtain a novel industrial control network intrusion detection model;
S219: and predicting whether the industrial control network has intrusion behavior by using the novel industrial control network intrusion detection model to obtain a prediction result.
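The iteration of steps S203 to S217 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the fitness function is a toy stand-in for the AUC of the deep neural network, and the function name `evolve`, the numbers of elite and communication individuals, the starting rates and the multiplicative cooling rule are all hypothetical choices.

```python
import random

def evolve(fitness_fn, n_bits=58, pop_size=20, max_gen=10,
           n_elite=1, n_comm=2, seed=0):
    """Toy double-population GA; returns (best, best_fitness, history)."""
    rng = random.Random(seed)
    table = {}  # fitness hash table: chromosome -> fitness value

    def fit(ind):
        if ind not in table:          # evaluate only chromosomes not seen before
            table[ind] = fitness_fn(ind)
        return table[ind]

    def tournament(pool):             # best of two randomly drawn individuals
        return max(rng.sample(pool, 2), key=fit)

    def roulette(pool):               # fitness-proportional draw
        return rng.choices(pool, weights=[fit(i) + 1e-9 for i in pool], k=1)[0]

    def cross(a, b):                  # single-point crossover
        p = rng.randrange(1, n_bits)
        return a[:p] + b[p:], b[:p] + a[p:]

    def mutate(ind, rate):            # bit-flip mutation
        return tuple(g ^ 1 if rng.random() < rate else g for g in ind)

    pop = [tuple(rng.randint(0, 1) for _ in range(n_bits))
           for _ in range(pop_size)]
    cx_rate, mut_rate, cooling = 0.9, 0.1, 0.1   # assumed values
    history = []
    for _ in range(max_gen):
        rng.shuffle(pop)
        half = len(pop) // 2
        comm, parts = [], []
        # Sub-population 1 uses tournament selection, sub-population 2 roulette.
        for sub, select in ((pop[:half], tournament), (pop[half:], roulette)):
            ranked = sorted(sub, key=fit, reverse=True)
            elites, rest = ranked[:n_elite], ranked[n_elite:]
            rng.shuffle(rest)
            comm.append(rest[:n_comm])           # communication individuals
            pool = rest[n_comm:]                 # population for selection
            kids = []
            while len(kids) < len(pool):
                a, b = select(pool), select(pool)
                if rng.random() < cx_rate:
                    a, b = cross(a, b)
                kids += [mutate(a, mut_rate), mutate(b, mut_rate)]
            parts.append(elites + kids[:len(pool)])
        # Communication individuals are pooled and crossed with rate 1.
        mixed = comm[0] + comm[1]
        rng.shuffle(mixed)
        crossed = []
        for i in range(0, len(mixed) - 1, 2):
            crossed += cross(mixed[i], mixed[i + 1])
        mid = len(crossed) // 2
        pop = parts[0] + crossed[:mid] + parts[1] + crossed[mid:]
        history.append(max(fit(i) for i in pop))
        cx_rate *= 1 - cooling                   # cool both rates each generation
        mut_rate *= 1 - cooling
    best = max(pop, key=fit)
    return best, fit(best), history
```

Calling `evolve(sum)` maximizes the number of 1 bits in a 58-bit chromosome. Because the elite individuals are copied unchanged into every merged population, the best fitness recorded in `history` never decreases from one generation to the next.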
When the method described in this embodiment is performed, the first step is to enter the data reading and preprocessing module. Firstly, the data features are selected in a simple manner, and uninformative features, or features that have little influence on the data according to their actual meaning, are deleted so as to save algorithm overhead; secondly, the data set is divided into a training set, a verification set and a test set; then Min-Max normalization or Z-Score normalization is carried out on the divided data sets, the choice between the two being determined by the individual in the population; and finally, One-Hot coding is carried out on the labels of the divided data sets.
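The preprocessing steps above can be sketched with NumPy as follows. This is an illustrative sketch only: the function names, the 60/20/20 split ratio, and the choice to compute the normalization statistics on the training set alone are assumptions that this embodiment does not specify.

```python
import numpy as np

def split(X, y, frac=(0.6, 0.2), seed=0):
    """Shuffle and divide into training, verification and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    a = int(frac[0] * len(X))
    b = a + int(frac[1] * len(X))
    parts = (idx[:a], idx[a:b], idx[b:])
    return tuple((X[p], y[p]) for p in parts)

def min_max(train, *others):
    """Min-Max normalization using the training-set range (assumption)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return [(x - lo) / span for x in (train, *others)]

def z_score(train, *others):
    """Z-Score normalization using the training-set mean and deviation."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)           # avoid division by zero
    return [(x - mu) / sd for x in (train, *others)]

def one_hot(labels, n_classes):
    """One-Hot coding of integer class labels."""
    return np.eye(n_classes)[labels]
```

In a full pipeline, a gene of the chromosome could select between `min_max` and `z_score`, which is one way to read the statement that the normalization mode is determined by the individual.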
The second step is to enter the deep neural network module. The deep neural network module is applied in two places: the calculation of the fitness values of individuals in the genetic algorithm population, and the final model training, verification and testing after the final parameters are obtained. The first application, described here, is the calculation of the fitness value of an individual in the genetic algorithm population. Specifically, the deep neural network adopts the error backpropagation algorithm: for each training sample, an input example is first provided to the input neurons, and the signal is propagated forward layer by layer until an output result is generated; the error of the output layer is then calculated and propagated backward to the hidden-layer neurons; finally, the weights and biases are adjusted according to the errors of the hidden-layer neurons, and the process is repeated until a termination condition is reached.
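The forward pass, backward pass and parameter update described above can be illustrated with a single hidden layer. The sketch below assumes sigmoid activations, a squared-error loss, per-sample updates and a fixed number of epochs as the termination condition; none of these choices, nor the function names, are specified by this embodiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, n_out, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.normal(0, 0.5, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0, 0.5, (n_hidden, n_out)), np.zeros(n_out)]

def train_step(params, x, y, lr=0.5):
    """One backpropagation update for a single training sample."""
    W1, b1, W2, b2 = params
    h = sigmoid(x @ W1 + b1)              # forward: input -> hidden
    out = sigmoid(h @ W2 + b2)            # forward: hidden -> output
    err = out - y                         # output-layer error
    d_out = err * out * (1 - out)         # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # error propagated back to hidden layer
    W2 -= lr * np.outer(h, d_out)         # in-place weight and bias updates
    b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_h)
    b1 -= lr * d_h
    return 0.5 * float(np.sum(err ** 2))

def train(params, X, Y, epochs=500):
    """Repeat the per-sample updates; returns the summed loss of each epoch."""
    losses = []
    for _ in range(epochs):
        losses.append(sum(train_step(params, x, y) for x, y in zip(X, Y)))
    return losses
```

The hidden-layer gradient is computed before the output weights are updated, so the backward pass uses the same weights as the forward pass.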
The third step is to enter the double-population genetic algorithm module. The double-population genetic algorithm module is an important and innovative part of the model. Although the genetic algorithm performs well, a single population selection mode still leaves considerable room for improvement. Therefore, a double-population genetic algorithm framework is provided and improved with various optimization algorithms and optimization strategies, so that a high search speed and an accurate optimal solution are maintained even when the solution space is large.
The fourth step is to enter the second application of the deep neural network: the optimal solution is put into the deep neural network model, which is trained, verified and used for prediction on the data; the indexes of the intrusion detection model are finally obtained, analyzed and evaluated.
The key point is that the intrusion detection model based on the deep neural network is constructed automatically and efficiently by means of the genetic algorithm. The quality of the genetic algorithm directly determines the efficiency and accuracy of the model, and the double-population genetic algorithm is therefore explained in detail below.
This embodiment innovates on and optimizes the genetic algorithm. First, the traditional genetic algorithm is abandoned in favor of a double-population genetic algorithm, which enriches the population individuals by, among other things, increasing the number of populations. In addition, the double-population genetic algorithm is itself improved: a selection strategy based on population communication, an elitism strategy and hash storage of fitness values are used as optimization strategies, and a simulated annealing algorithm is used as an optimization algorithm. Combining these within the double-population genetic algorithm yields a novel comprehensive double-population genetic algorithm framework that replaces the conventional algorithm.
The double-population genetic algorithm begins by generating an initial series of chromosomes, typically at random. Subsequently, the algorithm divides the population into two sub-populations, each of which gradually evolves toward an optimal solution through a combination of operations similar to the natural evolution process, such as selection, crossover and mutation. During the evolution, the solutions produced need to be evaluated according to the fitness function. When the number of generations reaches a set value, or a satisfactory fitness level is reached, the algorithm terminates. The implementation flow is shown in fig. 3.
In the framework of the double-population genetic algorithm, the species diversity and the global search capability of the genetic algorithm are enhanced by performing the selection, crossover and mutation operations on two populations respectively. In the traditional double-population genetic algorithm, however, the operators applied to the two populations largely overlap in what they do; to avoid this problem, a selection strategy based on population communication is adopted.
The selection strategy based on population communication is a very effective innovation for the double-population algorithm. Fusing this strategy into the double-population algorithm overcomes a defect of the traditional double-population genetic algorithm, namely its single selection strategy, and increases both the probability of obtaining high-quality individuals and the average score of the whole population. Some individuals of one population may enter the other population and breed offspring in it, which brings new genes to the latter population. Because of the uncertainty of the genes, this can have a great influence on that population, that is, the genes of the population that receives new individuals can become superior to those of the original two populations. Before the strategy based on population communication is introduced, two necessary algorithms need to be understood, namely roulette selection and tournament selection.
The roulette selection method calculates, from the fitness value of each individual, the probability that the individual appears in the offspring, and randomly selects individuals according to these probabilities to form the offspring population; when a maximization problem is solved, the fitness value can therefore be used directly for selection. The tournament selection method takes a certain number of individuals from the population each time, selects the best one of them to enter the offspring population, and repeats the operation until the size of the new population reaches the size of the original population. On this basis, the present embodiment introduces an improved algorithm that combines the roulette algorithm and the tournament algorithm, namely the selection strategy based on population communication. In this strategy, one part of the individuals is selected by roulette selection and the other part by tournament selection; in addition, the two populations communicate with each other, that is, a small number of individuals from each population enter the other. Every generation proceeds in this way until the iterations are finished.
The key point of the model is to implement an optimized double-population genetic algorithm to obtain the optimal parameters for constructing the DNN model. The traditional genetic algorithm is replaced by the double-population genetic algorithm, which is optimized with the selection strategy based on population communication, and hash storage of fitness values, the simulated annealing algorithm and elitism are integrated into one framework, so that the overall model performs better. The implementation process of the optimized double-population algorithm is described in detail below.
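The two selection methods described above can be sketched as follows. The function names, the tournament size `k` and the use of selection with replacement are illustrative assumptions; the roulette version assumes non-negative fitness values with a positive sum, which holds when AUC is used as the fitness value.

```python
import random

def roulette_select(pop, fitness, n, rng):
    """Draw n individuals with probability proportional to fitness."""
    return rng.choices(pop, weights=fitness, k=n)

def tournament_select(pop, fitness, n, rng, k=3):
    """Repeat n times: draw k distinct individuals and keep the fittest."""
    chosen = []
    for _ in range(n):
        contenders = rng.sample(range(len(pop)), k)
        best = max(contenders, key=lambda i: fitness[i])
        chosen.append(pop[best])
    return chosen
```

With `k` equal to the population size the tournament always returns the fittest individual, while smaller values of `k` weaken the selection pressure.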
(1) The dual population genetic algorithm process begins by randomly generating N individuals whose chromosomes represent potential optimization solutions, each chromosome being a binary string of 58 bits, each chromosome being a possible combination of the aforementioned relevant parameters. Each parameter can be considered as a gene in the chromosome, as shown in table 1.
TABLE 1 chromosomal coding
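Random generation of the 58-bit chromosomes can be sketched as follows. The helper `decode_gene` only shows how a slice of the bit string could be read as an integer; the start position and width passed to it are hypothetical arguments supplied by the caller and do not reproduce the gene layout of Table 1.

```python
import random

CHROMOSOME_BITS = 58  # chromosome length stated in this embodiment

def random_population(n, rng, n_bits=CHROMOSOME_BITS):
    """Generate n random binary chromosomes as tuples of 0/1."""
    return [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(n)]

def bits_to_int(bits):
    """Interpret a bit sequence as an unsigned integer, most significant first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def decode_gene(chromosome, start, width):
    """Read one gene; start and width are caller-supplied (hypothetical layout)."""
    return bits_to_int(chromosome[start:start + width])
```

Representing chromosomes as tuples makes them hashable, so they can be used directly as keys of a fitness hash table.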
(2) The initial population is divided according to the double-population idea and combined with elitism. The idea behind elitism is that the optimal solution obtained in some intermediate step may be lost when crossover and mutation create a new generation. Therefore, when a new generation is generated, the current optimal solution is copied into the new generation unchanged, and this is done for every subsequent generation. Elitism can greatly increase the computation speed because it prevents excellent solutions that have already been found from being lost.
The initial population is divided into two sub-populations 1 and 2 and, at this point, combined with elitism: the several best-performing individuals in sub-populations 1 and 2 are selected as elite individuals and removed from sub-populations 1 and 2 respectively. Sub-population 1 with the elite individuals removed is recorded as population 1, and sub-population 2 with the elite individuals removed is recorded as population 2. The implementation process is shown in fig. 4. The elite individuals are temporarily stored for subsequent use, and populations 1 and 2 undergo the subsequent operations.
(3) The selection strategy based on population communication is implemented. This strategy combines a roulette selection strategy, a tournament selection strategy and communication of individuals between the populations, and performs better than roulette selection alone. Roulette selection, tournament selection and the selection method based on population communication have been described above and are not described again here. Population 1 employs the tournament selection strategy and population 2 employs the roulette selection strategy; the implementation is shown in fig. 5.
Next, the individuals cross_1 and cross_2 used for population communication are crossed. Firstly, all of these individuals are combined into one population, which is shuffled and randomly crossed; the crossover rate at this point is set to 1, which guarantees that all of the individuals are crossed. The crossed population is then evenly divided into two parts for the subsequent population merging. This process is illustrated in fig. 6.
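Dividing the population and setting aside the elite individuals can be sketched as follows. The function names, the even split and the number of elite individuals are assumptions made for illustration; `fitness_of` stands for whatever fitness lookup is used.

```python
def split_population(population):
    """Divide the population into two sub-populations of (nearly) equal size."""
    half = len(population) // 2
    return population[:half], population[half:]

def remove_elites(sub_population, fitness_of, n_elite):
    """Return (elite individuals, remaining population) for one sub-population."""
    ranked = sorted(sub_population, key=fitness_of, reverse=True)
    return ranked[:n_elite], ranked[n_elite:]
```

For example, with fitness equal to the number of 1 bits, `remove_elites([(0, 0), (1, 1), (1, 0)], sum, 1)` sets aside `(1, 1)` and leaves the other two individuals for the subsequent operations.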
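The crossing of the communication individuals can be sketched as follows. The sketch assumes single-point crossover and random pairing after shuffling; the handling of an odd number of individuals (the last one is passed through unchanged) is an additional assumption not covered by this embodiment.

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at one random point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def communicate(cross_1, cross_2, rng):
    """Pool both groups, cross every pair (rate 1), and split into two halves."""
    merged = list(cross_1) + list(cross_2)
    rng.shuffle(merged)
    children = []
    for i in range(0, len(merged) - 1, 2):
        children.extend(single_point_crossover(merged[i], merged[i + 1], rng))
    if len(merged) % 2:               # assumption: an unpaired individual is kept
        children.append(merged[-1])
    half = len(children) // 2
    return children[:half], children[half:]
```

Because the pooled individuals are shuffled before pairing, offspring returned to either sub-population may carry genes from both, which is the intended effect of the communication.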
(4) The simulated annealing algorithm is applied to the population crossover and mutation of the genetic algorithm. The algorithm begins with a high mutation rate and crossover rate, which then gradually decrease as the algorithm iterates. The initially high mutation rate and crossover rate force the genetic algorithm to search for the optimal solution in a larger search space, so as to avoid becoming trapped in locally optimal solutions. In this embodiment, a temperature variable Temp is introduced, and the process is realized by setting a cooling coefficient CoolingRate. At the end of each iteration of the genetic algorithm, the temperature is cooled slightly, thereby reducing the crossover rate and mutation rate used in the next round of the genetic algorithm. Here Temp represents the temperature variable introduced in the simulated annealing algorithm, and CoolingRate is the cooling coefficient that controls the cooling process. The crossover operator adopts single-point crossover, that is, only one crossover point is used in each crossover; the mutation operator adopts bit flipping.
(5) A merging operation is required after the crossover and mutation operations are finished.
The operations of accessing, storing and querying the hash table are performed in every generation of the double populations, both after the initial population is generated and after the selection, crossover and mutation operations. For each individual, if its fitness value already exists in the hash table, the value is read from the hash table; otherwise the chromosome is used to create an instance of the intrusion detection algorithm based on the deep neural network, and the fitness value is calculated from it. Through continuous evolution by the aforementioned genetic operations, the genetic algorithm tends toward the global optimum, and when the termination condition is met, the best chromosome is selected as the final result. In each generation, once the next-generation individuals of every sub-population have been obtained, the elite individuals, the individuals used for communication and the individuals produced by the genetic operations are combined into the new sub-population of the next generation; fig. 7 shows the combination that yields new sub-population 1, and new sub-population 2 is obtained in the same way. After the set number of generations is reached, the double-population algorithm ends. The optimal solution is put into the deep neural network model to obtain the novel industrial control network intrusion detection model; finally, the novel industrial control network intrusion detection model is used to predict whether intrusion behavior exists in the industrial control network, and a prediction result is obtained.
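One possible reading of the cooling process is sketched below. This embodiment names Temp and CoolingRate but does not give the update rule, so the multiplicative rule Temp = Temp * (1 - CoolingRate) and the scaling of both rates by Temp are assumptions, as are the function names.

```python
import random

def annealed_rates(crossover_rate, mutation_rate, cooling_rate, generations):
    """Yield (crossover rate, mutation rate) per generation as Temp cools."""
    temp = 1.0                          # Temp: starts hot (assumed initial value)
    for _ in range(generations):
        yield crossover_rate * temp, mutation_rate * temp
        temp *= 1.0 - cooling_rate      # assumed cooling rule using CoolingRate

def bit_flip(chromosome, rate, rng):
    """Bit-flip mutation: each bit is inverted independently with probability rate."""
    return tuple(g ^ 1 if rng.random() < rate else g for g in chromosome)
```

With a cooling coefficient of 0.1, for instance, both rates shrink by ten percent per generation, so early generations explore broadly and later generations change the population less.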
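The fitness hash table can be sketched as a dictionary keyed by the chromosome. The class name `FitnessCache` and the evaluation counter are illustrative; `evaluate` stands for the expensive step of building and scoring a deep neural network from the chromosome.

```python
class FitnessCache:
    """Fitness hash table: evaluates each distinct chromosome at most once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate     # expensive fitness function (e.g. DNN AUC)
        self._table = {}              # chromosome -> fitness value
        self.evaluations = 0          # number of real evaluations performed

    def __call__(self, chromosome):
        key = tuple(chromosome)       # tuples are hashable, lists are not
        if key not in self._table:    # miss: compute and store the fitness
            self._table[key] = self._evaluate(key)
            self.evaluations += 1
        return self._table[key]       # hit: reuse the stored fitness
```

Since elite individuals and unchanged offspring reappear across generations, every repeated chromosome saves one full training run of the network.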
In order to demonstrate the superior performance of the method described in this embodiment compared with conventional genetic algorithms, the following experimental information is provided.
The experimental indexes are as follows. As the fitness evaluation index, the AUC is selected as the fitness function of the genetic algorithm in the experiment. The AUC is a commonly used performance index for network intrusion detection algorithms; it represents the ability of the detection algorithm to avoid misclassifying network data packets, and strikes a good balance between the false-negative rate and the false-positive rate. For the evaluation of the final model, accuracy (Accuracy), precision (Precision), recall (Recall), detection rate (DR), F-Score, TPR, FPR and the like are also used.
Mississippi data set: this data set records network transactions between a remote terminal unit and a master control unit in a SCADA natural gas pipeline system at Mississippi State University. The new data set was collected using a novel framework for simulating actual attacks and operator activities on a natural gas pipeline. The data set contains three separate categories of features: network information, payload information, and labels. CICIDS2017 data set: this data set contains benign traffic and recent common attacks, and resembles real-world data. It also includes the results of network traffic analysis using CICFlowMeter, with flows labeled based on timestamps, source and destination IPs, source and destination ports, protocols, and attacks.
The experimental results of the conventional genetic algorithm and the double-population genetic algorithm are shown in fig. 8. In the figure, the solid line is the test result of the Mississippi State University natural gas pipeline data on the traditional genetic algorithm framework, and the dotted line is the test result of the same data on the comprehensive framework adopting the double-population genetic algorithm. As can be seen from fig. 8, for the Mississippi data set, the whole curve of the conventional genetic algorithm rises smoothly, and the AUC mean value is about 0.9496 after the set maximum of 10 generations is reached; the whole curve of the double-population genetic algorithm rises quickly, and the AUC mean value is about 0.9594 after the set maximum of 10 generations is reached.
After the 2014 Mississippi State University natural gas pipeline experimental data set and the CICIDS2017 data set are run through the comprehensive double-population genetic algorithm framework, the optimal parameters for constructing the deep neural network model are obtained and used to construct the final deep neural network. The experimental results are listed in Table 2. The results include AUC, ACC, DR, FAR, Precision, Recall, F-Score, TNR and FNR, and the analysis shows that all of the indexes are excellent.
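The AUC and the other indexes can be computed as sketched below for the binary case. The AUC is computed here as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as one half; this pairwise formulation, the function names and the treatment of the detection rate as equal to recall (TPR) are illustrative choices rather than the procedure used in the experiment.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum())
                 / (len(pos) * len(neg)))

def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall (DR/TPR), FPR and F-Score for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    def ratio(a, b):                            # guard against empty classes
        return a / b if b else 0.0

    precision, recall = ratio(tp, tp + fp), ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,                       # also the detection rate / TPR
        "fpr": ratio(fp, fp + tn),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ranked correctly, so `auc` returns 0.75.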
TABLE 2 Final test results for each model
The double-population genetic algorithm of this embodiment injects new individuals into a population that is gradually becoming uniform. These newly injected individuals are produced by crossing excellent individuals that have already undergone a certain amount of evolutionary selection, which causes the new population to converge more efficiently toward an optimal solution. Among the optimization measures, this embodiment introduces the selection strategy based on population communication, which combines the roulette selection algorithm and the tournament selection algorithm. The roulette selection strategy converges slowly; it retains high-quality individuals with a large probability, but discards them with a small probability and easily falls into a locally optimal solution. The tournament selection strategy converges quickly and also easily falls into a locally optimal solution, but its overall performance is better than that of roulette selection. By combining the two, the selection strategy based on population communication retains the advantages of both algorithms and converges at a moderate speed, but still has a small probability of becoming trapped in a locally optimal solution. Here the simulated annealing algorithm plays an important role: in the early stage, crossover and mutation are performed at higher rates, so that more new individuals are generated and the probability of producing a better solution increases; in the later stage, when overall performance is already high, the crossover and mutation rates are reduced, which avoids generating redundant low-quality individuals. Elitism, meanwhile, always keeps the optimal solution of each generation and ensures that the best-quality individuals are propagated. The DNN anomaly detection model based on the improved double-population genetic algorithm framework (namely, the novel industrial control network intrusion detection model) therefore reduces iteration time and improves algorithm accuracy.
The invention also provides a controller, which is used for executing the deep learning intrusion detection method based on the double-population genetic algorithm in the industrial control network shown in fig. 1 or fig. 2.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.