CN111027697A - Genetic algorithm packaged feature selection power grid intrusion detection method - Google Patents

Info

Publication number
CN111027697A
Authority
CN
China
Prior art keywords
learner
population
feature
algorithm
equation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911256743.4A
Other languages
Chinese (zh)
Inventor
韩永明
曹原
耿志强
汪鹏
欧阳智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Chemical Technology
Guizhou University
Original Assignee
Beijing University of Chemical Technology
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Chemical Technology and Guizhou University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply

Abstract

The invention provides a genetic algorithm packaged feature selection power grid intrusion detection method, which comprises the following steps: performing binary coding on the feature population according to a genetic algorithm; initializing the binary-coded feature population; using a Las Vegas Wrapper (LVW) feature selection algorithm to judge whether the initialized feature population reaches a preset termination condition; training a base learner into a strong learner using the AdaBoost algorithm; evaluating the strong learner using a cross-validation method to obtain an optimal learner; and updating the global optimum according to the optimal learner. The intrusion detection method is built on a training data set that consists of monitoring data from different scenarios of the power system and contains both normal and abnormal data. The model obtained through training is used to monitor the state of the power system, so that intrusion behavior is detected, an alarm is raised, and the loss caused by intrusion is minimized.

Description

Genetic algorithm packaged feature selection power grid intrusion detection method
Technical Field
The invention relates to the technical field of power grid detection, in particular to a genetic algorithm packaged feature selection power grid intrusion detection method.
Background
Intrusion detection collects and analyzes information from several key points in a computer network or system to determine whether there are security policy violations or signs of attack. Industrial control network intrusion detection is the practical application of intrusion detection technology to industrial control. With the development of industrial control systems and computer networks, industrial control networks have been combined with computer networks to make industrial production processes easier to control; the accompanying problem is that industrial control networks lack corresponding security protection measures and mechanisms. According to a report issued by Kaspersky Lab in 2017, attacks on industrial control systems in China have increased, which shows the urgency of the industrial control network security problem in China. The power system is one of the most important industrial control systems, and attacks on power grids have occurred in recent years, for example the attack on the Ukrainian power grid in 2015 and the attack on the Russian power grid in 2017, so it is necessary to strengthen the security protection capability of the power grid.
The emergence of smart grids has promoted the development of intrusion detection technology. Intrusion detection for smart grids can be divided into two forms: misuse-based and anomaly-based. A misuse-based method starts from intrusion behavior: known intrusion behaviors are converted into patterns, and during detection the current behavior is matched against these patterns to judge whether an intrusion has occurred. Such a method cannot find new intrusion behavior. An anomaly-based method, by contrast, establishes the normal behavior track of the system and regards any behavior that deviates from the normal track as an intrusion. Anomaly-based intrusion detection is often combined with machine learning algorithms such as naive Bayes, support vector machines, decision trees, random forests and association rule mining. Deep learning methods, such as recurrent neural networks and stacked autoencoders, have also been applied to smart grid intrusion detection.
Disclosure of Invention
In order to solve the limitations and defects in the prior art, the invention provides a genetic algorithm packaged feature selection power grid intrusion detection method, which comprises the following steps:
step S1, performing binary coding on the feature population according to the genetic algorithm, wherein each feature is represented by 1 or 0, 1 indicating that the feature is selected and 0 indicating that the feature is not selected;
step S2, initializing the binary-coded feature population;
step S3, judging whether the initialized feature population reaches a preset termination condition by using a Las Vegas Wrapper (LVW) feature selection algorithm;
if the initialized feature population does not reach the preset termination condition, executing step S4, and if the initialized feature population reaches the preset termination condition, outputting the training model;
step S4, training a base learner into a strong learner by using the AdaBoost algorithm;
step S5, evaluating the strong learner by using a cross-validation method to obtain an optimal learner;
step S6, updating the global optimum according to the optimal learner;
step S7, forming a new feature population through selection, crossover and mutation;
after step S7 is completed, returning to step S3.
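The control flow of steps S1 to S7 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `ga_lvw`, the callables `fitness` and `breed`, the iteration-budget termination condition, and the toy fitness used in the demo are all assumptions introduced here. In the method itself, the fitness of an individual would come from training and cross-validating an AdaBoost model on the selected features.

```python
import random

def ga_lvw(n_features, fitness, breed, pop_size=10, max_iters=500, seed=0):
    """Control flow of steps S1-S7 with pluggable fitness and breeding (hypothetical API)."""
    rng = random.Random(seed)
    # S1/S2: each individual is a binary mask over the features (1 = selected).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")
    # S3: the termination condition assumed here is a fixed iteration budget.
    for _ in range(max_iters):
        # S4/S5: train and evaluate one model per individual.
        scores = [fitness(mask) for mask in population]
        # S6: update the global optimum.
        top = max(range(pop_size), key=scores.__getitem__)
        if scores[top] > best_score:
            best_mask, best_score = list(population[top]), scores[top]
        # S7: selection, crossover and mutation produce the next population.
        population = breed(population, scores, rng)
    return best_mask, best_score

def toy_breed(population, scores, rng):
    """Stand-in for S7: copy the best individual and flip one random bit per child."""
    best = population[max(range(len(population)), key=scores.__getitem__)]
    children = [list(best)]
    while len(children) < len(population):
        child = list(best)
        child[rng.randrange(len(child))] ^= 1
        children.append(child)
    return children

def toy_fitness(mask):
    """Toy objective: the first four features help, the last four hurt."""
    return sum(mask[:4]) - sum(mask[4:])

mask, score = ga_lvw(8, toy_fitness, toy_breed, pop_size=6, max_iters=50)
print(mask, score)
```

Keeping `fitness` and `breed` as parameters separates the wrapper loop from the learner and from the genetic operators, so either can be swapped without touching the loop.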
Optionally, the step of evaluating the strong learner by using a cross-validation method to obtain an optimal learner includes:
calculating the fitness value of the individual according to the fitness function, wherein the fitness value is used for evaluating the quality of the individual;
using a linear combination of base learners to minimize the loss function, the calculation formula being as follows:
Figure BDA0002310481080000021
$\mathrm{loss} = E_{x \sim D}\left[e^{-f(x)L(x)}\right]$ (2)
wherein f(x) represents the true value, and for the binary classification problem, the values of f(x) are 1 and -1;
taking the partial derivative of the loss function with respect to L(x) gives the following formula:
Figure BDA0002310481080000022
let equation (3) equal zero, the following equation is obtained:
Figure BDA0002310481080000031
Figure BDA0002310481080000032
generating a base learner according to the data distribution $D_t$ and determining the corresponding $\beta_t$ so that the loss function is minimized, the calculation formula being as follows:
Figure BDA0002310481080000033
the following equation is obtained by simplifying equation (6):
Figure BDA0002310481080000034
wherein $\theta_t$ represents the probability that the classification result of the base classifier $l_t$ is not equal to the true value f(x);
the partial derivative is calculated from equation (7), and the following equation is obtained:
Figure BDA0002310481080000035
let equation (8) equal to 0, the weight update equation for the base classifier is obtained as follows:
Figure BDA0002310481080000036
optionally, the method further includes:
the sample data distribution is adjusted, and the calculation formula is as follows:
Figure BDA0002310481080000037
approximating equation (10) using the Taylor expansion gives the following equation:
Figure BDA0002310481080000038
wherein the Taylor expansion is:
Figure BDA0002310481080000039
the minimization process is performed on equation (12), and the following equation is obtained:
Figure BDA00023104810800000310
let
Figure BDA00023104810800000311
represent the data distribution $D_t$; according to the definition of mathematical expectation, an optimal base learner is obtained:
Figure BDA00023104810800000312
the updated formula for obtaining the data sample distribution is:
Figure BDA0002310481080000041
Optionally, the step of forming a new feature population through selection, crossover and mutation comprises:
generating two individual indexes according to a random competition method, wherein the two individual indexes correspond to two individuals in the population;
comparing the fitness values of the two individuals, and selecting the individual with the larger fitness value as a parent for generating the next-generation population;
randomly selecting one gene position in the two individuals selected as parents and exchanging the genes at that position to form two new individuals, and storing the two new individuals in a new population set.
Optionally, the method further includes:
setting a preset mutation rate;
according to the mutation rate, randomly selecting one bit in the genes of an individual in the new population set and inverting it to form a new individual, and storing the new individual in the new population set.
The invention has the following beneficial effects:
The invention provides a genetic algorithm packaged feature selection power grid intrusion detection method, which comprises the following steps: performing binary coding on the feature population according to a genetic algorithm; initializing the binary-coded feature population; using a Las Vegas Wrapper (LVW) feature selection algorithm to judge whether the initialized feature population reaches a preset termination condition; if the termination condition is not reached, executing the subsequent steps, and if it is reached, outputting the training model; training a base learner into a strong learner using the AdaBoost algorithm; evaluating the strong learner using a cross-validation method to obtain an optimal learner; and updating the global optimum according to the optimal learner. The intrusion detection method is built on a training data set that consists of monitoring data from different scenarios of the power system and contains both normal and abnormal data. The model obtained through training is used to monitor the state of the power system, so that intrusion behavior is detected, an alarm is raised, and the loss caused by intrusion is minimized.
Drawings
Fig. 1 is a flowchart of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a three-bus two-wire transmission system and a hardware configuration thereof according to an embodiment of the present invention.
Fig. 3 is a flowchart of intrusion detection by the genetic algorithm-based Las Vegas Wrapper feature selection algorithm according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the iteration result of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm on the data set data4 according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the iteration result of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm on the data set data2 according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the average classification accuracy on 15 data sets of the power system according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of the average F1 value on 15 data sets of the power system according to an embodiment of the present invention.
Fig. 8 is a schematic diagram comparing the classification accuracy of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm with that of AdaBoost + JRip according to an embodiment of the present invention.
Fig. 9 is a schematic diagram comparing the average F1 value of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm with that of AdaBoost + JRip according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the genetic algorithm-wrapped feature selection power grid intrusion detection method provided by the present invention is described in detail below with reference to the accompanying drawings.
Example one
This embodiment provides a power grid intrusion detection method based on a Genetic Algorithm (GA) Las Vegas Wrapper (LVW) feature selection algorithm, which further improves the accuracy of classifying and identifying each scenario of the power system. The GA-LVW model provided by this embodiment combines the robustness and intelligence of GA with the configurability of LVW, unifies feature selection and model training, and effectively speeds up feature selection. Moreover, the AdaBoost algorithm, an ensemble learner with a solid mathematical foundation, can boost a weak learner into a strong learner. Configuring it under LVW further enhances the classification capability of the base classifier, so that a data-driven power grid intrusion detection system obtains more accurate power grid state information and the occurrence and escalation of security events are avoided.
In this embodiment, a genetic algorithm wrapper selection model (GA-LVW) is first established, with the C4.5 decision tree algorithm embedded in LVW as the base learner of AdaBoost. A framework for applying the model to smart grid intrusion detection is then provided. The model is evaluated on the public three-bus two-wire transmission system data set created by Shengyi Pan, Thomas Morris and Uttam Adhikari in cooperation with Raymond Borges and Justin Beaver of Oak Ridge National Laboratory, where it is compared with a standalone AdaBoost algorithm and with the unmodified LVW algorithm, and finally with the work of Raymond Borges and Justin Beaver.
This embodiment provides a power grid intrusion detection method, namely a Las Vegas Wrapper feature selection algorithm improved with a genetic algorithm (GA-LVW), in which the AdaBoost algorithm is configured into LVW for evaluation. The intrusion detection method is built on a training data set that consists of monitoring data from different scenarios of the power system and contains both normal and abnormal data. The model obtained through training is used to monitor the state of the power system, so that intrusion behavior is detected, an alarm is raised, and the loss caused by intrusion is minimized.
Fig. 1 is a flowchart of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm according to an embodiment of the present invention. The genetic algorithm is a computational model that simulates the natural selection and genetic mechanisms of Darwin's theory of biological evolution; it searches for an optimal solution by simulating the natural evolution process. The genetic algorithm is a heuristic search algorithm, whereas the Las Vegas Wrapper feature selection algorithm is a random search algorithm. Replacing the random search strategy in LVW with GA search can improve the performance of the algorithm. The specific process is as follows:
In this embodiment, the feature population is binary-coded: each feature is represented by 1 or 0, where 1 indicates that the feature is selected and 0 indicates that it is not selected. A feasible solution is therefore a binary string, and the first-generation population of binary strings is generated randomly by a random function. The subsequent steps are repeated until the set termination condition is reached.
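The encoding described above can be illustrated with a short sketch. The helper names `init_population` and `selected_indices` are hypothetical; the 128-bit length and the population of 10 follow the feature count and experiment settings stated elsewhere in this description.

```python
import random

def init_population(n_features, pop_size, rng):
    """First-generation population of random binary strings."""
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]

def selected_indices(mask):
    """Decode an individual: positions holding 1 are the selected features."""
    return [i for i, bit in enumerate(mask) if bit == 1]

rng = random.Random(42)
population = init_population(n_features=128, pop_size=10, rng=rng)
print(len(population), len(population[0]))  # 10 128
print(selected_indices([1, 0, 0, 1, 1]))    # [0, 3, 4]
```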
In this embodiment, the fitness value of each individual is calculated by a fitness function and is used to evaluate the quality of the individual. The design of the fitness function is of paramount importance, because it determines whether the genetic algorithm can converge towards an optimal solution. The fitness function used in this embodiment is the cross-validation accuracy of the AdaBoost-trained model on the validation set; the main purpose of using the AdaBoost algorithm is to further improve the classification accuracy of the base classifier.
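A fitness function of this kind can be sketched as follows, assuming a simple interleaved k-fold split and a pluggable learner. The names `project`, `cv_accuracy` and `fit_predict` are hypothetical, and the 1-nearest-neighbour learner is only a stand-in so the example runs with the standard library; the embodiment uses AdaBoost with C4.5 decision trees instead.

```python
def project(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, bit in zip(row, mask) if bit] for row in rows]

def cv_accuracy(mask, X, y, fit_predict, k=5):
    """Mean k-fold accuracy of a learner restricted to the masked features."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot be evaluated
    Xs = project(X, mask)
    n = len(Xs)
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = fit_predict([Xs[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [Xs[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

def nearest_neighbour(train_X, train_y, test_X):
    """Stand-in learner (1-NN); an assumption, not the AdaBoost + C4.5 model."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [train_y[min(range(len(train_X)), key=lambda j: dist(train_X[j], x))]
            for x in test_X]

# Toy data: column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [0.9, 0.0], [0.2, 3.0],
     [1.1, 2.0], [0.0, 0.5], [1.0, 5.5], [0.1, 2.5], [0.9, 3.5]]
y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
informative = cv_accuracy([1, 0], X, y, nearest_neighbour)
noisy = cv_accuracy([0, 1], X, y, nearest_neighbour)
print(informative, noisy)
```

On this toy data the mask that keeps only the informative column scores higher than the mask that keeps only the noise column, which is the signal the genetic search exploits.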
To promote the base learner to a strong learner, the present embodiment uses a linear combination of base learners to minimize the loss function, which is calculated as follows:
Figure BDA0002310481080000071
$\mathrm{loss} = E_{x \sim D}\left[e^{-f(x)L(x)}\right]$ (2)
wherein f(x) in equation (2) represents the true value, and for the binary classification problem its values are 1 and -1. Taking the partial derivative of the loss function with respect to L(x) gives:
Figure BDA0002310481080000072
let equation (3) equal zero, the following equation is obtained:
Figure BDA0002310481080000073
Figure BDA0002310481080000074
Each base learner $l_t$ is generated from a data distribution $D_t$, and its weight $\beta_t$ must be determined so that the loss function is minimized. The calculation formula is as follows:
Figure BDA0002310481080000075
the following equation is obtained by simplifying equation (6):
Figure BDA0002310481080000076
wherein $\theta_t$ represents the probability that the classification result of the base classifier $l_t$ is not equal to the true value f(x). In this embodiment, taking the partial derivative of equation (7) gives:
Figure BDA0002310481080000077
Setting the partial derivative equal to 0 finally gives the base classifier weight update formula:
Figure BDA0002310481080000078
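As a numerical illustration of this step, the sketch below assumes that the simplified loss takes the standard AdaBoost form $e^{-\beta_t}(1-\theta_t) + e^{\beta_t}\theta_t$ and that the resulting weight is $\beta_t = \frac{1}{2}\ln\frac{1-\theta_t}{\theta_t}$, which is what setting the derivative of that expression to zero yields. The function names are hypothetical.

```python
import math

def base_learner_weight(theta):
    """beta_t = 0.5 * ln((1 - theta_t) / theta_t), valid for 0 < theta_t < 1."""
    return 0.5 * math.log((1.0 - theta) / theta)

def exp_loss(beta, theta):
    """Exponential loss of a learner with error rate theta and weight beta."""
    return math.exp(-beta) * (1.0 - theta) + math.exp(beta) * theta

theta = 0.2
beta = base_learner_weight(theta)
# The closed-form weight is not beaten by nearby values of beta.
print(exp_loss(beta, theta) <= exp_loss(beta - 0.01, theta))  # True
print(exp_loss(beta, theta) <= exp_loss(beta + 0.01, theta))  # True
print(round(beta, 4))  # 0.6931
```

A base classifier with error rate 0.5 receives weight 0, and the weight grows as the error rate falls, so more accurate base classifiers contribute more to the linear combination.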
in this embodiment, the sample data distribution is adjusted, which is also a minimization process, and the calculation formula is as follows:
Figure BDA0002310481080000079
Equation (10) is then approximated using the commonly used Taylor expansion, which gives:
Figure BDA0002310481080000081
wherein the Taylor expansion is:
Figure BDA0002310481080000082
Minimizing the loss is equivalent to maximizing equation (13), which is calculated as follows:
Figure BDA0002310481080000083
Let
Figure BDA0002310481080000084
represent the data distribution $D_t$. From the definition of mathematical expectation, the ideal base learner is obtained:
Figure BDA0002310481080000085
the final data sample distribution updating formula is as follows:
Figure BDA0002310481080000086
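The effect of the distribution update can be illustrated with a small sketch. It assumes the standard AdaBoost reweighting, in which the new weight of a sample is proportional to $D_t(x)\,e^{-f(x)\beta_t l_t(x)}$ and the weights are then normalized to sum to 1. The function name and toy labels are hypothetical.

```python
import math

def update_distribution(weights, y_true, y_pred, beta):
    """Reweight samples proportionally to D_t(x) * exp(-f(x) * beta_t * l_t(x))."""
    raw = [w * math.exp(-f * beta * l)
           for w, f, l in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [r / total for r in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]                 # the second sample is misclassified
theta = 0.25                             # weighted error rate of this base classifier
beta = 0.5 * math.log((1 - theta) / theta)
new_weights = update_distribution(weights, y_true, y_pred, beta)
print([round(w, 4) for w in new_weights])  # [0.1667, 0.5, 0.1667, 0.1667]
```

After the update the misclassified sample carries half of the total weight, so the next base learner is pushed to classify it correctly.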
In this embodiment, selection is a survival-of-the-fittest process: the higher an individual's fitness value, the greater its chance of being selected. This embodiment uses a random competition (tournament) method. Two individual indexes are generated from random numbers, corresponding to two individuals in the population. The two individuals compete, that is, their fitness values are compared, and the individual with the larger fitness value is selected as a parent for generating the next-generation population. An individual with a large fitness value may be selected many times, which ensures that excellent genes are inherited.
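The random competition step can be sketched as a two-way tournament. The function names and the toy population are hypothetical, and breaking ties in favor of the first index is an assumption.

```python
import random

def tournament_select(population, fitnesses, rng):
    """Pick two random indices and return the fitter of the two individuals."""
    i = rng.randrange(len(population))
    j = rng.randrange(len(population))
    winner = i if fitnesses[i] >= fitnesses[j] else j
    return population[winner]

def select_parents(population, fitnesses, n_parents, rng):
    """Repeat the tournament; fit individuals may be chosen more than once."""
    return [tournament_select(population, fitnesses, rng)
            for _ in range(n_parents)]

population = [(0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 0)]
fitnesses = [0.1, 0.9, 0.5, 0.3]
parents = select_parents(population, fitnesses, 200, random.Random(1))
best_count = parents.count((1, 1, 0))
worst_count = parents.count((0, 0, 1))
print(best_count, worst_count)
```

Because the weakest individual only wins a tournament against itself, it appears among the parents far less often than the strongest one, while still keeping a small chance of contributing its genes.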
In this embodiment, crossover randomly selects one gene position in the two individuals chosen as parents and exchanges the genes at that position, forming two new individuals that are stored in a new population set. Mutation helps the search escape local minima and accelerates the convergence of the algorithm. A small value is chosen as the mutation rate, and for some individuals in the new population one randomly chosen bit is inverted. Mutation gives the genetic algorithm a stronger search capability.
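The two operators can be sketched as follows. The sketch reads the crossover as an exchange of a single randomly chosen bit between the two parents, which is one interpretation of the description (a single-point crossover that swaps whole tails would be an alternative). The function names are hypothetical.

```python
import random

def swap_gene(parent_a, parent_b, rng):
    """Exchange the gene at one random position to form two new individuals."""
    pos = rng.randrange(len(parent_a))
    child_a, child_b = list(parent_a), list(parent_b)
    child_a[pos], child_b[pos] = parent_b[pos], parent_a[pos]
    return child_a, child_b

def mutate(individual, mutation_rate, rng):
    """With probability mutation_rate, invert one randomly chosen bit."""
    child = list(individual)
    if rng.random() < mutation_rate:
        pos = rng.randrange(len(child))
        child[pos] = 1 - child[pos]
    return child

rng = random.Random(7)
child_a, child_b = swap_gene([1, 1, 1, 1], [0, 0, 0, 0], rng)
print(child_a, child_b)                      # exactly one position is exchanged
print(mutate([0, 0, 0, 0], 0.1, rng))        # usually unchanged at a rate of 0.1
```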
This embodiment takes as an example the public data set created by Shengyi Pan, Thomas Morris and Uttam Adhikari in cooperation with Raymond Borges and Justin Beaver of Oak Ridge National Laboratory, USA, which contains measurement data related to the behavior of a power transmission system. The data in the data set come from synchrophasor measurements, Snort, a simulated control panel, and relays.
Fig. 2 is a schematic diagram of a three-bus two-wire transmission system and a hardware configuration thereof according to an embodiment of the present invention. As shown in Fig. 2, the three-bus two-wire power transmission system is the source of the data in the data set. In the network diagram, G1 and G2 are generators, and R1, R2, R3 and R4 are relays that can open or close the circuit breakers BR1, BR2, BR3 and BR4. One transmission line connects BR1 and BR2, and the other connects BR3 and BR4. A Real Time Digital Simulator (RTDS) is used to simulate the generators, breakers, loads and transmission lines. The relays are actual physical hardware connected to the RTDS; when a transmission line fails, the relays trip and open the breakers. An operator may also manually issue commands to R1 through R4 to trip circuit breakers BR1 through BR4; this manual override is used when performing maintenance on the lines or other system components. Each relay incorporates a Phasor Measurement Unit (PMU) that measures the transmission line conditions, including frequency, current phasors, voltage phasors and sequence components. These real-time measurements are transmitted to a Phasor Data Concentrator (PDC) for aggregation and then forwarded by a router for recording. The control panel computer simulates the function of an Energy Management System (EMS), through which an operator can manually control the relays for line maintenance. The intrusion detection system Snort runs on a PC to monitor activity in the network and raises an alarm when it detects a remote trip command, but it cannot distinguish between legitimate and illegitimate remote trip commands.
The power system has 37 event scenarios, which can be roughly divided into five types of events: short-circuit fault, line maintenance, remote trip command injection, relay setting change and data injection. Remote trip command injection, relay setting change and data injection are attack events, while line maintenance and short-circuit faults are regarded as normal events. These events are further divided by their position on the transmission line, forming the 37 event scenarios. The data set organizes the event scenarios into a binary classification problem, a three-class problem and a multi-class problem. The binary problem divides the 37 scenarios into attack events and normal events; the three-class problem divides them into attack events, line maintenance events and no events. The data set has 128 features. Each PMU has 29 types of measurements, which are listed in Table 1. The following description mainly takes the three-class problem as an example.
TABLE 1 PMU measurement types
Feature                  Description
PA1:VH - PA3:VH          Phase A to phase C voltage phase angle
PM1:V - PM3:V            Phase A to phase C voltage magnitude
PA4:IH - PA6:IH          Phase A to phase C current phase angle
PM4:I - PM6:I            Phase A to phase C current magnitude
PA7:VH - PA9:VH          Positive-, negative- and zero-sequence voltage phase angle
PM7:V - PM9:V            Positive-, negative- and zero-sequence voltage magnitude
PA10:IH - PA12:IH        Positive-, negative- and zero-sequence current phase angle
PM10:I - PM12:I          Positive-, negative- and zero-sequence current magnitude
F                        Relay frequency
DF                       Rate of change of relay frequency (dF/dt)
PA:Z                     Apparent impedance seen by the relay
PA:ZH                    Apparent impedance angle seen by the relay
S                        Relay status flag
The genetic algorithm-improved wrapper feature selection algorithm provided by this embodiment is applied in the power grid data analysis stage, and the final result can be used to develop an intrusion detection system. The data used for model training are the real-time and historical data generated by the power system, which are stored in a database. Real-time data allow the model to be updated iteratively and to keep learning. The model completes feature selection during training, and the selected features are recorded in a feature library so that professionals can trace problems back to their source and have a basis for solving them. At the same time, each type of system state encountered during training is recorded and stored in a state library. The final model obtained through training and testing is used to classify and identify the data generated by the system in real time.
Fig. 3 is a flowchart of intrusion detection by the genetic algorithm-based Las Vegas Wrapper feature selection algorithm according to an embodiment of the present invention. Real-time data can be handled with a real-time data processing framework. Storm, for example, is a distributed real-time data processing framework with the following advantages: first, programming is simple, and developers only need to focus on the application logic; second, data are processed quickly and efficiently; third, it is scalable and fault-tolerant; and fourth, it copes easily with large data volumes. In Storm, a Spout produces a continuous stream of data, a Bolt contains the specific logic for processing the data, and all Spouts and Bolts together form a complete topology. The Storm framework can therefore be applied to power system intrusion detection: it receives the real-time data generated by the system, processes them, and sends them to the model. An abnormal or intrusion state detected by the model is recorded and compared with the states in the state library. If it is a new abnormal state, it is added to the state library, which completes the update of the state library. At the same time, the system generates an alarm, and an administrator who receives the signal carries out the related operations. This is the whole process of intrusion detection by the intrusion detection system.
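The state-library update at the end of this flow can be sketched in plain Python. This sketch does not use the Storm API; the record iterable only plays the role of a Spout and the classification callable the role of a Bolt. The function names, the `label` field and the choice to raise an alarm on every detected abnormal state are assumptions made for illustration.

```python
def handle_detection(state, state_library, alarms):
    """Record a detected abnormal state; extend the library if the state is new."""
    is_new = state not in state_library
    if is_new:
        state_library.add(state)
    alarms.append((state, is_new))  # assumption: every abnormal state raises an alarm
    return is_new

def process_stream(records, classify, state_library):
    """Spout-like source -> Bolt-like classifier -> state-library update."""
    alarms = []
    for record in records:
        state = classify(record)
        if state != "NoEvents":
            handle_detection(state, state_library, alarms)
    return alarms

# Hypothetical records that already carry the state a trained model would assign.
records = [{"label": "NoEvents"}, {"label": "Attack"},
           {"label": "Natural"}, {"label": "Natural"}]
state_library = {"Attack"}
alarms = process_stream(records, lambda r: r["label"], state_library)
print(alarms)
print(sorted(state_library))
```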
The intrusion detection method provided by the embodiment is established on a training data set, the training data set consists of monitoring data of different scenes of the power system, normal data and abnormal data exist, and a model obtained through training is used for monitoring the state of the power system, so that an intrusion behavior is found, an alarm is given, and the loss caused by intrusion is reduced to the minimum.
The power system data set used in this embodiment is composed of 15 sub-data sets. The data distribution in each sub-data set is substantially the same, the sub-data sets are of comparable size, and all samples are labelled with one of three classes: a normal state (NoEvents), an intrusion state (Attack), and a maintenance state (Natural). Two sets of experiments were performed during the analysis, and the effectiveness of the proposed method was verified by comparing their results. The first set of experiments compared three methods: applying a standalone AdaBoost + C4.5 decision tree to the data without feature selection, configuring the AdaBoost + C4.5 decision tree on the unmodified LVW framework, and configuring the AdaBoost + C4.5 decision tree on GA-LVW. Five-fold cross-validation is used in the experiments: each of the 15 sub-data sets is divided into 5 groups, 4 of which are used for model training and the remaining 1 for testing, so that each data set is tested 5 times, and the average of the 5 tests is used to evaluate the performance of the algorithm.
Fig. 4 is a schematic diagram of the iteration result of the genetic algorithm-based Las Vegas Wrapper feature selection algorithm on the data set data4 according to the first embodiment of the present invention, and Fig. 5 is a schematic diagram of the iteration result of the same algorithm on the data set data2. In the experiment, a model is trained on each data set and training ends after 500 iterations. Each iteration operates on a population of 10 individuals with a mutation rate of 0.1, and the classification accuracy and recall are recorded throughout. The figures show that applying the genetic algorithm to feature selection can improve the classification accuracy of the classifier; because the number of iterations is limited in the experiment, the algorithm does not converge on some data sets. The evaluation indexes are average accuracy and average recall, calculated as follows:
Figure BDA0002310481080000121
Figure BDA0002310481080000122
Figure BDA0002310481080000123
Figure BDA0002310481080000124
Figure BDA0002310481080000125
wherein $P$ represents the average accuracy, $R$ represents the average recall, and the $F_1$ value represents the harmonic mean of $P$ and $R$; $TP_i$ represents the number of samples of class $i$ that are predicted correctly, $FN_i$ represents the number of samples of class $i$ that are predicted incorrectly, and $FP_i$ represents the number of samples of other classes that are predicted as class $i$.
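As an illustration of how these quantities combine, the sketch below computes P, R and F1 from a confusion matrix. It assumes macro-averaging (per-class values averaged over the classes) and treats P as the per-class ratio TP/(TP+FP); the equation images are not reproduced in this text, so the exact form used in the patent may differ. The function name and matrix layout are hypothetical.

```python
def macro_metrics(confusion):
    """Return (P, R, F1) for a square confusion matrix.

    confusion[i][j] is the number of class-i samples predicted as class j.
    Assumption: P and R are plain averages of the per-class values.
    """
    n = len(confusion)
    precisions, recalls = [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                          # class i predicted as something else
        fp = sum(confusion[r][i] for r in range(n)) - tp     # other classes predicted as class i
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / n
    r = sum(recalls) / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0               # harmonic mean of P and R
    return p, r, f1
```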
TABLE 2 comparison of feature selection number to accuracy
Figure BDA0002310481080000126
Table 2 shows the results of feature selection by GA-LVW on these 15 data sets. As the results in the table show, the data contain 128 features in total, and on most data sets the number of selected features falls between 60 and 70. Compared with the results obtained without feature selection, the classification accuracy is improved while the dimensionality is reduced. On the data12 data set, however, the classification accuracy after feature selection is lower than without feature selection; the likely reason is that a maximum number of iterations is set in the experiment, so the genetic algorithm terminates even if the optimal solution has not been reached. The genetic algorithm is a heuristic search algorithm, but it is also a random search, and a poor initial feature combination may slow down convergence.
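The search loop discussed above can be sketched as follows, using the reported settings (population of 10, mutation rate 0.1, an iteration cap) as defaults. This is a simplified illustration under stated assumptions: all function names are hypothetical, and `evaluate` stands in for the cross-validated score of AdaBoost + C4.5 on the features selected by a binary mask. The toy fitness at the end is made up purely so the sketch runs on its own.

```python
import random

def tournament(population, fitness, rng):
    """Draw two random individuals and keep the fitter one as a parent."""
    a, b = rng.randrange(len(population)), rng.randrange(len(population))
    return population[a] if fitness[a] >= fitness[b] else population[b]

def exchange_gene(p1, p2, rng):
    """Swap one randomly chosen gene between two parents, giving two children."""
    c1, c2 = p1[:], p2[:]
    pos = rng.randrange(len(p1))
    c1[pos], c2[pos] = c2[pos], c1[pos]
    return c1, c2

def mutate(individual, rate, rng):
    """With probability `rate`, flip one randomly chosen bit of the mask."""
    child = individual[:]
    if rng.random() < rate:
        pos = rng.randrange(len(child))
        child[pos] = 1 - child[pos]
    return child

def ga_lvw(n_features, evaluate, pop_size=10, mutation_rate=0.1,
           iterations=500, seed=0):
    """Search binary feature masks (1 = selected) and return the best one seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")
    for _ in range(iterations):            # termination: iteration cap only
        fitness = [evaluate(ind) for ind in population]
        for ind, score in zip(population, fitness):
            if score > best_score:         # update the global optimum
                best_mask, best_score = ind[:], score
        new_population = []
        while len(new_population) < pop_size:
            c1, c2 = exchange_gene(tournament(population, fitness, rng),
                                   tournament(population, fitness, rng), rng)
            new_population += [mutate(c1, mutation_rate, rng),
                               mutate(c2, mutation_rate, rng)]
        population = new_population[:pop_size]
    return best_mask, best_score

# Hypothetical toy fitness: counts positions that agree with a hidden target mask.
TOY_TARGET = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, TOY_TARGET) if m == t)
```

Because the loop stops at the iteration cap regardless of progress, the returned mask is only the best one evaluated so far, which is consistent with the non-converged runs mentioned in the text.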
Fig. 6 is a schematic diagram of the average classification accuracy on the 15 power system data sets according to the first embodiment of the present invention, and Fig. 7 shows the average F1 value on the same 15 data sets. This embodiment compares the GA-LVW-based AdaBoost algorithm, the LVW + AdaBoost algorithm without the genetic algorithm, and the standalone AdaBoost algorithm; the results are shown in Figs. 6 and 7, where Fig. 6 gives the average accuracy of the three experiments on the 15 data sets and Fig. 7 gives their average F1 values. With a limited number of iterations, the GA-LVW-based AdaBoost algorithm obtains better results on most data sets, and the cases of lower accuracy are attributed to runs that had not yet converged.
Fig. 8 is a schematic diagram comparing the classification accuracy of the genetic-algorithm-based Las Vegas wrapper feature selection algorithm with that of AdaBoost + JRip according to the first embodiment of the present invention, and Fig. 9 is a schematic diagram comparing their average F1 values. This embodiment is also compared with the AdaBoost + JRip method of Raymond Borges and Justin Beaver, which classifies using the top 40 features ranked by information gain, whereas this embodiment selects from all the features. Since Raymond Borges and Justin Beaver performed their experiments in Weka, this set of experiments was also performed in Weka for consistency. The feature subsets selected in the first set of experiments were used to train AdaBoost + JRip, which was then evaluated with ten-fold cross-validation; the experimental results are shown in Figs. 8 and 9.
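For readers unfamiliar with the baseline's filter step, the sketch below ranks features by information gain and keeps the top k. It is an illustration only: it assumes discrete feature values (continuous measurements would need to be discretized first), the function names are hypothetical, and it is not the Weka implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(column, labels):
    """Label entropy minus the entropy left after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def top_k_features(rows, labels, k):
    """Return the indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([row[j] for row in rows], labels)
             for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]
```

Each feature is scored independently of the others, which is why a ranking of this kind can miss combinations that are only informative together.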
The experimental results show that on some data sets, such as data5, data7 and data8, GA-LVW improves the classification accuracy of the AdaBoost + JRip algorithm, and on the other data sets it achieves comparable accuracy. Note that in this set of experiments AdaBoost + JRip is not embedded into LVW to perform feature selection again; the feature selection results of the first set of experiments are reused. Even so, comparable accuracy is obtained, with better results on some data sets. This indicates that feature selection performed over the full feature set can uncover potential feature combinations and further improve classification performance, whereas selecting the top k features by information gain ranking sometimes loses potentially useful information, which supports the effectiveness of the algorithm.
To further analyze the performance of the algorithm, take data5 as an example. Table 3 gives the confusion matrix for the classification results of the AdaBoost + JRip algorithm. The classifier recognizes Attack and NoEvents events well but recognizes Natural events poorly: it mistakes part of the Natural events for Attack events, which reduces the classification accuracy, and similar behavior appears on the other 14 data sets. The cause is class overlap, that is, the two classes of events overlap on part of the feature subsets, so the classifier cannot separate them effectively from the feature combination; this is also a bottleneck for classifier performance.
Table 3 data5 confusion matrix
Figure BDA0002310481080000141
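A confusion-matrix analysis of this kind can be sketched as follows. The helper names are hypothetical, and the sketch carries no data of its own: it does not reproduce the values of Table 3, which is available only as an image.

```python
from collections import Counter

CLASSES = ["Attack", "Natural", "NoEvents"]

def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """matrix[i][j] = number of samples of classes[i] predicted as classes[j]."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

def most_confused_pair(matrix, classes=CLASSES):
    """Return (true class, predicted class, count) for the largest off-diagonal cell."""
    best = None
    for i, t in enumerate(classes):
        for j, p in enumerate(classes):
            if i != j and (best is None or matrix[i][j] > best[2]):
                best = (t, p, matrix[i][j])
    return best
```

Reading the largest off-diagonal cell is a quick way to locate the kind of class overlap described above, such as Natural events being predicted as Attack.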
In this embodiment, the application of GA-LVW to the power system shows that GA-LVW completes feature selection while training the model and improves the classification accuracy. This verifies the effectiveness and usability of the method, which can contribute to the development of data-driven smart grid intrusion detection systems, the safety analysis of power electronic systems, fault detection, and the like.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (5)

1. A genetic algorithm packaged feature selection power grid intrusion detection method is characterized by comprising the following steps:
step S1, binary coding is carried out on the feature population according to the genetic algorithm, each feature is represented by 1 or 0, 1 represents that the feature is selected, and 0 represents that the feature is not selected;
step S2, initializing the feature population after binary coding;
step S3, judging whether the initialized feature population reaches a preset termination condition by using a Las Vegas wrapping type feature selection algorithm;
if the judgment result is that the initialized feature population does not reach the preset termination condition, executing step S4, and if the judgment result is that the initialized feature population reaches the preset termination condition, outputting a training model;
step S4, training a base learner into a strong learner by using an AdaBoost algorithm;
step S5, evaluating the strong learner by using a cross validation method to obtain an optimal learner;
step S6, updating the global optimum according to the optimum learner;
step S7, forming a new feature population through selection, crossover and mutation;
after the completion of step S7, step S3 is performed.
2. The genetic algorithm packaged feature selection power grid intrusion detection method according to claim 1, wherein the step of evaluating the strong learner by using a cross validation method to obtain an optimal learner comprises:
calculating the fitness value of the individual according to the fitness function, wherein the fitness value is used for evaluating the quality of the individual;
using a linear combination of base learners to minimize the loss function, the calculation formula being as follows:
Figure FDA0002310481070000011
$\mathrm{loss} = E_{x \sim D}\left[e^{-f(x)L(x)}\right]$ (2)
wherein $f(x)$ represents the true value, and for the two-class problem $f(x)$ takes the values 1 and -1;
the partial derivative of the loss function loss with respect to $L(x)$ is calculated to obtain the following formula:
Figure FDA0002310481070000021
let equation (3) equal zero, the following equation is obtained:
Figure FDA0002310481070000022
Figure FDA0002310481070000023
according to the data distribution $D_t$, a base learner is formed and the corresponding $\beta_t$ is determined so that the loss function is minimized, the calculation formula being as follows:
Figure FDA0002310481070000024
the following equation is obtained by simplifying equation (6):
Figure FDA0002310481070000025
wherein $\theta_t$ represents the probability that the classification result of the base classifier $l_t$ is not equal to the true value $f(x)$;
the partial derivative is calculated from equation (7), and the following equation is obtained:
Figure FDA0002310481070000026
setting equation (8) equal to 0, the weight update equation for the base classifier is obtained as follows:
Figure FDA0002310481070000027
3. The method of claim 2, further comprising:
the sample data distribution is adjusted, and the calculation formula is as follows:
Figure FDA0002310481070000028
equation (10) is approximated using the Taylor expansion to obtain the following equation:
Figure FDA0002310481070000029
wherein the Taylor expansion is:
Figure FDA00023104810700000210
the minimization process is performed on equation (12), and the following equation is obtained:
Figure FDA00023104810700000211
let
Figure FDA00023104810700000212
represent a data distribution $D_t$; then, according to the definition of mathematical expectation, an optimal base learner is obtained:
Figure FDA0002310481070000031
the update formula for the data sample distribution is obtained as:
Figure FDA0002310481070000032
4. The genetic algorithm packaged feature selection power grid intrusion detection method according to claim 1, wherein the step of forming a new feature population through selection, crossover and mutation comprises:
generating two individual indexes by a random tournament method, wherein the two indexes correspond to two individuals in the population;
comparing the fitness values of the two individuals, and selecting the individual with the larger fitness value as a parent for generating the next-generation population;
randomly selecting one gene of the two individuals selected as parents and exchanging it between them to form two new individuals, and storing the two new individuals in a new population set.
5. The method of claim 4, further comprising:
setting a preset mutation rate;
and, according to the mutation rate, randomly selecting one gene of an individual in the new population set and inverting it to form a new individual, and storing the new individual in the new population set.
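For the AdaBoost quantities referred to in claims 2 and 3 (the error rate θ_t, the base learner weight β_t, and the sample distribution D_t), the sketch below shows the standard textbook forms for binary labels in {-1, +1}. The equation images of the claims are not reproduced in this text, so this is an assumption about their content rather than a transcription, and the function names are hypothetical.

```python
import math

def base_learner_weight(error_rate):
    """Textbook AdaBoost weight: beta_t = 0.5 * ln((1 - theta_t) / theta_t)."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def update_distribution(weights, truths, predictions, beta):
    """Reweight samples: D_{t+1}(x) is proportional to D_t(x) * exp(-beta * f(x) * l_t(x)).

    truths and predictions hold values in {-1, +1}; the result is
    renormalized so that the weights sum to 1.
    """
    raw = [w * math.exp(-beta * f * l)
           for w, f, l in zip(weights, truths, predictions)]
    z = sum(raw)
    return [w / z for w in raw]
```

A base learner with an error rate of 0.5 receives weight 0, and misclassified samples gain weight after the update, so the next base learner concentrates on them.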
CN201911256743.4A 2019-12-08 2019-12-08 Genetic algorithm packaged feature selection power grid intrusion detection method Pending CN111027697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256743.4A CN111027697A (en) 2019-12-08 2019-12-08 Genetic algorithm packaged feature selection power grid intrusion detection method

Publications (1)

Publication Number Publication Date
CN111027697A true CN111027697A (en) 2020-04-17

Family

ID=70208269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256743.4A Pending CN111027697A (en) 2019-12-08 2019-12-08 Genetic algorithm packaged feature selection power grid intrusion detection method

Country Status (1)

Country Link
CN (1) CN111027697A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113162914A (en) * 2021-03-16 2021-07-23 江西理工大学 Intrusion detection method and system based on Taylor neural network
CN113610179A (en) * 2021-08-17 2021-11-05 安徽容知日新科技股份有限公司 Equipment fault detection classifier training method, computing equipment and storage medium
CN115174193A (en) * 2022-06-30 2022-10-11 北京炼石网络技术有限公司 Method, device and equipment for detecting data security intrusion based on GA algorithm
CN115174193B (en) * 2022-06-30 2023-08-15 北京炼石网络技术有限公司 Data security intrusion detection method, device and equipment based on GA algorithm

Similar Documents

Publication Publication Date Title
CN112333194B (en) GRU-CNN-based comprehensive energy network security attack detection method
CN111027697A (en) Genetic algorithm packaged feature selection power grid intrusion detection method
CN109800875A (en) Chemical industry fault detection method based on particle group optimizing and noise reduction sparse coding machine
CN111598179B (en) Power monitoring system user abnormal behavior analysis method, storage medium and equipment
CN110969194B (en) Cable early fault positioning method based on improved convolutional neural network
CN110247910A (en) A kind of detection method of abnormal flow, system and associated component
CN111711608B (en) Method and system for detecting abnormal flow of power data network and electronic equipment
CN106569030A (en) Alarm threshold optimizing method and device in electric energy metering abnormity diagnosis
CN106649479A (en) Probability graph-based transformer state association rule mining method
Zheng et al. Real-time transient stability assessment based on deep recurrent neural network
CN108090606A (en) Equipment fault finds method and system
CN114580829A (en) Power utilization safety sensing method, equipment and medium based on random forest algorithm
CN113765880A (en) Power system network attack detection method based on space-time correlation
CN110580213A (en) Database anomaly detection method based on cyclic marking time point process
CN114841199A (en) Power distribution network fault diagnosis method, device, equipment and readable storage medium
CN113259379A (en) Abnormal alarm identification method, device, server and storage medium based on incremental learning
CN115129607A (en) Power grid safety analysis machine learning model test method, device, equipment and medium
CN108768750A (en) Communication network failure localization method and device
CN115185804A (en) Server performance prediction method, system, terminal and storage medium
Gao et al. The prediction role of hidden markov model in intrusion detection
CN115242441A (en) Network intrusion detection method based on feature selection and deep neural network
CN114861792A (en) Complex power grid key node identification method based on deep reinforcement learning
Boateng Unsupervised Ensemble Methods for Anomaly Detection in PLC-based Process Control
Wang et al. Network intrusion detection based on lightning search algorithm optimized extreme learning machine
CN114726622B (en) Back door attack influence evaluation method for power system data driving algorithm, system thereof and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200417