CN115348063A

CN115348063A - DNN and K-means-based power system network flow identification method

Info

Publication number: CN115348063A
Application number: CN202210882066.2A
Authority: CN
Inventors: 刘建戈; 张鹏宇; 季一木; 李茂�; 姜蒙娜; 王伟业; 刘尚东; 高山
Original assignee: Nanjing Dingyan Power Technology Co ltd; Nanjing University of Posts and Telecommunications; HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Nanjing Dingyan Power Technology Co ltd; Nanjing University of Posts and Telecommunications; HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2022-05-07
Filing date: 2022-07-25
Publication date: 2022-11-15

Abstract

The invention relates to the technical field of artificial intelligence for network security and discloses a DNN and K-means based power system network flow identification method. The method screens the original data and selects the data items that provide more information for classification to form data samples; preprocesses the sample data with an integration operation and a normalization operation; iteratively trains a DNN network model on the preprocessed training set and uses it to perform a preliminary classification of the power system network traffic, obtaining a classification confidence and positive and negative case results; and uses a K-means algorithm to classify the samples that the DNN network model judges to be suspected servers. Compared with the prior art, the method achieves higher accuracy when classifying power network data and can meet the requirements of power system network traffic classification in a real environment.

Description

DNN and K-means-based power system network flow identification method

Technical Field

The invention relates to the technical field of network security artificial intelligence, in particular to a DNN and K-means-based power system network flow identification method.

Background

With the development and application of power system network technology, the scale of the power system network keeps growing, its complexity has increased markedly, and the network security risk of the power system has risen accordingly. Now that threats to the power network have become routine, accurate detection and analysis capability and early-warning capability are gradually becoming the key to a new generation of big-data security capability.

Attackers basically attack the power system network through network services, and the attack-initiating end has traffic characteristics similar to those of the servers of the power system network. Identifying and classifying power system network traffic is therefore a key step in network security. Traditional network traffic identification methods usually depend on a large amount of manual inquiry and verification or require rules to be determined manually, and their cost is not negligible in current big-data scenarios involving millions of traffic records. Moreover, traditional methods have long query intervals, poor real-time performance and low accuracy, so they can only support passive defense strategies, have insufficient early-warning capability, and struggle to deal with new network security threats to the power system.

Disclosure of Invention

Purpose of the invention: in view of the problems in the prior art, the invention provides a DNN and K-means based power system network flow identification method that classifies power system network traffic effectively, discovers abnormal traffic and illegal service terminals as early as possible, and maintains the security of the power system network.

The technical scheme is as follows: the invention provides a DNN and K-means based power system network flow identification method, which comprises the following steps:

step 1: screening original data, selecting data items which can provide more information for classification to form data samples, wherein the data samples comprise server IP addresses, ports, protocols, client IP addresses, byte numbers, bit rates, packet numbers, session total numbers and time;

step 2: performing integration operation and normalization operation on the sample data to perform preprocessing;

step 3: iteratively training a DNN network model, using a preprocessed training set to train the DNN network model, and performing preliminary classification on the network traffic of the power system to obtain a classification confidence and positive and negative case results;

step 4: classifying, by using a K-means algorithm, the samples judged to be suspected servers after processing by the DNN network model.

Further, in step 1, some IPs of the power system network are known to be power system servers and some IPs are known not to be power system servers; the sample data corresponding to these IPs are extracted as the positive examples and negative examples of the training set, respectively.

Further, the integrating operation and the normalizing operation in step 2 specifically include:

integrating the sample data of the same server IP, the same port, the same protocol and the same time period into one piece of data by using the quadruple (server IP, port, protocol, time) as an index, and then performing max-min normalization on each sample to map the sample values to the interval [0, 1].

Further, the DNN network model structure in step 3 is as follows: the input layer receives an n-dimensional data sample; the data features pass through the three fully connected layers of the hidden layer to the output layer; the output layer outputs a logits value, which a Sigmoid activation function converts into a predicted confidence value; and the three fully connected layers use the nonlinear activation function ReLU. Real-time test data are input into the network to obtain the confidence of the predicted result, and the probability value output by the DNN network model is then judged to obtain positive and negative cases.

Further, the number of the neurons of the three fully-connected layers of the hidden layer is 512, 256 and 128.

Further, the preliminary classification of the network traffic of the power system in the step 3 to obtain the classification confidence and the positive and negative case results includes the following specific operations:

step 3.1: acquiring data, namely acquiring a group of power system network traffic training data $\{(x^{(n)}, y^{(n)}) \mid 1 \le n \le N\}$, wherein x is a network traffic sample comprising statistical information such as the total packet number, the packet number per second, the total byte number and the byte number per second of the IP of a certain end; y is a sample label, a manually labeled value that represents whether the training data belongs to a server; N is the total number of training data;

step 3.2: for a DNN network output function f, inputting data into the network to obtain a classification result:

$$\hat{y}^{(n)} = f(x^{(n)}; \theta)$$

wherein $\hat{y}^{(n)}$ is the confidence of the classification result, and θ is the network parameter of the DNN network model;

step 3.3: iteratively training the DNN network model, using a binary cross entropy function as the training loss function:

$$L(\theta, x, y) = -\frac{1}{N}\sum_{n=1}^{N}\left[y^{(n)}\log\hat{y}^{(n)} + \left(1 - y^{(n)}\right)\log\left(1 - \hat{y}^{(n)}\right)\right]$$

step 3.4: optimizing and updating the network parameter θ, with stochastic gradient descent selected as the optimizer.

Further, the specific steps of classifying by using the K-means algorithm in the step 4 are as follows:

step 4.1: determining a threshold and dividing the classification results, a result higher than the threshold being considered a server and a result lower than the threshold not; then finding all original traffic samples classified as servers and adding their respective classification confidences to them to form new sample data $\{\tilde{x}^{(m)} \mid 1 \le m \le M\}$, wherein M is the total number of new samples;

step 4.2: determining the cluster number K, thereby determining K cluster centers $\{c_k \mid 1 \le k \le K\}$, and randomly initializing the cluster centers; calculating the Euclidean distance from each sample to each cluster center, comparing these distances in turn, and then assigning each sample to the cluster of its nearest cluster center to obtain K clusters $\{s_k \mid 1 \le k \le K\}$;

step 4.3: after the clusters are obtained, the K-means algorithm updates the position of each cluster center from its cluster, the new cluster center being the mean of the samples in the cluster on each dimension.

Further, the specific method for determining the cluster number K in step 4.2 is: calculating the sum of squared errors (SSE) from the samples in each cluster to the cluster center; taking K = 1, 2, 3, … in turn; plotting a graph with K as the independent variable and the average SSE as the dependent variable; and finding the inflection point where the slope of the curve changes from a rapid decline to a gentle decline, the K value at that point being the optimal K value.

Advantageous effects:

Compared with manual methods, the method provided by the invention offers better real-time performance, stronger data processing capacity and lower cost. Compared with traditional port-based and traffic-based data analysis methods, it is more accurate; after judging whether an end is a server, it adds a machine-learning clustering algorithm that automatically performs multi-class classification, which makes the subsequent analysis and verification of results by staff more convenient.

Drawings

FIG. 1 is a diagram of a system model architecture;

FIG. 2 is a schematic diagram of a deep neural network architecture;

FIG. 3 is a deep neural network connection diagram.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention discloses a DNN and K-means based power system network flow identification method, which comprises the following steps:

Step 1: acquire real-time data from the power system network traffic database and screen out the required data items.

Step 2: integrate and normalize the data, preprocessing it into a form suitable for classification.

The data samples in the power network traffic database contain many data items, and some of them, such as TCP synchronization packets and the average ACK delay of the client, do not help classify network traffic. The invention therefore first screens the original data, selects the data items that provide more information for classification, and finally uses data items such as server IP address, port, protocol, client IP address, byte number, bit rate, packet number, total session number and time as sample data. Among all the traffic data, some power system network IPs are known to be power system servers and some IPs are known not to be servers; the sample data corresponding to these IPs are extracted as the positive examples and negative examples of the training set, respectively.

Although the known server IPs make up a very small proportion of all end IPs in the power network, the data samples belonging to known server IPs account for the vast majority of all data samples. This indicates that the vast majority of sessions in the network are initiated by a very small number of servers, so positive cases far outnumber negative cases, which strongly affects the training effect. The invention therefore performs an integration operation on the sample data. Specifically, using the quadruple (server IP, port, protocol, time) as an index, it integrates the sample data of the same server IP, the same port, the same protocol and the same time period into one piece of data. This effectively reduces the number of positive cases, so that positive and negative cases are more balanced after integration and a better experimental result is obtained.
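The integration step can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the field names (`server_ip`, `port`, `protocol`, `hour`, `bytes`, `packets`, `sessions`) are hypothetical, the hour stands in for the time period, and summing the counters is an assumed way of merging records, since the text does not state how the values within one group are combined.

```python
from collections import defaultdict

def integrate(records):
    """Merge records that share the quadruple (server IP, port, protocol, time period)."""
    merged = defaultdict(lambda: {"bytes": 0, "packets": 0, "sessions": 0})
    for r in records:
        key = (r["server_ip"], r["port"], r["protocol"], r["hour"])
        # Assumption: counters of records in the same group are summed.
        for field in ("bytes", "packets", "sessions"):
            merged[key][field] += r[field]
    return dict(merged)

# Hypothetical records: two of them share the same quadruple.
records = [
    {"server_ip": "10.0.0.1", "port": 80, "protocol": "TCP", "hour": 9,
     "bytes": 1200, "packets": 10, "sessions": 1},
    {"server_ip": "10.0.0.1", "port": 80, "protocol": "TCP", "hour": 9,
     "bytes": 800, "packets": 6, "sessions": 1},
    {"server_ip": "10.0.0.2", "port": 443, "protocol": "TCP", "hour": 9,
     "bytes": 500, "packets": 4, "sessions": 1},
]
integrated = integrate(records)
```

Three raw records collapse into two integrated samples here, which is the mechanism by which the many records of a busy server are reduced to fewer positive cases.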

Finally, the data items in a data sample differ too much in magnitude: for example, the total byte number can reach the order of $10^6$, while the packet number per second is only of the order of $10^1$. Training directly on such data may cause gradient explosion in the neural network and affect model performance. The invention therefore performs max-min normalization on each sample, mapping the sample values to the interval [0, 1] so that the model converges quickly and stably during training. For a sample x, the normalization formula is shown as formula 1:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

wherein $x_{\min}$ denotes the minimum value of all samples, $x_{\max}$ denotes the maximum value of all samples, and $x'$ denotes the normalized sample.
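As a small worked example of max-min normalization, the sketch below rescales two columns of very different magnitude. It assumes the minimum and maximum are taken per data item (per column), and the handling of a constant column is an added safeguard that the text does not discuss; the numbers are made up.

```python
def min_max_normalize(column):
    """Map the values of one data item to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information and maps to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up values: total bytes are around 10^6, packets per second around 10^1.
total_bytes = [1_000_000, 250_000, 4_000_000]
packets_per_second = [10, 35, 20]

norm_bytes = min_max_normalize(total_bytes)
norm_pps = min_max_normalize(packets_per_second)
```

After normalization both columns lie in [0, 1], so neither dominates the gradient purely because of its scale.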

Step 3: train a binary-classification DNN model on the data to distinguish whether the IP of a certain end is a server IP. The DNN processes the data to be classified and outputs a binary-classification prediction that represents the confidence that the data belongs to a server in the network. The results are screened by confidence: a confidence greater than a certain threshold is a positive case, i.e., server data, and a confidence below the threshold is a negative case, i.e., non-server data.

For the deep neural network model construction mentioned in step 3:

First, data are acquired: a group of power system network traffic training data $\{(x^{(n)}, y^{(n)}) \mid 1 \le n \le N\}$ is obtained, wherein x is a network traffic sample comprising statistical information such as the total packet number, the packet number per second, the total byte number and the byte number per second of the IP of a certain end; y is a sample label, a manually labeled value that represents whether the training data belongs to a server; N is the total number of training data. Then, for a DNN network output function f, the data are input into the network to obtain a classification result, as shown in formula 2:

$$\hat{y}^{(n)} = f(x^{(n)}; \theta) \tag{2}$$

wherein $\hat{y}^{(n)}$ is the confidence of the classification result and θ is the network parameter of the DNN. The DNN network structure is shown schematically in FIG. 2. The input layer packs the network traffic data into batches and passes them into the hidden-layer neural network. The hidden layer contains several groups of neurons. The hidden layer passes the extracted features to the output layer, and the output layer outputs logits, which a Sigmoid function converts into a probability.
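The forward pass of such a network can be sketched in plain Python. The hidden sizes 512, 256 and 128 and the ReLU activation come from the disclosure above; the weight initialization, the input dimension of 6 and all function names are illustrative assumptions, and the weights here are random rather than trained.

```python
import math
import random

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(rng, fan_in, fan_out):
    # Assumption: small uniform random weights, zero biases.
    bound = 1.0 / math.sqrt(fan_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, [0.0] * fan_out

def build_network(n_features, hidden=(512, 256, 128), seed=0):
    rng = random.Random(seed)
    sizes = [n_features, *hidden, 1]
    return [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

def predict_confidence(network, x):
    activation = list(x)
    for weights, biases in network[:-1]:
        activation = relu(dense(activation, weights, biases))  # hidden layers
    weights, biases = network[-1]
    logit = dense(activation, weights, biases)[0]  # output layer produces logits
    return sigmoid(logit)  # Sigmoid turns the logit into a confidence

network = build_network(n_features=6)
confidence = predict_confidence(network, [0.2, 0.0, 1.0, 0.5, 0.1, 0.9])
```

In practice a deep learning framework would replace these hand-written layers; the sketch only makes the layer order and the logit-to-confidence conversion explicit.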

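To make the classification result $\hat{y}^{(n)} = f(x^{(n)}; \theta)$ as accurate as possible, the DNN requires iterative training. The invention uses a binary cross entropy function as the training loss function, as shown in formula 3:

$$L(\theta, x, y) = -\frac{1}{N}\sum_{n=1}^{N}\left[y^{(n)}\log\hat{y}^{(n)} + \left(1 - y^{(n)}\right)\log\left(1 - \hat{y}^{(n)}\right)\right] \tag{3}$$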

wherein N is the total number of samples. The loss function is used to iteratively train the DNN network model, an optimization process that aims to minimize the loss value so that the training output is as close to the label as possible; that is, the objective function J can be expressed as:

$$J = \min_{\theta} L(\theta, x, y) \tag{4}$$
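The binary cross entropy loss that this objective minimizes can be computed directly from labels and predicted confidences. The sketch below is illustrative: the clipping constant `eps` is an added numerical safeguard, not something the text specifies, and the sample values are made up.

```python
import math

def binary_cross_entropy(labels, confidences, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted confidences."""
    total = 0.0
    for y, p in zip(labels, confidences):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Made-up predictions: the first set agrees with the labels, the second does not.
loss_good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
loss_bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.4])
```

Confidences that agree with the labels give a lower loss than confidences that contradict them, which is why minimizing this value pushes the output toward the label.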

In this process, the network parameter θ is optimized and updated. Stochastic gradient descent (SGD) is selected as the optimizer, and the update can be expressed as:

$$\theta_{t+1} = \theta_t - \lambda_t g_t \tag{5}$$

where t denotes the iteration, $\lambda_t$ is the optimization weight, commonly referred to as the learning rate, and $g_t$ is a stochastic gradient whose expectation is the gradient of the loss, i.e., it satisfies $\mathbb{E}[g_t] = \nabla_{\theta} L(\theta_t, x, y)$.

By iterating this algorithm a certain number of times, the DNN can fully learn the characteristics of the power network traffic data, so that it can classify real-time power network traffic data accurately.

When the DNN model was trained as described above, the number of epochs used was 300, the batch size was 128, and the learning rate was 0.02.
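To make the update rule concrete, the sketch below trains a single logistic unit (not the three-layer DNN) with mini-batch SGD, using the epoch count, batch size and learning rate quoted above as defaults. The toy data set, the one-dimensional feature and the function names are assumptions made purely for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, data):
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(w * x + b), 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(data)

def train_sgd(data, epochs=300, batch_size=128, lr=0.02, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Mini-batch gradient of the loss: the stochastic gradient g_t.
            g_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            g_b = sum(sigmoid(w * x + b) - y for x, y in batch) / len(batch)
            # theta_{t+1} = theta_t - lambda_t * g_t
            w -= lr * g_w
            b -= lr * g_b
    return w, b

# Toy data (assumption): one normalized feature, labeled 1 when it exceeds 0.5.
data_rng = random.Random(1)
data = [(x, 1 if x > 0.5 else 0) for x in (data_rng.random() for _ in range(256))]

loss_before = bce_loss(0.0, 0.0, data)
w, b = train_sgd(data)
loss_after = bce_loss(w, b, data)
```

Each mini-batch gradient is a noisy estimate of the full gradient, and repeated small steps in the opposite direction lower the loss over the course of training.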

Step 4: add a confidence item to the positive-case data samples and perform K-means cluster analysis to obtain a multi-class result representing the specific type of server.

Training the DNN with a deep learning method yields a classification result indicating whether a traffic sample belongs to a server. To explore further which kind of server the traffic sample belongs to, the invention applies the K-means clustering algorithm to the traffic samples classified as servers. K-means is an unsupervised machine-learning clustering method, and the procedure is as follows. First, a threshold is determined and the classification results are divided: a result above the threshold is considered a server, and a result below the threshold is not. Then, all original traffic samples classified as servers are found, and their respective classification confidences are added to them to form new sample data $\{\tilde{x}^{(m)} \mid 1 \le m \le M\}$, where M is the total number of new samples.
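Forming the new samples amounts to a filter followed by appending one extra feature. The sketch below assumes each sample is a list of normalized features and uses 0.5 as the default threshold, the value used in the experiment described later; the function name and the data are hypothetical.

```python
def build_cluster_samples(samples, confidences, threshold=0.5):
    """Keep samples judged to be servers and append their confidence as a feature."""
    return [list(x) + [p]
            for x, p in zip(samples, confidences)
            if p > threshold]

# Made-up normalized samples and DNN confidences.
samples = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.7]]
confidences = [0.97, 0.12, 0.66]

new_samples = build_cluster_samples(samples, confidences)
```

The second sample falls below the threshold and is dropped, so M = 2 new samples remain, each one dimension longer than the original.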

Next, the cluster number K is determined, which gives K cluster centers $\{c_k \mid 1 \le k \le K\}$, and the cluster centers are randomly initialized. The Euclidean distance from each sample to each cluster center is then calculated, as shown in formula 6:

$$d(\tilde{x}^{(i)}, c_j) = \lVert \tilde{x}^{(i)} - c_j \rVert_2 \tag{6}$$

wherein $\tilde{x}^{(i)}$ represents new sample data and $c_j$ is a cluster center.

The distances from each sample to the cluster centers are compared in turn, and each sample is then assigned to the cluster of its nearest cluster center, giving K clusters $\{s_k \mid 1 \le k \le K\}$.

After the clusters are obtained, the K-means algorithm updates the position of each cluster center from its cluster; the new cluster center is the mean of the samples in the cluster on each dimension, namely:

$$c_k' = \frac{1}{|s_k|}\sum_{\tilde{x} \in s_k} \tilde{x} \tag{7}$$

wherein $c_k'$ is the updated k-th cluster center. The algorithm is repeated until convergence, after which the samples belonging to servers are divided into K classes.
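The assignment and update steps described above can be sketched as a short loop. This is a minimal illustration under assumptions: the initial centers are drawn at random from the samples, an empty cluster keeps its previous center, and iteration stops when the centers no longer change; none of these details is fixed by the text, and the data are made up.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def assign(samples, centers):
    """Put every sample into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for x in samples:
        nearest = min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
        clusters[nearest].append(x)
    return clusters

def update(clusters, old_centers):
    """New center = per-dimension mean of the samples in the cluster."""
    centers = []
    for cluster, old in zip(clusters, old_centers):
        if not cluster:
            centers.append(old)  # assumption: an empty cluster keeps its center
        else:
            centers.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
    return centers

def kmeans(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(samples, k)]  # random initialization
    for _ in range(max_iter):
        new_centers = update(assign(samples, centers), centers)
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign(samples, centers)

samples = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
centers, clusters = kmeans(samples, k=2)
```

K-means only finds a local optimum, so the partition can depend on the initialization; running it with several seeds and keeping the result with the lowest SSE is a common precaution.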

The optimal K value of the K-means clustering algorithm can be determined experimentally, and the most common method is the elbow method. First, the sum of squared errors (SSE) from the samples in each cluster to the cluster center is calculated; it is a commonly used index of the classification quality of the samples in the clusters, as shown in formula 8:

$$SSE = \sum_{i=1}^{K}\sum_{p \in s_i} \lVert p - c_i \rVert_2^2 \tag{8}$$

where p is a sample in cluster $s_i$ and $c_i$ is the corresponding cluster center. A smaller SSE indicates better classification quality of the samples in the clusters.
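A direct computation of the SSE, assuming clusters are given as lists of points paired with their centers (the data below are made up):

```python
def sse(clusters, centers):
    """Sum of squared Euclidean distances from each sample to its cluster center."""
    return sum(
        sum((pi - ci) ** 2 for pi, ci in zip(p, c))
        for cluster, c in zip(clusters, centers)
        for p in cluster
    )

# Two made-up clusters with their centers.
clusters = [[[0.0, 0.0], [0.0, 2.0]], [[5.0, 5.0]]]
centers = [[0.0, 1.0], [5.0, 5.0]]

total_sse = sse(clusters, centers)
```

Each of the two points in the first cluster lies at distance 1 from its center and the single point in the second cluster coincides with its center, so the total is 1 + 1 + 0 = 2.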

As the cluster number K increases, the samples are partitioned more finely and the SSE of each cluster becomes correspondingly smaller. When K is smaller than the optimal cluster number, increasing K greatly improves the classification quality of each cluster, so the SSE drops sharply; once K reaches the optimal cluster number, the return in classification quality from increasing K diminishes rapidly, so the decline in SSE levels off. To determine the optimal K value, K is taken as 1, 2, 3, … in turn, a graph is plotted with K as the independent variable and the average SSE as the dependent variable, and the inflection point where the slope of the curve changes from a rapid decline to a gentle decline is found; the K value at that point is the optimal K value.
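The text locates the elbow by inspecting the plotted curve. One possible way to automate that inspection, offered here only as an assumed heuristic and not as the method of the invention, is to pick the K at which the drop in SSE shrinks the most (the largest second difference). The SSE values below are made up.

```python
def elbow_k(sse_by_k):
    """sse_by_k[i] is the SSE for K = i + 1; returns the K with the sharpest bend."""
    if len(sse_by_k) < 3:
        raise ValueError("need SSE values for at least three consecutive K")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(sse_by_k) - 1):
        drop_before = sse_by_k[i - 1] - sse_by_k[i]
        drop_after = sse_by_k[i] - sse_by_k[i + 1]
        bend = drop_before - drop_after  # large when a steep drop turns gentle
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Made-up SSE curve for K = 1..6: steep until K = 3, nearly flat afterwards.
sse_curve = [1000.0, 700.0, 200.0, 170.0, 150.0, 140.0]
k_opt = elbow_k(sse_curve)
```

For this curve the SSE drops by 500 when going from K = 2 to K = 3 and by only 30 afterwards, so the heuristic selects K = 3. On noisy curves the elbow can be ambiguous, and visual inspection remains a reasonable check.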

For the DNN classification model, the commonly used indicators are precision and recall, defined as follows:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$

wherein tp represents the number of true positive examples, namely the number of samples that are actually positive and are predicted to be positive; fp represents the number of false positive examples, namely the number of samples that are actually negative but are predicted to be positive; fn represents the number of false negative examples, namely the number of samples that are actually positive but are predicted to be negative. The performance of the model can be measured along two dimensions using the precision and recall indicators. In experiments, different thresholds δ are generally taken as sample points to draw a precision-recall (P-R) curve, and the quality of the model is judged by the height of the curve's break-even point.
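Precision and recall at a given threshold can be computed as follows. The sketch assumes 0/1 labels and treats a confidence strictly above the threshold as a positive prediction; returning 0.0 when a denominator is zero is an added convention, and the data are made up.

```python
def precision_recall(labels, confidences, threshold=0.5):
    """Precision and recall of thresholded confidences against 0/1 labels."""
    tp = fp = fn = 0
    for y, p in zip(labels, confidences):
        predicted_positive = p > threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and confidences: tp = 2, fp = 1, fn = 1 at threshold 0.5.
labels = [1, 1, 1, 0, 0]
confidences = [0.9, 0.8, 0.3, 0.6, 0.1]

precision, recall = precision_recall(labels, confidences)
```

Calling the function for a range of thresholds yields the points of a P-R curve.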

To evaluate the K-means clustering algorithm, the silhouette coefficient can be used as an evaluation index in addition to the SSE above. For a sample x, the silhouette coefficient is:

$$SC(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}}$$

where a(x) is the intra-cluster dissimilarity, which is the sum of the distances between sample x and the other samples in its cluster; b(x) is the inter-cluster dissimilarity, which is the sum of the distances between sample x and the samples in other clusters. The value range of SC is [-1, 1], and the closer it is to 1, the better the clustering performance.
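A per-sample silhouette coefficient can be sketched as below. Note the assumption: this sketch follows the conventional definition, in which a(x) is the mean distance to the other samples of the same cluster and b(x) is the smallest mean distance to the samples of any other cluster, whereas the text above speaks of sums. The function names and the data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def silhouette(index, own_cluster, other_clusters):
    """Silhouette coefficient of own_cluster[index], using mean distances."""
    x = own_cluster[index]
    peers = [p for i, p in enumerate(own_cluster) if i != index]
    non_empty = [c for c in other_clusters if c]
    if not peers or not non_empty:
        return 0.0  # assumption: undefined cases are scored as 0
    a = sum(euclidean(x, p) for p in peers) / len(peers)
    b = min(sum(euclidean(x, p) for p in c) / len(c) for c in non_empty)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom

# Made-up clusters: two tight groups that are far apart.
own = [[0.0, 0.0], [0.0, 1.0]]
others = [[[4.0, 0.0], [4.0, 1.0]]]

sc = silhouette(0, own, others)
```

Here a = 1 and b is slightly above 4, so the coefficient is about 0.75, reflecting a sample that sits much closer to its own cluster than to the neighboring one.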

The invention validates the proposed method using network data acquired from an actual power system network. A total of 25 days of power network traffic data were selected, with timestamps divided by hour. First, known intranet server IPs and IPs known not to belong to servers were screened, giving 2257429 pieces of original data in total; after marking, integration and normalization, 162952 training data samples were obtained to form the training set. Then the remaining 2784335 pieces of unknown data were preprocessed (all operations except marking) to obtain 6988459 test data samples, which form the test set.

The training set is used to build the model, the model is used to predict the test set, and the results are divided at the threshold δ = 0.5, yielding 591 suspected IPs. Because the test set is unlabeled data, the results were verified manually: some of the IPs belong to real service servers, and some are server-like devices such as monitoring devices, IAD devices and soft switch devices. The specific experimental results are shown in Table 1:

TABLE 1 test set Classification results

In Table 1, the ratio is the proportion of a given type of device among all suspected IPs, and the other devices are devices not internal to the grid. Most suspected IPs are unreported devices, accounting for 70.38%, while service servers account for 25.92%; the average confidence is high, reaching 97.59%, which indicates that the model can discover suspected servers fairly accurately.

Cluster analysis of the data shows that the service servers also fall into different classes. Among the suspected IPs found, 66.67% of the service servers were clustered into one class, and the remaining 4 classes also contained different service servers. Among the server-like devices, the cluster analysis succeeded in distinguishing the various devices; the specific results are shown in Table 2.

TABLE 2 test set Cluster analysis results

The ratio in Table 2 is the proportion of the devices clustered into a class relative to the total number of devices of that type. It can be seen that, for server-like devices, most devices of each type are clustered into the same class, which indicates that the cluster analysis distinguishes device types to a certain extent.

The above embodiments are merely illustrative of the technical concepts and features of the present invention, and the purpose of the embodiments is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A DNN and K-means based power system network flow identification method is characterized by comprising the following steps:

step 1: screening original data, and selecting data items which can provide more information for classification to form data samples, wherein the data samples comprise server IP addresses, ports, protocols, client IP addresses, byte numbers, bit rates, packet numbers, session total numbers and time;

step 4: classifying, by using a K-means algorithm, the samples judged to be suspected servers after processing by the DNN network model.

2. The method for identifying network traffic of electric power system based on DNN and K-means as claimed in claim 1, wherein in step 1, some IP of electric power system network are known as server of electric power system, and some IP determined not to be server of electric power system, and sample data corresponding to said IP are extracted as positive and negative examples of training set respectively.

3. The DNN and K-means based power system network traffic identification method according to claim 1, wherein the integrating operation and the normalizing operation in the step 2 specifically comprise:

4. The DNN and K-means based power system network traffic identification method according to claim 1, wherein the DNN network model structure in the step 3 is: the input layer receives an n-dimensional data sample; the data features pass through the three fully connected layers of the hidden layer to the output layer; the output layer outputs a logits value, which a Sigmoid activation function converts into a predicted confidence value; the three fully connected layers use the nonlinear activation function ReLU; real-time test data are input into the network to obtain the confidence of the predicted result, and the probability value output by the DNN network model is then judged to obtain positive and negative cases.

5. The DNN and K-means based power system network traffic identification method of claim 4, wherein the number of neurons of the three fully connected layers of the hidden layer is 512, 256 and 128.

6. The DNN and K-means based power system network traffic identification method according to claim 4 or 5, wherein the preliminary classification of the power system network traffic in the step 3 to obtain the classification confidence and the positive and negative case results comprises the following specific operations:

7. The DNN and K-means-based power system network traffic identification method of claim 1, wherein the K-means algorithm used for classification in the step 4 comprises the following specific steps:

M is the total number of new samples;

8. The DNN and K-means-based power system network flow identification method of claim 7, wherein the specific method for determining the cluster number K in the step 4.2 is: calculating the sum of squared errors (SSE) from the samples in each cluster to the cluster center; taking K = 1, 2, 3, … in turn; plotting a graph with K as the independent variable and the average SSE as the dependent variable; and finding the inflection point where the slope of the curve changes from a rapid decline to a gentle decline, the K value at that point being the optimal K value.