CN117478390A

CN117478390A - Network intrusion detection method based on improved density peak clustering algorithm

Info

Publication number: CN117478390A
Application number: CN202311461821.0A
Authority: CN
Inventors: 黄鑫; 杨帆; 李嫄源; 朱智勤; 周志浩; 安翼尧; 陈诗尧; 李家兴; 龚康; 刘秋卓; 文斌; 刘阳; 周嘉靖
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Chongqing University of Posts and Telecommunications
Priority date: 2023-11-06
Filing date: 2023-11-06
Publication date: 2024-01-30

Abstract

The invention relates to a network intrusion detection method based on an improved density peak clustering algorithm, and belongs to the technical field of machine learning and computer network security. The method comprises: encoding the character-type features in a network intrusion data set as numerical features and standardizing them; extracting features from the data set by principal component analysis to remove redundant data and reduce the dimension; computing the neighbors of the network intrusion data, using a natural neighbor search algorithm to determine the number of neighbors k at which the data reach a stable state; computing the density of each point and deriving local representative points from the density; computing the distances between the local representative points and applying a density peak clustering algorithm to those distances to obtain the clusters; and computing a cluster-based outlier factor for each cluster and treating the detected outliers as abnormal attack data. The method addresses two problems: existing methods often ignore clustered outliers, and current cluster-based intrusion detection methods cannot reliably identify manifold clusters.

Description

Network intrusion detection method based on improved density peak clustering algorithm

Technical Field

The invention belongs to the technical field of machine learning and computer network security, and relates to a network intrusion detection method based on an improved density peak clustering algorithm.

Background

With the rapid development of big data and artificial intelligence technology, networks keep growing in scale, which introduces more network security problems. Outlier detection plays an important role in network intrusion detection. Outliers are data points that deviate significantly from the other data points in a data set because they are produced by a different mechanism or an unusual process; in network data sets they are often generated by abnormal network attacks. Cluster-based network intrusion detection is usually applied offline. When the data scale is small, cluster-based methods detect abnormal points easily and identify burst attacks and isolated attacks effectively. Common cluster-based intrusion detection techniques build on clustering methods such as K-means, DBSCAN, and density peak clustering. When these conventional methods are applied to a network intrusion data set containing manifold clusters, they usually fail to identify the manifold clusters well, which makes the outlier detection results less representative.

Therefore, a new network intrusion detection method is needed to solve the above problems.

Disclosure of Invention

In view of the above, the invention aims to provide a network intrusion detection method based on an improved density peak clustering algorithm, which uses the accurate clustering result obtained by the improved algorithm to make the detected local outliers, and hence the outlier detection result, more representative. The invention evaluates the outlier degree of the clusters produced by clustering, treating each small cluster as a whole, and therefore detects cluster-based outliers better than outlier detection algorithms aimed at single-point outliers.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A network intrusion detection method based on an improved density peak clustering algorithm: a historical network intrusion data set R is obtained; the data in R are converted to numerical values and standardized; principal component analysis is applied to extract the main features and reduce the dimension; the finite neighbors of each point p_i in R are computed; a natural neighbor search algorithm determines the number of iterations at which a stable state is reached, denoted k; the density of each point p_i is computed to obtain the local representative point (core) of each point p_i; the shared-neighbor distances between the cores are computed; the density peak clustering algorithm is applied to the local representative points to obtain the clusters C_1, C_2, …, C_k; and an outlier factor is computed for each cluster, the clusters being sorted by outlier factor and the lowest n clusters selected as the outlier detection result.

The method specifically comprises the following steps:

S1: Preprocess the historical network intrusion data set: uniformly encode the character-type features and labels in the data set as numerical values, standardize the numerical values, and reduce the dimension of the standardized data;

S2: Build a ball tree based on Euclidean distance to compute the neighbors of each point p_i in the data set R, and traverse the ball tree to form an ordered k-nearest-neighbor matrix and an ordered distance matrix;

S3: From the k-nearest-neighbor matrix and the distance matrix obtained in step S2, obtain the iteration number k adaptively using a natural neighbor search algorithm;

S4: For each point p_i, compute its density rho(p_i) according to the density formula, and sort the density values to obtain a descending density matrix and the corresponding index matrix;

S5: For each point p_i, select the densest point among its k nearest neighbors as its local representative point (core);

S6: Group the points p_i that share the same local representative point into an initial fuzzy sub-cluster;

S7: Compute the distances between the local representative points (cores) according to the formula, and from them obtain the shortest paths between the local representative points;

S8: Apply the density peak clustering algorithm to the local representative points: construct a two-dimensional decision graph, select the cluster centers, and assign the points that are not local representative points to the clusters of their representative points, obtaining the final clusters C_1, C_2, …, C_k;

S9: From the chosen outlier proportion a, compute the outlier upper limit u using the formula;

S10: Compute the outlier factor of each cluster C_1, C_2, …, C_k according to the formula, sort the results, select the lowest n clusters as the final outlier detection result, and identify those n clusters as abnormal attack types.

Further, in step S1, principal component analysis is applied to reduce the dimension of the standardized data.
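
The preprocessing of step S1 can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the patented implementation: the function names `encode_and_standardize`, `first_principal_component` and `project` are hypothetical, categorical values are mapped to integer codes in sorted order (one possible encoding), and only the first principal component is extracted, by power iteration, whereas a practical system would more likely use a library PCA and keep several components.

```python
import math


def encode_and_standardize(rows):
    """Map string-valued columns to integer codes, then z-score every column."""
    standardized = []
    for col in zip(*rows):
        if any(isinstance(v, str) for v in col):
            # Assumption: integer codes are assigned in sorted order of the category names.
            codes = {name: code for code, name in enumerate(sorted(set(col)))}
            col = [codes[v] for v in col]
        mean = sum(col) / len(col)
        # A constant column has zero spread; divide by 1.0 instead of 0.0.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        standardized.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*standardized)]


def first_principal_component(X, iterations=200):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    n, m = len(X), len(X[0])
    cov = [[sum(X[r][a] * X[r][b] for r in range(n)) / n for b in range(m)]
           for a in range(m)]
    v = [1.0] + [0.0] * (m - 1)  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v


def project(X, v):
    """Project every row of X onto the direction v."""
    return [sum(x * c for x, c in zip(row, v)) for row in X]


# Toy records: one categorical feature and two numeric features.
rows = [("tcp", 1.0, 2.0), ("udp", 2.0, 4.0), ("tcp", 3.0, 6.0), ("icmp", 4.0, 8.0)]
X = encode_and_standardize(rows)
component = first_principal_component(X)
reduced = project(X, component)  # one value per record
```

In the toy data the two numeric columns are perfectly correlated, so a single component already captures most of the variance; real traffic records would need the number of retained components chosen from the explained variance.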

Further, in step S4, the density calculation formula is:

where rho(p_i) denotes the density of point p_i, N_k(p_i) denotes the set of k nearest neighbors of point p_i, and eu(p_i, o) denotes the Euclidean distance between point o and point p_i.

Further, in step S7, the formula for calculating the distance between the local representative points core is:

where inset(i, j) denotes the intersection between the fuzzy cluster of representative point i and the fuzzy cluster of representative point j.

Further, in step S7, the shortest path is acquired using the Floyd algorithm.

Further, in step S9, the calculation formula of the outlier upper limit is:

$$\{\, |C_1| + |C_2| + \dots + |C_{i-1}| \ge |R| \times a \,\} \cap \{\, |C_1| + |C_2| + \dots + |C_{i-2}| < |R| \times a \,\}$$

The number of points in the cluster C_i corresponding to this i is then the outlier upper limit, where |C| denotes the number of points in a cluster C.

Further, in step S10, the calculation formula of the outlier factor of the cluster is:

where CBOF(C_i) denotes the outlier factor of cluster C_i, and C_j is the assumed normal cluster adjacent to cluster C_i; d(C_i, C_j) is calculated as:

$$d(C_i, C_j) = \min\{\, \mathrm{eu}(p, q) \mid p \in C_i,\ q \in C_j \,\}$$

where d(C_i, C_j) denotes the shortest distance between cluster C_i and cluster C_j.

The invention has the beneficial effects that:

1) The method starts from the observation that a historical network intrusion data set, once mapped from a high-dimensional space to a low-dimensional one, becomes a manifold data set; that is, the dimension-reduced network intrusion data set contains complex manifold clusters that existing cluster-based outlier detection methods struggle to identify accurately. Introducing the improved density peak clustering algorithm gives the method higher accuracy on network intrusion data sets.

2) Network intrusion data sets contain few labelled samples, whereas existing machine-learning-based intrusion detection models often require large labelled training sets, which limits their practicality. The invention uses an intrusion detection model based on unsupervised learning that needs no labelled samples during intrusion detection, which makes it more practical for network intrusion data sets.

3) Many single-point outliers are associated with sporadic, trivial events, while clustered outliers are associated with significant, persistent anomalies, such as network anomalies caused by abnormal attacks over a period of time. Compared with methods based on local outliers, the method does not need to compute the outlier degree of every point, only that of each cluster, which reduces the time cost.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.

Drawings

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of a network intrusion detection method based on an improved density peak clustering algorithm of the present invention;

FIG. 2 shows the original outlier distribution on data set 1;

FIG. 3 shows the detection result of the proposed method on data set 1;

FIG. 4 shows the detection result of the isolation forest algorithm on data set 1;

FIG. 5 shows the detection result of the local outlier factor algorithm on data set 1;

FIG. 6 shows the original outlier distribution on data set 2;

FIG. 7 shows the detection result of the proposed method on data set 2;

FIG. 8 shows the detection result of the isolation forest algorithm on data set 2;

FIG. 9 shows the detection result of the local outlier factor algorithm on data set 2.

Detailed Description

Other advantages and effects of the invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention. It should be noted that the illustrations provided with the following embodiments merely illustrate the basic idea of the invention schematically, and that the following embodiments and the features in them may be combined with each other where no conflict arises.

Referring to FIGS. 1 to 9, the invention provides a network intrusion detection method based on an improved density peak clustering algorithm. The improved density peak clustering algorithm clusters the data set to obtain a relatively accurate clustering result; the resulting clusters are evaluated by computing an outlier factor for each cluster, and the clusters with the largest outlier degree are selected according to the ranking of the outlier factors, thereby detecting the outliers caused by abnormal network attacks.

First, the finite neighbors of each point in the data set are computed using Euclidean distance, and the distance matrix between each point and its neighbors is obtained and sorted. A natural neighbor search algorithm then finds the number of neighbors k at which a stable state is reached. The density of each point is computed from its k nearest neighbors, representative points are selected, and the distances between representative points are computed to obtain the shortest paths. The density peak clustering algorithm is applied to the representative points to obtain the clustering result. Finally, the outlier upper limit is computed from the resulting clusters, the outlier factor of each cluster is computed, and the outlier factors are sorted to obtain the final outlier result.

(1) Computing the neighbors of the points in the data set:

A ball tree based on Euclidean distance is built and traversed to query the finite neighbors of each node, yielding the ordered k-nearest-neighbor matrix and the ordered distance matrix.
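
The two ordered matrices described above can be illustrated with a small sketch. As an assumption made for brevity, an exhaustive search stands in for the ball tree here; it returns the same neighbors but costs O(n²) distance evaluations, so it only suits small examples. The name `knn_matrices` is hypothetical.

```python
import math


def knn_matrices(points, k):
    """Return (neighbor_idx, neighbor_dist): for each point, its k nearest
    neighbors ordered by increasing Euclidean distance."""
    neighbor_idx, neighbor_dist = [], []
    for i, p in enumerate(points):
        # Rank every other point by its distance to p (ties broken by index).
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbor_idx.append([j for _, j in ranked[:k]])
        neighbor_dist.append([d for d, _ in ranked[:k]])
    return neighbor_idx, neighbor_dist


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
neighbor_idx, neighbor_dist = knn_matrices(points, k=2)
```

Row i of `neighbor_idx` lists the neighbors of point i from nearest to farthest, and row i of `neighbor_dist` holds the matching distances, which is the layout the later steps rely on.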

(2) Natural neighbor search algorithm:

Using the ball tree results, the k nearest neighbors of the nodes are iterated with increasing k. A stable state is reached when every node appears among the k nearest neighbors of at least one other node, or when the number of data objects that appear in no other node's k-nearest-neighbor list stops changing. The k nearest neighbors at that point are called natural neighbors.
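
One way to read the stopping rule above is sketched below. It is an illustration under assumptions, not the patented procedure: the neighborhood size r grows by one per round, each point "votes" for its r-th nearest neighbor, and the search stops when no point is left without a vote or when the number of such points is the same as in the previous round. The names `natural_neighbor_k` and `reverse_count` are hypothetical.

```python
import math


def natural_neighbor_k(points):
    """Return the neighborhood size at which the natural neighbor search stabilizes."""
    n = len(points)
    # For each point, all other points ordered by increasing distance.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j, p=p: math.dist(p, points[j]))
        for i, p in enumerate(points)
    ]
    reverse_count = [0] * n  # how many points list this point as a neighbor
    previous_orphans = None
    for r in range(1, n):
        for i in range(n):
            reverse_count[order[i][r - 1]] += 1
        orphans = sum(1 for c in reverse_count if c == 0)
        # Stable state: nobody is left out, or the left-out count stopped changing.
        if orphans == 0 or orphans == previous_orphans:
            return r
        previous_orphans = orphans
    return n - 1


k = natural_neighbor_k([(0.0,), (1.0,), (3.0,)])
```

Because k is found from the data rather than supplied by the user, the later density and representative-point steps need no hand-tuned neighborhood size.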

(3) Density calculation for the nodes:

where rho(p_i) denotes the density of point p_i, p_i is a node in the data set, k is the iteration result of the natural neighbor search algorithm, N_k(p_i) denotes the set of k nearest neighbors of point p_i, and eu(p_i, o) denotes the Euclidean distance between point o and point p_i.

The Euclidean distance eu(p_i, o) is calculated as:

$$\mathrm{eu}(p_i, o) = \sqrt{\sum_{d=1}^{m} \left(p_{i,d} - o_d\right)^2}$$

where m is the number of features of each point, and p_{i,d} and o_d are the d-th coordinates of p_i and o.

(4) Selecting the representative points:

If a node has no representative point, the point with the highest density among its k nearest neighbors is selected as its representative point. If the node already has a representative point and there is a point with higher density that is closer than the current representative point, the representative point is replaced by that point. If the node is itself the representative point of other points, the representative point of those other points is likewise replaced by the denser, closer point.
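
The basic selection rule (the densest point among the k nearest neighbors) and the grouping of points that share a representative can be sketched as follows. This is a simplified illustration: the replacement rule described above is not implemented, each point is assumed to be a candidate for its own representative so that a local density peak represents itself, and the densities are taken as given inputs. The names `local_representatives` and `initial_subclusters` are hypothetical.

```python
from collections import defaultdict


def local_representatives(knn, density):
    """For each point, pick the densest point among itself and its k nearest neighbors."""
    cores = []
    for i, neighbors in enumerate(knn):
        candidates = [i] + list(neighbors)  # assumption: a point may represent itself
        cores.append(max(candidates, key=lambda j: density[j]))
    return cores


def initial_subclusters(cores):
    """Group point indices by their local representative point."""
    groups = defaultdict(list)
    for i, core in enumerate(cores):
        groups[core].append(i)
    return dict(groups)


# Toy input: k = 2 neighbor lists and one density value per point.
knn = [[1, 2], [0, 2], [1, 0], [4, 2], [3, 2]]
density = [1.0, 3.0, 2.0, 0.5, 0.9]
cores = local_representatives(knn, density)
subclusters = initial_subclusters(cores)
```

Only the representative points take part in the subsequent distance and clustering steps, which is what keeps the density peak clustering stage small.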

(5) Dividing the clusters of the representative points and calculating the distances between representative points:

Before the calculation, the clusters must be divided: the data points that share the same representative point are assigned to the same cluster, and the k nearest neighbors of each point in the cluster that are not yet in the cluster are also assigned to it. When two clusters intersect, the distance is calculated as:

where inset(i, j) is the set of data points in which the cluster of representative point i intersects the cluster of representative point j. When two clusters do not intersect, the distance between their representative points is set to the maximum value. An adjacency matrix is thus constructed with the representative points as nodes and the distances between representative points as weights.

(6) Calculating the shortest paths between representative points:

The shortest paths between representative points are calculated with the Floyd algorithm, a classical algorithm for finding the shortest paths between all pairs of points in a weighted graph. The state transition equation is:

$$\mathrm{matrix}[i, j] = \min\{\, \mathrm{matrix}[i, k] + \mathrm{matrix}[k, j],\ \mathrm{matrix}[i, j] \,\}$$

where matrix[i, j] denotes the shortest distance from point i to point j, and point k is an intermediate point that a path from i to j may pass through. The intermediate nodes of each path are then visited recursively to output the shortest paths between the representative points.
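
The state transition equation above corresponds to the following sketch of the Floyd (Floyd-Warshall) algorithm. Two choices here are assumptions for illustration: `math.inf` plays the role of the "maximum value" for representative points whose clusters do not intersect, and paths are recovered iteratively from a next-hop table rather than recursively. The names `floyd_warshall` and `shortest_path` are hypothetical.

```python
import math


def floyd_warshall(weights):
    """Return (dist, nxt): all-pairs shortest distances and a next-hop table."""
    n = len(weights)
    dist = [row[:] for row in weights]
    nxt = [[j if math.isfinite(weights[i][j]) else None for j in range(n)]
           for i in range(n)]
    for k in range(n):          # intermediate point
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, i, j):
    """List the nodes on the shortest path from i to j (empty if unreachable)."""
    if nxt[i][j] is None:
        return []
    route = [i]
    while i != j:
        i = nxt[i][j]
        route.append(i)
    return route


INF = math.inf
weights = [[0, 1, INF],
           [1, 0, 2],
           [INF, 2, 0]]
dist, nxt = floyd_warshall(weights)
route = shortest_path(nxt, 0, 2)
```

Replacing direct distances with shortest-path distances lets two representative points on the same curved (manifold) cluster count as close when they are connected through a chain of overlapping sub-clusters.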

(7) Applying the density peak clustering algorithm:

The relative distance delta is calculated: for a representative point p_i, delta is the distance between p_i and the nearest point whose density is higher than that of p_i. A two-dimensional decision graph is drawn with the density rho(p_i) on the abscissa and delta on the ordinate, and points with both large delta and large rho are manually selected as cluster centers. Each remaining representative point is assigned to the cluster of the nearest representative point with higher density, and the points in the cluster of each representative point are then assigned to the cluster of the corresponding cluster center, giving the clustering result C_1, C_2, …, C_k.
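
The delta computation and the assignment rule can be sketched as below. The sketch departs from the text in one labelled respect: instead of picking centers by hand from the decision graph, it assumes the densest point is always a center and takes the remaining centers with the largest product rho × delta, a common automatic substitute. The function name `density_peak_cluster` and the parameter `n_centers` are hypothetical.

```python
def density_peak_cluster(density, dist, n_centers):
    """Cluster points from their densities and a symmetric distance matrix.

    Returns (labels, delta), where delta[i] is the distance from point i to
    the nearest point of higher density.
    """
    n = len(density)
    order = sorted(range(n), key=lambda i: density[i], reverse=True)
    delta = [0.0] * n
    parent = [None] * n  # nearest point with higher density
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # convention for the densest point
        else:
            parent[i] = min(order[:pos], key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]
    # Assumption: automatic center choice by rho * delta replaces manual selection.
    rest = sorted(order[1:], key=lambda i: density[i] * delta[i], reverse=True)
    centers = [order[0]] + rest[: n_centers - 1]
    labels = [None] * n
    for cluster_id, center in enumerate(centers):
        labels[center] = cluster_id
    for i in order:  # descending density, so parent[i] is already labelled
        if labels[i] is None:
            labels[i] = labels[parent[i]]
    return labels, delta


# Toy one-dimensional example with two well separated groups.
positions = [0, 1, 2, 10, 11, 12]
density = [1.0, 3.0, 2.0, 2.5, 4.0, 1.5]
dist = [[abs(a - b) for b in positions] for a in positions]
labels, delta = density_peak_cluster(density, dist, n_centers=2)
```

In the method described here the inputs would be the representative points and their shortest-path distances, after which every ordinary point inherits the label of its representative.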

(8) Calculating the outlier upper limit:

Suppose density peak clustering yields the clusters C_1, C_2, …, C_k, where |C_1| ≥ |C_2| ≥ … ≥ |C_k|, that is, C_1 to C_k is the cluster sequence ordered by the number of data points in each cluster. Given the parameter a, the index i is found for which the clusters satisfy the following condition:

$$\{\, |C_1| + |C_2| + \dots + |C_{i-1}| \ge |R| \times a \,\} \cap \{\, |C_1| + |C_2| + \dots + |C_{i-2}| < |R| \times a \,\}$$

The number of data points in cluster C_i is then the outlier upper limit.
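
The condition on the ordered cluster sizes can be evaluated with a single running sum, as in the sketch below. The function name `outlier_upper_limit` is hypothetical, |R| is taken to be the total number of clustered points, and returning `None` when no index satisfies the condition is an assumption, since the text does not cover that case. The value a = 0.7 in the example is arbitrary.

```python
def outlier_upper_limit(cluster_sizes, a):
    """Return |C_i| for the first i with |C_1| + ... + |C_{i-1}| >= |R| * a."""
    sizes = sorted(cluster_sizes, reverse=True)  # |C_1| >= |C_2| >= ... >= |C_k|
    threshold = sum(sizes) * a                   # |R| * a
    running = 0                                  # |C_1| + ... + |C_{i-1}|
    for position, size in enumerate(sizes):
        if position >= 1 and running >= threshold:
            return size
        running += size
    return None  # assumption: no cluster qualifies


u = outlier_upper_limit([15, 50, 5, 30], a=0.7)
```

With sizes 50, 30, 15 and 5, the first two clusters hold 80 of 100 points, which first reaches the threshold of 70, so the limit is the size of the third cluster.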

(9) The outlier factor is calculated as:

where C_j is the assumed normal cluster adjacent to cluster C_i, and d(C_i, C_j) is calculated as:

$$d(C_i, C_j) = \min\{\, \mathrm{eu}(p, q) \mid p \in C_i,\ q \in C_j \,\}$$

The higher the CBOF, the more abnormal the cluster-based outlier. All outlier factors are therefore sorted in descending order, and the n clusters that do not exceed the outlier upper limit are selected as the outlier result.
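
The inter-cluster distance d(C_i, C_j) defined above is the smallest Euclidean distance over all cross-cluster pairs of points. A direct sketch follows; the name `cluster_distance` is hypothetical, and the exhaustive pairwise search is chosen for clarity rather than speed. The CBOF expression itself is not reproduced in the text, so it is not sketched here.

```python
import math


def cluster_distance(cluster_i, cluster_j):
    """Smallest Euclidean distance between any point of C_i and any point of C_j."""
    return min(math.dist(p, q) for p in cluster_i for q in cluster_j)


c_i = [(0.0, 0.0), (1.0, 0.0)]
c_j = [(4.0, 0.0), (5.0, 5.0)]
d_ij = cluster_distance(c_i, c_j)
```

For large clusters the pairwise search costs |C_i| × |C_j| distance evaluations; a spatial index such as the ball tree built earlier could reduce that.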

Comparison experiment:

The experiment uses the KDD Cup 99 data set for verification. The data set is widely used in the field of network intrusion detection; it was developed by researchers at McGill University, Canada, to evaluate the performance and accuracy of network intrusion detection systems. It is derived from network traffic collected in 1998 at a U.S. Air Force research laboratory and contains normal traffic and attack traffic, the latter divided into four categories, with about 490,000 records and 41 features.

Two data sets with different proportions of abnormal attacks are selected from the KDD Cup 99 data set with a rectangular selection frame and used as test objects: data set 1 contains 8710 records and 2178 abnormal records; data set 2 contains 9818 records and 303 abnormal records. The outlier ratio of data set 1 is about 30% and that of data set 2 is about 3%. The abnormal data contain abnormal points caused by different attack types, but this embodiment simplifies intrusion detection to a binary classification problem, that is, it only detects whether a record is a network attack. The local outlier factor algorithm and the isolation forest algorithm are two classical outlier detection methods. The detection results of the proposed method and of these two classical algorithms are shown graphically, with normal points drawn as solid circles and outliers as solid pentagons. The results are shown in FIGS. 2 to 9.

1) In this embodiment, the evaluation metrics are accuracy, recall, and F1-score. Accuracy is the proportion of correctly classified samples among all samples; recall is the proportion of actual positive samples that are detected; and the F1-score is the harmonic mean of precision and recall, where precision, the proportion of detected positives that are truly positive, reflects how well the model avoids misclassifying negative samples, and recall reflects how well it identifies positive samples. The higher the F1-score, the more robust the model. Table 1 lists the detection metrics of the proposed method and of the two classical outlier detection methods, the local outlier factor algorithm and the isolation forest algorithm, on data sets 1 and 2; the data indicate that the proposed method performs markedly better on the network intrusion data sets than the conventional outlier detection methods.
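
The three metrics can be computed from the counts of true and false positives and negatives, as in this sketch. It assumes the usual definition of the F1-score as the harmonic mean of precision and recall, treats label 1 as an attack and label 0 as normal traffic, and uses made-up labels purely for illustration; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only, not experimental data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

On data with few attacks, accuracy alone can look high even when most attacks are missed, which is why recall and the F1-score are reported alongside it.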

Table 1. Detection results

2) The data in Table 1 show that both classical algorithms perform poorly on the data set with an outlier ratio of 30%, whereas the proposed method performs well. On the data set with 3% anomalies, both classical algorithms perform better, while the proposed method shows only a small drop in performance. In general, common intrusion detection methods suit data sets with a low anomaly ratio: their performance falls as the number of abnormal points grows, and they recognize the abnormal clusters caused by network attacks poorly. The proposed method is more general, is largely insensitive to the anomaly ratio of the data set, targets the identification of abnormal clusters, and is suitable for network intrusion detection.

Experimental results show that the method provided by the invention can well detect the abnormal attack points in the network intrusion data sets with different abnormal data rates, and has better robustness.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims

1. The network intrusion detection method based on the improved density peak clustering algorithm is characterized by comprising the following steps of:

2. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 1, wherein in step S1, a principal component analysis method is applied to perform data dimension reduction on the normalized data.

3. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 1, wherein in step S4, the density calculation formula is:

4. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 3, wherein in step S7, the formula for calculating the distance between the local representative points core is:

5. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 1, wherein in step S7, the shortest path is obtained by using the Floyd algorithm.

6. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 4, wherein in step S9, the calculation formula of the outlier upper limit is:

7. The network intrusion detection method based on the improved density peak clustering algorithm according to claim 6, wherein in step S10, the calculation formula of the outlier factor of the cluster is:

$$d(C_i, C_j) = \min\{\, \mathrm{eu}(p, q) \mid p \in C_i,\ q \in C_j \,\}$$