CN107682319A

CN107682319A - Data stream anomaly detection and multiple verification method based on an enhanced angle anomaly factor

Info

Publication number: CN107682319A
Application number: CN201710823063.0A
Authority: CN
Inventors: 首照宇; �田�浩; 邹风波; 张彤; 程夏威; 文辉; 赵晖; 莫建文; 汪延国; 曾情; 李希成
Original assignee: GUILIN YUHUI INFORMATION TECHNOLOGY Co Ltd; Guilin University of Electronic Technology
Current assignee: GUILIN YUHUI INFORMATION TECHNOLOGY Co Ltd; Guilin University of Electronic Technology
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2018-02-09
Anticipated expiration: 2037-09-13
Also published as: CN107682319B

Abstract

Disclosed is a method for data stream anomaly detection and multiple verification based on an enhanced angle anomaly factor, comprising the following steps: 1) processing the real-time data stream; 2) setting the data set S in the sliding window; 3) initializing the parameters k, r, ξ; 4) obtaining the distance matrix dist; 5) obtaining the r neighborhood point sets; 6) obtaining the angle factor and local density of the r neighborhood point sets; 7) obtaining the dissimilarity; 8) obtaining the cluster center factor of each data point; 9) obtaining the attribution matrix; 10) determining the cluster centers and clustering; 11) performing anomaly detection on each cluster after clustering; 12) multiple verification. The method applies sliding window and basic window techniques to construct an efficient data stream processing model; it reduces memory occupancy and offers good real-time performance, high anomaly detection accuracy and low time complexity.

Description

Enhanced angle anomaly factor-based data flow anomaly detection and multi-verification method

Technical Field

The invention relates to data flow anomaly detection and data clustering, in particular to a data flow anomaly detection and multiple verification method based on enhanced angle anomaly factors.

Background

The rapid development of network technology and the continuing spread of informatization in society have led to explosive growth in the amount of information, so that many industries generate massive, high-speed, dynamic stream data, for example in network intrusion monitoring, commercial transaction management and analysis, video surveillance, and sensor network monitoring. Because data streams are real-time, unbounded and dynamic, traditional static anomaly detection methods cannot accurately and effectively analyze and process such large-scale, dynamically growing stream data, so constructing an effective real-time anomaly detection method suited to data streams has become particularly important.

Researchers have proposed different data stream anomaly detection methods for the practical problems encountered at different stages. Conventional methods can be roughly classified as density-based, angle-based and cluster-based. Density-based methods use density as the basic anomaly measure and construct a dynamically updatable anomaly factor that measures how anomalous each data point is. Pokrajac et al. brought the static anomaly detection method LOF to data streams and studied an incremental local anomaly detection method, INCLOF, applicable to dynamic data streams; INCLOF deletes historical data and dynamically updates the anomaly factor of each data point as new data are inserted. Ke Gao et al. improved INCLOF by introducing the idea of a sliding window and proposed the n-INCLOF method, which updates only the anomaly factors of the data objects in the sliding window at the current moment. Some data points appear anomalous at one moment but not at the next; to address this, Karimian S. H. et al. proposed the I-IncLOF method, which introduces the idea of multiple verification and judges as anomalies only those data objects that remain anomalous throughout the whole sliding process of the window. I-IncLOF greatly reduces the misjudgment rate, but its effectiveness is poor in multidimensional settings. Xinjie Lu et al. proposed the INCLOCI method, which introduces the multi-granularity anomaly factor MDEF and can detect not only scattered outliers but also anomalous clusters.
To address the reduced effectiveness of similarity measures such as distance and density in high-dimensional data spaces, some researchers have proposed angle-based measures. The basic idea is that the angles an anomalous point forms with other points are generally small and fluctuate within a narrow range, whereas the angles a normal point forms with other points vary widely. H. P. Kriegel et al. proposed the angle-based anomaly detection method ABOD, which takes the variance of the angles as the anomaly factor ABOF for measuring how anomalous a data point is; ABOD retains high detection accuracy in high-dimensional spaces. Ye H. proposed the angle-based data stream anomaly detection method DSABOD, which dynamically updates the anomaly factor of each data point relative to its neighborhood points as data points continuously flow into memory. DSABOD offers a new approach to anomaly detection in high-dimensional data streams, but traditional angle-based data stream anomaly detection methods suffer from a low anomaly detection rate.
Cluster-based data stream anomaly detection methods consist of two stages: clustering the data points, and performing anomaly detection on the data points within each cluster. Elahi M. et al. proposed a cluster-based data stream anomaly detection method that combines K-Means with LOF and defines the anomaly factor by region, which improves the detection accuracy. Thakran Y. et al. proposed a method combining DBSCAN with W-K-Means. It applies DBSCAN to cluster the data block of the current moment, yielding candidate anomalous points and initial clusters; it then merges in the candidate anomalous points from the previous moment that are awaiting multiple verification and applies W-K-Means to cluster again, yielding the candidate anomalous points and normal-point clusters of the current moment. At the same time, it uses multiple verification to delete misjudged candidate anomalous points and free memory. Throughout the process the method dynamically adjusts the DBSCAN parameters MinPts and ε and the attribute weights of W-K-Means. Its anomaly detection accuracy is high, but it requires too many manually set parameters, depends heavily on human intervention, has high complexity, and is less effective in multidimensional spaces.

Data flow abnormity detection is a research hotspot and difficulty in the field of data mining nowadays, and the main aim is to accurately detect information which does not conform to a conventional mode in real time from a complex data environment which is dynamically changed.

Disclosure of Invention

The invention provides a data flow anomaly detection and multi-verification method based on enhanced angle anomaly factors, which aims at the problems of high time complexity, large memory occupation, low use efficiency, excessive manual parameter intervention, low effectiveness in a multi-dimensional data environment and the like of a traditional method. The method can reduce the occupancy rate of the memory, and has good real-time performance, high accuracy rate of abnormal detection and low time complexity.

The technical scheme for realizing the purpose of the invention is as follows:

a method for data flow abnormity detection and multiple verification based on an enhanced angle abnormity factor comprises the following steps:

1) Processing the real-time data stream: processing various real-time data streams acquired by a data acquisition terminal;

2) Setting a data set S in a sliding window: the processing of step 1) yields the data set S in the current sliding window. Let S = {X₁, X₂, …, X_n} contain n data points, each represented by its attributes, for subsequent clustering and anomaly detection;

3) Initializing the parameters k, r, ξ: the initialization parameters are set, where k is the number of nearest neighbors of a data point, r is the spatial neighborhood radius of a data point, and ξ is the adjustment coefficient of the anomaly decision threshold. The anomaly decision threshold is θ = μ + ξ·δ, where μ and δ are the mean and standard deviation of the enhanced angle anomaly factors of all data points;

4) Obtaining the distance matrix dist: using the data set S of step 2), the distances between all data points are calculated, giving the n × n distance matrix dist = [d_ij]_{n×n}, computed by formula (1):

5) Obtaining the r neighborhood point set: according to the spatial neighborhood radius r, the r neighborhood point set of each data point is obtained, namely the set of all data points enclosed by the circle of radius r centered at that point;

6) Obtaining the angle factor and local density of the r neighborhood point set: the angle factor and the local density of the r neighborhood point set are obtained using the distance matrix dist;

7) Obtaining the dissimilarity δ(x_i): the local densities of the r neighborhood point sets obtained in step 6) are sorted, and the corresponding dissimilarity δ(x_i) is then calculated;

8) Obtaining the cluster center factor τ(x_i) of each data point: steps 6) and 7) are combined to obtain the cluster center factor τ(x_i), computed by formula (5). The cluster center factor τ(x_i) measures the degree to which a data point lies at a cluster center;

9) Obtaining the attribution matrix: the cluster center factors of all data points obtained in step 8) are sorted in descending order, giving τ(p₁) ≥ τ(p₂) ≥ … ≥ τ(p_n), from which the attribution matrix F = [f₁, f₂, …, f_n] used for clustering is obtained;

10) Determining the cluster centers and clustering: cluster center determination and clustering are performed on the data set S using the cluster center factor and the attribution matrix. All data points with the same class label form a set, i.e. a cluster, giving m (m = C_{center_id}) clusters C₁, C₂, …, C_m, which completes the clustering of the data set S;

11) Performing anomaly detection on each cluster after clustering: for the clusters C_i (i = 1, 2, …, m) obtained in step 10), anomaly detection is performed on each cluster C₁, C₂, …, C_m of the clustered data set S separately, giving the anomalous point set O_i of each cluster and finally the set of all anomalous points O = {O₁, …, O_m} in the data set S. The formulas involved in anomaly detection are as follows. The intra-cluster angle factor V_{C_i}(X_j) is given by formula (7):

the local increment value H(X_j) is given by formula (8):

the k-nearest-neighbor distance sum L(X_j) is given by formula (9):

where the k neighborhood of the data point X_j consists of its k nearest neighbors within the cluster to which it belongs;

the enhanced angle anomaly factor EAOF(X_j) is given by formula (10):

where o is the cluster center of the cluster to which the data point X_j belongs, dist(o, X_j) is the distance from X_j to that cluster center, V_{C_i}(X_j) is the angle factor of each data point in cluster C_i (i = 1, 2, …, m) relative to that cluster, and H(X_j) is the local increment value;

12) Multiple verification: all candidate anomalous points are verified multiple times. A candidate that still appears anomalous after a limited number of verifications is judged to be a confirmed anomalous point and is output and stored; a candidate that appears as a normal point during verification is discarded directly. This increases the accuracy of anomaly detection.
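Steps 4) and 5) can be illustrated with a minimal sketch. Formula (1) is not reproduced in this text, so the sketch assumes the Euclidean distance (the description elsewhere refers to the Euclidean distance between data points). The function names, the choice to exclude a point from its own neighborhood, and the use of `<=` for the radius test are illustrative assumptions, not the patent's definitions.

```python
import numpy as np

def distance_matrix(S):
    """Pairwise Euclidean distances for an (n, d) array S; returns an (n, n) matrix."""
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def r_neighborhoods(dist, r):
    """For each point i, the indices j != i with dist[i, j] <= r (assumed convention)."""
    within = dist <= r
    np.fill_diagonal(within, False)  # assumption: a point is not its own neighbor
    return [np.flatnonzero(row) for row in within]

# Toy data set S with four 2-D points (hypothetical values).
S = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [10.0, 10.0]])
dist = distance_matrix(S)
neighbors = r_neighborhoods(dist, r=1.5)
```

With these toy points, only the first and third points fall within radius 1.5 of each other; the other two have empty r neighborhoods.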

The processing in step 1) means that the data acquired by the data acquisition terminal are cached in stream form and the cached data are divided into data blocks E₀, E₁, E₂, …, where each data block represents a basic window and each sliding window W contains ε (ε = 2) basic windows. Insertion and deletion of data are realized by combining the basic window and the sliding window as follows: when time T_i transitions to T_{i+1}, the sliding window slides from W_i to W_{i+1}; the new basic window E_{i+1} is merged in and the historical basic window E_{i-1} is removed, while the candidate anomalous points detected in W_i at time T_i are merged into W_{i+1} for multiple verification.
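The window mechanics described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `SlidingWindow`, its method names, and the representation of data points as tuples are hypothetical, and skipping carried-over candidates that are still present in a retained basic window is a design choice of this sketch rather than something the text specifies.

```python
from collections import deque

class SlidingWindow:
    """Sliding window made of `epsilon` basic windows (epsilon = 2 in the text)."""

    def __init__(self, epsilon=2):
        # deque(maxlen=...) evicts the oldest basic window automatically.
        self.blocks = deque(maxlen=epsilon)
        self.candidates = []

    def slide(self, new_block, candidates_from_previous=()):
        """Merge the new basic window, drop the oldest, carry candidates over."""
        self.blocks.append(list(new_block))
        self.candidates = list(candidates_from_previous)
        return self.data()

    def data(self):
        """Data set S of the current window: retained blocks plus carried candidates."""
        points = [p for block in self.blocks for p in block]
        carried = [c for c in self.candidates if c not in points]
        return points + carried

w = SlidingWindow(epsilon=2)
w.slide([(0, 0), (1, 1)])                                 # E0
w.slide([(2, 2)])                                         # E1, window holds E0 and E1
S = w.slide([(3, 3)], candidates_from_previous=[(0, 0)])  # E2 in, E0 out
```

After the third call, the window holds the blocks E1 and E2, and the candidate `(0, 0)` from the evicted block E0 is kept in S so that it can be verified again.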

The angle factor of the r neighborhood point set in step 6) is calculated by formula (2):

The local density of the r neighborhood point set in step 6) is calculated by formula (3):

The local density depends on the number and position of the neighborhood data points: the more neighborhood data points there are and the closer they lie to the center of the data set, the larger the local density.

For the dissimilarity δ(x_i) in step 7), the local densities of all data points are sorted in descending order, and the dissimilarity δ(x_i) is calculated by formula (4):

The attribution matrix F = [f₁, f₂, …, f_n] in step 9) records the attribution relationships between data points; each element is given by formula (6):

where {p_i} denotes the original subscript indices after the cluster center factors τ(x_i) are sorted in descending order.

The data stream anomaly detection method comprises two processes: data stream processing and data stream anomaly detection. The data stream processing process converts the dynamic data stream into static data blocks, which facilitates the subsequent anomaly detection and ensures the real-time performance and efficiency of the overall detection. The anomaly detection process performs anomaly detection on the static data set produced by the data stream processing process; to improve detection accuracy, it clusters first and then detects anomalies. In this technical scheme, the real-time data stream processing method that combines the sliding window and the basic window is the core of the data stream processing process; it reduces memory occupancy and improves the quality of the subsequent anomaly detection. The cluster center factor and the attribution matrix are two parameters newly introduced in this scheme for determining the cluster centers and clustering; they allow the cluster centers of a multidimensional data space to be determined quickly and effectively, and the clustering to be performed accurately according to those centers. The enhanced angle anomaly factor is another important parameter of the scheme; it makes up for some defects of traditional anomaly factors while retaining the effectiveness of the angle measure in multidimensional space, and it is the core of the anomaly detection part.

The method applies sliding window and basic window technologies, constructs an efficient data stream processing model, reduces the occupancy rate of the memory, and has good real-time performance, high accuracy of abnormal detection and low time complexity.

Drawings

FIG. 1 is a schematic flow chart of the method in the example;

FIG. 2 is a schematic diagram of the data point distribution in the sliding window at time t₁ in the embodiment;

FIG. 3 is a schematic diagram of the data point distribution in the sliding window at time t₂ in the embodiment;

FIG. 4 is a diagram illustrating the combination of the sliding window and the base window to process the real-time data stream and the multiple verification processes in one embodiment;

FIG. 5 is a graph showing an angular measure of data points in an embodiment;

FIG. 6 is a schematic diagram illustrating a data point distribution of the U-shaped cluster data based on the conventional angle measurement method in the embodiment;

FIG. 7 is a schematic diagram illustrating a data point distribution of multi-cluster data misjudged based on a conventional angle measurement method in an embodiment;

FIG. 8 is a diagram illustrating the distribution of original coordinates of a data set in an embodiment;

FIG. 9 is a schematic diagram showing a local density-degree of dissimilarity distribution in the example;

FIG. 10 is a diagram showing the distribution of the cluster center factors in the example;

FIG. 11a is a schematic diagram of the distribution of the data set 1 in the example;

FIG. 11b is a diagram showing the distribution of outliers in the data set 1 according to the example;

FIG. 11c is a schematic diagram showing the abnormal point identifiers detected by the abnormal detection of the data set 1 in the embodiment;

FIG. 11d is a schematic diagram marking the normal points of data set 1 that the anomaly detection falsely detected as anomalous points in the embodiment;

FIG. 12a is a schematic diagram of the distribution of the data set 2 in the example;

FIG. 12b is a diagram illustrating the distribution of the anomalous points in data set 2 in the embodiment;

FIG. 12c is a schematic diagram showing the identification of an abnormal point detected by the abnormal detection of the data set 2 in the embodiment;

FIG. 12d is a schematic diagram marking the anomalous points of data set 2 that the anomaly detection detected as normal points in the embodiment.

Detailed Description

The invention will be further illustrated, but not limited, by the following description of the embodiments with reference to the accompanying drawings.

Referring to fig. 1, a method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factors includes the following steps:

1) Processing the real-time data stream: the various real-time data streams acquired by the data acquisition terminal are processed. Real-time data streams are dynamic and changeable, and some data objects appear anomalous in the current sliding window but appear as normal points in the sliding window at the next moment, as shown in FIG. 2 and FIG. 3. FIG. 2 shows the distribution of data points in the sliding window at time t₁, where point P′ appears anomalous; as data points continue to flow in, more and more data points accumulate around P′. FIG. 3 shows the distribution of data points in the sliding window at time t₂, where point P′ appears normal;

2) Setting a data set S in a sliding window: the processing of step 1) yields the data set S in the current sliding window. Let S = {X₁, X₂, …, X_n} contain n data points, each represented by its attributes, for subsequent clustering and anomaly detection;

3) Initializing the parameters k, r, ξ: the initialization parameters are set, where k is the number of nearest neighbors of a data point, r is the spatial neighborhood radius of a data point, and ξ is the adjustment coefficient of the anomaly decision threshold. The anomaly decision threshold is θ = μ + ξ·δ, where μ and δ are the mean and standard deviation of the enhanced angle anomaly factors of all data points;

6) Obtaining the angle factor and local density of the r neighborhood point set: the angle factor and the local density of the r neighborhood point set are obtained using the distance matrix dist. As shown in FIG. 5, the method builds on the angle measurement idea, which computes the angles that a data point forms with each pair of other data points and then takes their variance. For the core region point A₁, the angles it forms with pairs of other points vary over a wide range, so the variance is large; for the anomalous point A₃, the angles vary over a very narrow range, so the variance is small; and for the boundary point A₂, the range of angles lies between those of A₁ and A₃, so its variance lies between that of the core region point and that of the anomalous point. This measure has some defects, however, as shown in FIG. 6 and FIG. 7. The anomalous point B₁ in FIG. 6 lies at the center of the U-shaped cluster, and the angles it forms with the surrounding point pairs vary widely, i.e. the variance is large, whereas for the edge point B₂ the angles vary over a small range, i.e. the variance is small. Similarly, the anomalous point D₁ in FIG. 7 lies between two clusters, and the angles it forms with point pairs from the two clusters vary widely, whereas for the edge point D₂ the range of angles is smaller. The result is the opposite of the actual situation, so missed detections and misjudgments occur;

7) Obtaining the dissimilarity δ(x_i): the local densities of the r neighborhood point sets obtained in step 6) are sorted, and the corresponding dissimilarity δ(x_i) is then calculated;

8) Obtaining the cluster center factor τ(x_i) of each data point: steps 6) and 7) are combined to obtain the cluster center factor τ(x_i), computed by formula (5). The cluster center factor τ(x_i) measures the degree to which a data point lies at a cluster center. In the method of this embodiment it is an improved parameter for determining the cluster centers of a multidimensional data space quickly and effectively, and it is a crucial step in clustering. The process is illustrated in FIG. 8, FIG. 9 and FIG. 10. FIG. 8 shows that the data set consists of two clusters, with point 13 and point 25 as their respective cluster centers; FIG. 9 shows the ρ-δ (local density versus dissimilarity) distribution of the points in the data set obtained from formulas (3) and (4), where both the local density and the dissimilarity of points 13 and 25 are large; FIG. 10 shows the points sorted by descending cluster center factor according to formula (5), where points 13 and 25 have the largest cluster center factors and are therefore the most likely cluster centers;

9) Obtaining the attribution matrix: the cluster center factors of all data points obtained in step 8) are sorted in descending order, giving τ(p₁) ≥ τ(p₂) ≥ … ≥ τ(p_n), from which the attribution matrix F = [f₁, f₂, …, f_n] used for clustering is obtained;

10) Determining the cluster centers and clustering: cluster center determination and clustering are performed on the data set S using the cluster center factor and the attribution matrix. All data points with the same class label form a set, i.e. a cluster, giving m (m = C_{center_id}) clusters C₁, C₂, …, C_m, which completes the clustering of the data set S;

11) Performing anomaly detection on each cluster after clustering: for the clusters C_i (i = 1, 2, …, m) obtained in step 10), anomaly detection is performed on each cluster C₁, C₂, …, C_m of the clustered data set S separately, giving the anomalous point set O_i of each cluster and finally the set of all anomalous points O = {O₁, …, O_m} in the data set S. The formulas involved in anomaly detection are as follows. The intra-cluster angle factor V_{C_i}(X_j) is given by formula (7):

the local increment value H(X_j) is given by formula (8):

the k-nearest-neighbor distance sum L(X_j) is given by formula (9):

the enhanced angle anomaly factor EAOF(X_j) is given by formula (10):

where o is the cluster center of the cluster to which the data point X_j belongs, dist(o, X_j) is the distance from X_j to that cluster center, V_{C_i}(X_j) is the angle factor of each data point in cluster C_i (i = 1, 2, …, m) relative to that cluster, and H(X_j) is the local increment value;

12) Multiple verification: all candidate anomalous points are verified multiple times. A candidate that still appears anomalous after a limited number of verifications is judged to be a confirmed anomalous point and is output and stored; a candidate that appears as a normal point during verification is discarded directly. This improves the accuracy of anomaly detection.
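The conventional angle measurement idea discussed under step 6) (the angles a point forms with every pair of other points, followed by their variance) can be sketched as follows. This is an unweighted simplification for illustration only; it is not the patent's formula (2) or formula (7), which are not reproduced in this text, and the function name `angle_variance` and the toy data are assumptions.

```python
import itertools
import numpy as np

def angle_variance(p, others):
    """Variance of the angles at p formed with every pair of other points."""
    angles = []
    for a, b in itertools.combinations(others, 2):
        u, v = a - p, b - p
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        if nu == 0 or nv == 0:
            continue  # skip points that coincide with p
        cos = np.clip(np.dot(u, v) / (nu * nv), -1.0, 1.0)
        angles.append(np.arccos(cos))
    return float(np.var(angles)) if angles else 0.0

# Twelve points on the unit circle stand in for a cluster (hypothetical data).
t = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
cluster = np.column_stack([np.cos(t), np.sin(t)])

inside = angle_variance(np.array([0.0, 0.0]), cluster)    # surrounded point
outside = angle_variance(np.array([10.0, 0.0]), cluster)  # distant point
```

The surrounded point sees angles spread over a wide range and gets the larger variance, while the distant point sees only narrow angles. The same behaviour explains the defect illustrated by B₁ in FIG. 6: an isolated point enclosed by a U-shaped cluster also yields a large variance under this measure.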

The processing in step 1) means that the data acquired by the data acquisition terminal are cached in stream form and the cached data are divided into data blocks E₀, E₁, E₂, …, where each data block represents a basic window and each sliding window W contains ε (ε = 2) basic windows. Insertion and deletion of data are realized by combining the basic window and the sliding window, as shown in FIG. 4: when time T_i transitions to T_{i+1}, the sliding window slides from W_i to W_{i+1}; the new basic window E_{i+1} is merged in and the historical basic window E_{i-1} is removed, while the candidate anomalous points detected in W_i at time T_i are merged into W_{i+1} for multiple verification.
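The multiple verification of step 12), applied to the candidates carried from one window to the next, can be sketched as a simple bookkeeping routine. The text only says that a candidate is confirmed after a limited number of verifications, so the parameter `max_checks`, the function name `update_candidates` and the use of point identifiers are hypothetical choices made for this sketch.

```python
def update_candidates(pending, flagged_now, max_checks=3):
    """One verification round.

    pending:     dict mapping point id -> number of windows in which it was flagged
    flagged_now: set of point ids flagged as candidates in the current window
    Returns (new_pending, confirmed).
    """
    new_pending, confirmed = {}, []
    for pid in flagged_now:
        count = pending.get(pid, 0) + 1
        if count >= max_checks:
            confirmed.append(pid)      # still anomalous after max_checks rounds
        else:
            new_pending[pid] = count   # keep for the next window
    # Ids in `pending` that are absent from `flagged_now` appeared normal
    # in this window, so they are dropped by not copying them.
    return new_pending, sorted(confirmed)

pending, out1 = update_candidates({}, {"a", "b"})
pending, out2 = update_candidates(pending, {"a"})       # "b" now looks normal
pending, out3 = update_candidates(pending, {"a", "c"})  # "a" flagged a third time
```

In this run, `"b"` is discarded after appearing normal once, `"a"` is confirmed in the third round, and `"c"` starts its own verification count.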

For the dissimilarity δ(x_i) in step 7), the local densities of all data points are sorted in descending order, and the dissimilarity δ(x_i) is calculated by formula (4). The dissimilarity measures the likelihood that data points belong to different clusters. Given the data set S, the local densities obtained in step 6) are sorted in descending order, where {p_i} denotes the original subscript indices after the local densities are sorted in descending order and d(p_i, p_j) denotes the Euclidean distance between data points p_i and p_j. The dissimilarity of a data point p_i is then defined as follows:

The attribution matrix F = [f₁, f₂, …, f_n] in step 9) records the attribution relationships between data points; each element is given by formula (6):

where {p_i} denotes the original subscript indices after the cluster center factors τ(x_i) are sorted in descending order.

In step 10), determining the cluster centers and clustering proceeds as follows. The cluster center index is defined as C_{center_id} and the label of a data point as C_{cluster_label}; the cluster center index is initialized to 1, i.e. C_{center_id} = 1, and the data point with the largest cluster center factor obtained in step 8) is also labeled 1. Then, following the descending subscript indices {p_i} obtained in step 8), the whole data set S is traversed conditionally: if the distance condition involving the neighborhood radius r (the initial parameter value) is satisfied, the point is defined as a new cluster center and the class label is incremented by 1; all cluster centers are obtained in this way. Next, using the cluster centers obtained and the attribution matrix F = [f₁, f₂, …, f_n] of step 9), points belonging to the same cluster center are given the same label (i.e. class label) as follows: following the descending subscript indices {p_i} obtained in step 9), the whole data set S is traversed conditionally; if p_i is not a cluster center, the label of the corresponding point in the attribution matrix is assigned to p_i; otherwise the label of p_i is its own. Finally, all data points with the same class label form a set, i.e. a cluster, giving m (m = C_{center_id}) clusters C₁, C₂, …, C_m, which completes the clustering of the data set S.

Step 11) is to perform anomaly detection on each clustered cluster, and the anomaly detection specifically includes the following steps:

(1) For any cluster C_i (i = 1, 2, …, m), calculate the angle factor of each data point in the cluster relative to that cluster, as in formula (7):

where C_i (i = 1, 2, …, m) denotes any cluster after clustering;

(2) Calculate the local increment value H(X_j) of each data point in the cluster within its spatial r neighborhood, as in formula (8):

The local increment reflects the density of data points within the spatial neighborhood of a data point in the cluster to which it belongs; the count used in the formula is the number of data points in the r neighborhood of the data point X_j within its cluster;

(3) Calculate the distance dist(o, X_j) between each data point and the cluster center of its cluster, using the cluster centers determined in step 10);

(4) Calculate the distance sum L(X_j) of each data point to its k nearest neighbors, as in formula (9):

where the k neighborhood of the data point X_j consists of its k nearest neighbors within the cluster to which it belongs. The k-nearest-neighbor distance sum L(X_j) reflects how far the data point is from its surrounding data points, which avoids the defect of the angle-based anomaly factor illustrated by point B₁ in FIG. 6;

(5) Calculate the enhanced angle anomaly factor EAOF(X_j) of each data point, as in formula (10):

where o is the cluster center of the cluster to which the data point X_j belongs, dist(o, X_j) is the distance from X_j to that cluster center, V_{C_i}(X_j) is the angle factor of each data point in cluster C_i (i = 1, 2, …, m) relative to that cluster, and H(X_j) is the local increment value. The enhanced angle anomaly factor EAOF retains the good measurement performance of the angle measure in multidimensional space and also introduces the ideas of distance and density, making up for the defects of traditional methods based on the angle anomaly factor;

(6) Calculate the mean μ and the standard deviation δ of the enhanced angle anomaly factors of all data points obtained in (5), and use them to calculate the anomaly decision threshold θ = μ + ξ·δ, where ξ is the initially set adjustment coefficient of the anomaly decision threshold;

(7) Compare the enhanced angle anomaly factor EAOF(X_j) of each point obtained in (5) with the decision threshold θ obtained in (6); if EAOF(X_j) > θ, mark the point as a candidate anomalous object of the cluster and store it in the cluster's candidate anomalous point set O_i.
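Sub-steps (6) and (7) can be sketched as below, taking the enhanced angle anomaly factors as given inputs because formula (10) is not reproduced in this text. The function name `candidate_anomalies`, the example scores and the use of the population standard deviation are assumptions of this sketch.

```python
import numpy as np

def candidate_anomalies(eaof, xi):
    """Return (theta, indices) where theta = mu + xi * delta and EAOF > theta."""
    eaof = np.asarray(eaof, dtype=float)
    mu = eaof.mean()
    delta = eaof.std()        # assumption: population standard deviation
    theta = mu + xi * delta
    return theta, np.flatnonzero(eaof > theta)

# Hypothetical EAOF scores for five points; the last one stands out.
scores = [1.0, 1.0, 1.0, 1.0, 9.0]
theta, flagged = candidate_anomalies(scores, xi=1.0)
```

Here μ = 2.6 and δ = 3.2, so θ = 5.8 and only the last point exceeds the threshold. A larger ξ raises θ and flags fewer candidates.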

The embodiment provides a data stream anomaly detection and multiple verification method based on enhanced angle anomaly factors, which adopts a technology of combining a sliding window and a basic window, constructs a high-efficiency real-time data stream processing technology, and introduces the enhanced angle anomaly factors, thereby solving the problems of high memory occupancy rate and low data processing efficiency of the traditional method, and simultaneously ensuring the advantages of high real-time performance, high anomaly detection accuracy and low time complexity.

To verify the effectiveness of the method of this embodiment, simulation results are compared and explained below:

In this embodiment, verification is performed on both artificially generated data sets and a real data set, and the method is compared with the conventional method I-IncLOF and with the weighted-clustering-based unsupervised data stream anomaly detection method proposed by Thakran et al. (abbreviated as method I). The experimental data set information is shown in Table 1; the three data sets differ in dimension, data volume and data characteristics.

The data distribution of artificial data set 1 is shown in FIG. 11a. It contains 1615 data points in total and consists of 5 clusters and 15 discrete points. Clusters 1, 2 and 3 each consist of 500 data points generated by the Gaussian distributions N₁(μ₁, Σ₁), N₂(μ₂, Σ₂) and N₃(μ₃, Σ₃), respectively; clusters 4 and 5 each consist of 50 data points generated by the Gaussian distributions N₄(μ₄, Σ₄) and N₅(μ₅, Σ₅), respectively. Because N₄ and N₅ contain very few data points, they are regarded as abnormal clusters. In addition, 15 discrete abnormal points are randomly generated according to the distribution characteristics of the data set, so the data set contains 115 abnormal points in total; their distribution is shown in FIG. 11b, where the abnormal points are marked by circles. During the experiment, the abnormal clusters and the discrete abnormal points are randomly mixed into the normal clusters. The following Gaussian parameters are used to generate data set 1:

μ₁ = [+1 +1], μ₂ = [-1 -1], μ₃ = [+1 -1], μ₄ = [-1 +1], μ₅ = [0 0]
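A data set with the same composition as artificial data set 1 could be generated roughly as follows. The means and cluster sizes come from the description above; everything else is an assumption made for illustration: the covariance matrices are not given in this text, so an isotropic standard deviation `sigma` is used, the 15 discrete abnormal points are drawn uniformly from the square [-2, 2] × [-2, 2], and the function name `make_dataset_1` is hypothetical.

```python
import random

MEANS = [(1, 1), (-1, -1), (1, -1), (-1, 1), (0, 0)]  # mu_1 .. mu_5
SIZES = [500, 500, 500, 50, 50]                        # points per cluster


def make_dataset_1(sigma=0.2, seed=0):
    """Return (points, labels); label 1 marks an abnormal point.

    Assumptions: isotropic Gaussian with standard deviation `sigma`,
    discrete abnormal points uniform in [-2, 2] x [-2, 2].
    """
    rng = random.Random(seed)
    points, labels = [], []
    for c, ((mx, my), n) in enumerate(zip(MEANS, SIZES)):
        for _ in range(n):
            points.append((rng.gauss(mx, sigma), rng.gauss(my, sigma)))
            labels.append(1 if c >= 3 else 0)  # clusters 4 and 5 are abnormal
    for _ in range(15):
        points.append((rng.uniform(-2, 2), rng.uniform(-2, 2)))
        labels.append(1)
    # Mix abnormal clusters and discrete points into the normal clusters
    order = list(range(len(points)))
    rng.shuffle(order)
    return [points[i] for i in order], [labels[i] for i in order]
```

The shuffled order stands in for the arrival order of the stream, so abnormal points appear interleaved with normal ones.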

The data distribution of artificial data set 2 is shown in FIG. 12a. It contains 860 data points and consists of 3 normal clusters, 1 abnormal cluster and 48 discrete abnormal points, wherein the abnormal cluster consists of 21 abnormal points. The data set therefore contains 69 abnormal points, whose distribution is shown in FIG. 12b.

The real data set Breast Cancer is listed in Table 1. It is taken from the UCI Machine Learning Repository, contains 699 data points, and consists of two normal clusters. To verify the validity of the method, 34 abnormal points are added to the real data set according to statistical characteristics such as its mean and variance, and are used for comparison and verification of anomaly detection.

In the verification experiment of the method of this embodiment, the length of a basic window is set to 20, and two basic windows form a sliding window. The number of nearest neighbors is k = 3. The spatial neighborhood radius r is determined as the mean of the first 20% of the distance values between the data points in the sliding window at the current time, taken in descending order. The adjustment coefficient of the anomaly judgment threshold is ξ = 2.5, and the number of multiple verifications is 3. The detection rate and the false positive rate, which best reflect the effectiveness of an anomaly detection method, are selected for comparison. FIGS. 11a to 11d and FIGS. 12a to 12d show the visualized experimental results for data set 1 and data set 2.
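The rule for choosing the spatial neighborhood radius r can be sketched as follows. This follows the wording above literally (the largest 20% of the pairwise distances, averaged); the function name `neighborhood_radius`, the use of Euclidean distance, and rounding the 20% count up to at least one value are assumptions.

```python
import math
from itertools import combinations


def neighborhood_radius(points, fraction=0.2):
    """Mean of the first `fraction` of pairwise distances in descending order.

    Assumptions: Euclidean distance; the number of values kept is
    ceil(fraction * number_of_pairs), and at least one value is kept.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    dists = sorted(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        reverse=True,
    )
    top = dists[: max(1, math.ceil(fraction * len(dists)))]
    return sum(top) / len(top)
```

With a sliding window of 40 points this sorts 780 pairwise distances per window, which stays cheap because the window size is fixed.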

For artificial data set 1, as can be seen from FIGS. 11a to 11d, the method effectively detects the 2 abnormal clusters and the 15 discrete abnormal points, achieving zero missed detections. As can be seen from FIG. 11d, 3 normal points are mistakenly detected as abnormal points: although these points are generated by the normal Gaussian distributions, they lie slightly far from the normal clusters and appear abnormal in 3 consecutive verifications, and are therefore determined to be abnormal points;

For artificial data set 2, as can be seen from FIGS. 12a to 12d, the method still maintains good effectiveness in the three-dimensional data space. As can be seen from FIGS. 12b, 12c and 12d, all the points in the abnormal cluster are detected, and 47 of the 48 discrete abnormal points are detected. One discrete abnormal point is missed because it lies close to a normal cluster and appears normal in one of the multiple verifications, so it is determined to be a normal point.
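The multiple verification behavior described for both data sets (a candidate is confirmed only if it appears abnormal in every one of 3 verifications, and is dropped as soon as it appears normal once) can be sketched as follows. The class name `MultiVerifier`, the use of point identifiers, and counting each window in which the point is flagged as one verification are assumptions made for illustration.

```python
class MultiVerifier:
    """Track candidate abnormal points across successive sliding windows."""

    def __init__(self, required=3):
        self.required = required
        self.counts = {}  # point id -> consecutive abnormal verifications

    def update(self, abnormal_now):
        """Process one window.

        abnormal_now maps a point id to True if the point is flagged as
        abnormal in the current window, False if it appears normal.
        Returns the ids confirmed as abnormal points in this window.
        """
        confirmed = []
        for pid, is_abnormal in abnormal_now.items():
            if not is_abnormal:
                # Appeared normal once: discard the candidate
                self.counts.pop(pid, None)
                continue
            self.counts[pid] = self.counts.get(pid, 0) + 1
            if self.counts[pid] >= self.required:
                confirmed.append(pid)
                del self.counts[pid]
        return confirmed
```

A point such as the missed discrete abnormal point in data set 2 would follow the `is_abnormal == False` branch once and be discarded, while the 3 misjudged normal points in data set 1 would reach the required count.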

While the effectiveness of the method of this embodiment is verified, the method is also compared with the conventional methods to further verify its advantages. Table 2 gives the statistical information of the experimental results, namely the detailed results of the comparative experiments of the three methods on the three data sets. As can be seen from Table 2, the method provided by this embodiment has a high detection rate and a low false positive rate, its effectiveness is significantly better than that of the other two methods, and its superiority is more pronounced when the dimension of the data set is higher. Method I combines the W-K-Means and DBSCAN methods and dynamically updates the parameters required by DBSCAN and the weight of each dimension, so it adapts well to dynamic data streams; however, because it uses a conventional distance- and density-based anomaly measure, its effectiveness decreases as the dimension increases. The I-IncLOF method is based on the idea of local density and is likewise affected by the curse of dimensionality: it performs well when the data dimension is low, but its effectiveness is poor when the dimension increases.

Through the verification of different data sets and the comparative analysis with the traditional method, it can be seen that the method for data stream anomaly detection and multi-verification based on the enhanced angle anomaly factor provided by the embodiment has better effectiveness and feasibility.

TABLE 1

TABLE 2

Claims

1. A method for data stream anomaly detection and multiple verification based on an enhanced angle anomaly factor, characterized by comprising the following steps:

2) Setting a data set S in a sliding window: the processing of step 1) yields the data set S in the current sliding window; let S = {X₁, X₂, ..., X_n} contain n data points, each data point being represented by its attributes, for subsequent clustering and anomaly detection;

5) Obtaining an r-neighborhood point set: according to the spatial neighborhood radius r, obtaining the r-neighborhood point set of each data point, namely the set of all data points enclosed by a region of radius r centered at that point;

6) Obtaining the angle factor and the local density of the r-neighborhood point set: obtaining, in combination with the distance matrix dist, the angle factor of the r-neighborhood point set and the local density of the r-neighborhood point set;

8) Obtaining the cluster center factor τ(x_i) of each data point: combining step 6) and step 7) to obtain the cluster center factor τ(x_i) according to formula (5); the cluster center factor τ(x_i) measures the degree to which a data point lies at a cluster center;

9) Acquiring an attribution matrix: sorting the cluster center factors of all data points obtained in step 8) in descending order to obtain τ(p₁) ≥ τ(p₂) ≥ ... ≥ τ(p_n), so as to obtain the attribution matrix F = [f₁, f₂, ..., f_n] used for clustering;

10) Determining cluster centers and clustering: performing cluster center determination and clustering on the data set S by using the cluster center factors and the attribution matrix, all data points with the same class label forming one set, namely a cluster, to obtain m (m = C_{center_id}) clusters C₁, C₂, ..., C_m, which completes the clustering of the data set S;

11) Performing anomaly detection on each cluster after clustering: for the clusters C_i (i = 1, 2, ..., m) obtained in step 10), anomaly detection is first performed separately on each of the clusters C₁, C₂, ..., C_m of the clustered data set S to obtain the abnormal point set O_i of each cluster, and finally the set of all abnormal points O = {O₁, ..., O_m} in the data set S is obtained; the formulas involved in anomaly detection are as follows: the intra-cluster angle factor is given by formula (7):

the local increment value H(X_j) is given by formula (8):

the sum of distances to the k nearest neighbors, L(X_j), is given by formula (9):

the enhanced angle anomaly factor EAOF(X_j) is given by formula (10):

wherein o is the cluster center of the cluster to which data point X_j belongs, dist(o, X_j) is the distance between data point X_j and its cluster center, V_Ci(X_j) is the angle factor of each data point within cluster C_i (i = 1, 2, ..., m) relative to that cluster, and H(X_j) is the local increment value;

12) Multiple verification: verifying all candidate abnormal points multiple times; a candidate abnormal point that still appears abnormal after a finite number of verifications is judged to be a confirmed abnormal point and is output and stored, whereas a candidate abnormal point that appears normal during the verification process is directly discarded. This improves the accuracy of anomaly detection.

2. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the processing in step 1) means that the data collected by the data collection terminal is buffered in stream form and the buffered data is divided into basic windows E₀, E₁, E₂, ...; at the transition from time T_i to time T_{i+1}, the sliding window slides from W_i to W_{i+1}, the new basic window E_{i+1} is merged in and the historical basic window E_{i-1} is removed, and the candidate abnormal points detected in W_i at time T_i are incorporated into W_{i+1} for multiple verification.

3. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the calculation formula of the angle factor of the r neighborhood point set in step 6) is formula (2):

4. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the calculation formula of the local density of the r-neighborhood point set in step 6) is formula (3):

the local density depends on both the number and the positions of the neighborhood data points: the more neighborhood data points there are and the closer they lie to the center of the data set, the larger the local density.

5. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the dissimilarity δ(x_i) in step 7) is obtained after sorting the local densities of all data points in descending order, and the calculation formula of the dissimilarity δ(x_i) is formula (4):

6. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the attribution matrix F = [f₁, f₂, ..., f_n] in step 9) is used for recording the attribution relationship between data points, and the expression of each element is formula (6):