CN111666981B - System data anomaly detection method based on genetic fuzzy clustering - Google Patents


Info

Publication number
CN111666981B
CN111666981B (application CN202010402204.3A)
Authority
CN
China
Prior art keywords
data
cluster
clustering
center
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010402204.3A
Other languages
Chinese (zh)
Other versions
CN111666981A (en)
Inventor
田园
原野
马文
黄祖源
付谱平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd filed Critical Information Center of Yunnan Power Grid Co Ltd
Priority to CN202010402204.3A priority Critical patent/CN111666981B/en
Publication of CN111666981A publication Critical patent/CN111666981A/en
Application granted granted Critical
Publication of CN111666981B publication Critical patent/CN111666981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/23 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Clustering techniques
    • G06F 18/241 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/126 — Physics; Computing; Computing arrangements based on biological models using genetic models; Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention relates to a system data anomaly detection method based on genetic fuzzy clustering, belonging to the technical field of data anomaly detection. The method first applies discrete normalization to a data set collected by a system platform, randomizes the normalized data, and divides it into a training sample set and a test sample set. Fuzzy clustering is performed on the training sample set, and genetic operations are applied to the cluster centers obtained from the fuzzy processing. An optimal classification number and the corresponding clustering result set are then obtained. The clustering result set is then label-classified to obtain the cluster centers of the normal data classes and of the abnormal data classes. Finally, the distance between each sample in the test sample set and each labeled cluster center is calculated; the subclass whose center has the minimum distance measure to a test sample is taken as the cluster to which that sample belongs, so that abnormal data in the test sample set can be detected.

Description

System data anomaly detection method based on genetic fuzzy clustering
Technical Field
The invention relates to a system data anomaly detection method based on genetic fuzzy clustering, and belongs to the technical field of data anomaly detection.
Background
With the rapid development of information technology, data from service-based system platforms can become anomalous during transmission for various reasons. The FCM fuzzy clustering algorithm is often applied in the field of data anomaly detection, but the traditional FCM algorithm easily falls into local optima. Moreover, anomalous data sets often have mixed attributes, and processing such mixed-attribute data sets is computationally expensive. To solve these problems, the invention applies an anomaly detection method that combines the FCM fuzzy clustering algorithm with a genetic algorithm to system platform data, which overcomes the tendency of the FCM algorithm to fall into local optima.
Disclosure of Invention
The invention provides a system data anomaly detection method based on genetic fuzzy clustering, which first accounts for the mixed-attribute characteristic of data sets provided by a system platform by improving the distance-measure calculation, and then solves the problem that the fuzzy clustering algorithm easily falls into a local optimum by combining it with a genetic algorithm.
The technical scheme of the invention is as follows: a system data anomaly detection method based on genetic fuzzy clustering comprises the following specific steps:
Step1: First, normalize all data in the data set provided by the system; then randomize the normalized data; then divide the randomized data, finally obtaining a training sample set TR and a test sample set TE.
Step2: determining the maximum classification number C of the training sample set TR max And the minimum classification number C min Form a maximum classification number C max And the minimum classification number C min Set of (C) = { C min ,C min +1,...,C max Constructing a fuzzy clustering model and a genetic algorithm model of the mixed attribute data set, and classifying the maximum classification number C max And the minimum classification number C min The set c is transmitted to the models to obtain a set OFV about the objective function value, each value of the OFV in the set corresponds to a cluster number, and the cluster numbers are combined into a set which is set as CN.
Step3: analyzing to obtain the optimal classification number C by combining the set OFV and the set CN obtained by Step2 and the minimum element set and variance analysis in the set OFV *
Step4: the optimal classification number C obtained in Step3 * Generating a corresponding clustering result denoted as C, andC i ,i=1,2,...,C * in which C is i Representing the set of class i results, and the corresponding cluster-centric PCC, which is i ,i=1,2,...,C * Wherein PCC i Representing the i-th class center.
Step5: performing mark classification on the clustering result C obtained in Step4, wherein the mark classification aims to distinguish normal clustering and abnormal clustering in the result;
the distinguishing principle is as follows:
setting a proportionality coefficient eta, 0 < eta < 1, if
Figure BDA0002489918320000021
And judging the cluster result type as a normal cluster result type, otherwise, judging the cluster result type as an abnormal cluster result type.
where Count(C_i) is the number of samples in the i-th class result set of the clustering result C, and Count(TR) is the size of the training sample set.
The final normal clustering result classes are denoted NCRC, with NCRC_i, i = 1, 2, ..., iN, the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, ..., iN, the i-th class center in the normal clustering result.

The abnormal clustering result classes are denoted ACRC, with ACRC_j, j = 1, 2, ..., jN, the j-th class in the abnormal clustering result; their cluster centers are denoted PACRC, with PACRC_j, j = 1, 2, ..., jN, the j-th class center in the abnormal clustering result, where iN + jN = C*.
Step6: and acquiring a normal clustering result type NCRC and an abnormal clustering result type ACRC from Step6, and a corresponding normal clustering result type center PNCRC and an abnormal clustering result type center PACRC to perform abnormal detection on the data set.
For the test sample set TE = {x_1, x_2, ..., x_n} preprocessed according to Step1, let x_i be the data to be detected. Separately calculate the distance measures between x_i and the centers in PNCRC and PACRC obtained in Step5; the subclass whose cluster center has the minimum distance measure to x_i is the cluster to which x_i belongs.
If the data x_i to be detected belongs to a subclass of the normal clustering result class NCRC obtained in Step5, it is normal data.

If the data x_i to be detected belongs to a subclass of the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
The specific steps of Step1 are as follows:
using discrete normalization, all data X = { X in all datasets provided by the system 1 ,x 2 ,…,x n Is mapped to [0,1 ]]In the above-mentioned manner,
Figure BDA0002489918320000031
for each data x i Normalization was performed using the following formula:
Figure BDA0002489918320000032
where min { X } is the minimum value in the system-provided data set, max { X } is the maximum value in the system-provided data set, and X i ' for each data x i The normalized data values are then randomized, and finally divided into a training sample set TR and a test sample set TE.
The specific steps of Step2 are as follows:

The data set TR = {x_1, x_2, ..., x_i, ..., x_n} preprocessed in Step1 is a data set with mixed attributes, where x_j = [x_j1, ..., x_jl, ..., x_jm]^T is the mixed-attribute vector of the j-th sample of TR, x_jl is the l-th attribute feature of sample x_j, and m is the attribute dimension of x_j. The dissimilarity measure between two mixed-attribute samples x_i and x_j can be expressed as the Minkowski distance

d_ij = ( Σ_{l=1}^{m} |x_il − x_jl|^p )^{1/p}

where x_i and x_j are the i-th and j-th samples in the TR data set and d_ij is the Minkowski distance between them. For the set c of classification numbers from C_min to C_max initialized in Step2, each element of c is passed in turn to the membership calculation function in order to search for the best classification number.
Step2.1: calculating the degree of membership, wherein the function of the degree of membership is as follows:
u_ij = 1 / Σ_{r=1}^{c(k)} ( d(x_j, P_i) / d(x_j, P_r) )^{2/(h−1)}

where u_ij is the membership of sample x_j in the i-th class (the values u_ij form the membership matrix), h is the fuzzy coefficient, and d(x_j, P_i) is the dissimilarity between sample x_j and the i-th cluster center. The indices i and j range over [1, n], where n is the size of the training sample set TR, and a membership set u is finally obtained; c(k) denotes the k-th element of the set c.
Step2.2: degree of membership u obtained by Step2.1 ij Its clustering center can be obtained, which is:
Figure BDA0002489918320000041
wherein, P i Is the cluster center of the ith class, i ranges from [1,n]And n is the number of the training sample sets TR, and finally a cluster center set P can be obtained.
Step2.3: the membership u obtained by mixing Step2.1 and Step2.2 ij And in clusteringHeart P i The objective function of the mixed attribute data set is:
Figure BDA0002489918320000042
wherein, J h For the value of the objective function, i has a value in the range of [1,n ]]And n is the number of the training sample sets TR, and finally an objective function set J can be obtained.
Step2.4: by the value of the objective function J h Establishing a fitness function, wherein the fitness function is as follows:
Figure BDA0002489918320000043
wherein, F i The value range of i is [1, n ] for the value returned by the fitness function]N is the number of the training sample set TR, and finally a fitness function value set F can be obtained, wherein epsilon is a small enough positive number and the range is (e) -10 ,e -20 ) In the meantime.
Step2.5: fitness function value F obtained by Step2.4 i Determining the selection operator formula as follows:
Figure BDA0002489918320000044
wherein, the PSE i For selecting an operator, i and k have a value range of [1, n ]]And n is the number of the training sample sets TR, and finally a selection operator set PSE can be obtained.
Step2.6: and coding the clustering center P obtained from Step2.2 according to a multi-parameter binary coding mode.
Step2.7: initialization of crossover operator P c Mutation operator P m And population number PQ and a maximum genetic algebra MGA, performing genetic operation on the encoded clustering center P obtained by Step2.6, then substituting the clustering center P subjected to the genetic operation into Step2.3, and then recalculating Step2.4 to obtain a new fitnessA function value set, and then continuously passing through a selection operator PSE and a crossover operator P c And mutation operator P m And evolving a new fitness function set value of the next generation.
When the change of each element value in the fitness function set is not large, or the genetic algebra reaches MGA, the genetic operation is stopped, the objective function value set J corresponding to the obtained latest fitness function value set and the minimum objective function value in the objective function value set J are added into the set OFV of the objective function values, the corresponding classification number k is added into the clustering number set CN, k = k +1 is made, step2 is repeatedly executed until k = C max -C min +1 is the maximum classification number C max And the minimum classification number C min The last element in set c of (c) stops execution.
The specific steps of Step3 are as follows:
step3.1: obtaining a cluster number set SCN in a set CN corresponding to the minimum element set ME = min { OFV } in the set OFV and the minimum element set min { OFV }.
Step3.2: if the minimum element set ME has only one element, the cluster number in the corresponding set SCN also has only one element, and the element is the optimal cluster number C * (ii) a Otherwise, step3.3 is executed for analysis of variance.
Step3.3: setting the set ME as a control variable and the set SCN as an observation variable, carrying out variance analysis, analyzing the square difference analysis result, and finding out the set SCN element corresponding to the group with the largest significance difference, namely the optimal clustering number C *
The specific steps of Step5 are as follows:
step5.1: for the clustering result C obtained from Step4, C i ,i=1,2,3,...,C * In which C is i The ith class result set is represented, and the class result sets are sorted from large to small according to the number of each class result set.
Step5.2: let i =1, K = N ·, where N = Count (TR), i.e., the total number of training sample sets, and K is a threshold value, which represents a critical point for dividing the number of normal clusters and abnormal clusters.
Step5.3: if Count (C) i ) If > K, then C is i For normal cluster partitioning, divide C i Adding the cluster centers to the set NCRC, and adding the corresponding cluster centers to the set PNCRC; otherwise C i For abnormal cluster partitioning, divide C i Add to the set ACRC, add its corresponding cluster center to the set PACRC, perform 5.4.
Step5.4: if i = i +1 and i ≦ C * Then 5.3 is executed, otherwise the loop is exited.
The specific steps of Step6 are as follows:
step6.1: separately calculate x i Distance measures from each element in the PNCRC and PACRC sets obtained in Step5.
Step6.2: comparing the distance measure with the data x to be detected i The subclass corresponding to the cluster center with the minimum distance measure is the cluster to which the subclass belongs.
Step6.3: thus when data x to be detected is detected i The cluster subclass belongs to the normal clustering result class NCRC obtained in Step5, and the cluster subclass is normal data;
when data x to be detected i And if the cluster subclass belongs to the abnormal clustering result class ACRC obtained at Step5, the cluster subclass is abnormal data.
The beneficial effects of the invention are as follows: the invention classifies system platform data with a fuzzy clustering method and improves the distance-measure calculation for the mixed-attribute characteristic of the system data set; it applies genetic operations to the cluster centers, solving the problem that the fuzzy clustering algorithm easily falls into a local optimum; and the distance-measure-based anomaly detection algorithm is simple, clear, and highly efficient in detection.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a plot of fitness versus classification number according to the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, the technical solution of the present invention is: a system data anomaly detection method based on genetic fuzzy clustering comprises the following specific steps:
Step1: First, normalize all data in the data set provided by the system; then randomize the normalized data; then divide the randomized data, finally obtaining a training sample set TR and a test sample set TE.
Step2: determining the maximum classification number C of the training sample set TR max And the minimum classification number C min Form a maximum classification number C max And the minimum classification number C min Set of (C) = { C min ,C min +1,...,C max Constructing a fuzzy clustering model and a genetic algorithm model of the mixed attribute data set, and classifying the maximum classification number C max And the minimum classification number C min The set c of (2) is transmitted to the models to obtain a set OFV related to the objective function value, each value of the OFV in the set corresponds to a cluster number, and the cluster numbers are combined into a set which is set as CN.
Step3: analyzing to obtain the optimal classification number C by combining the set OFV and the set CN obtained by Step2 and the minimum element set and variance analysis in the set OFV *
Step4: the optimal classification number C obtained in Step3 * Generating a corresponding clustering result marked as C, and C i ,i=1,2,...,C * In which C is i Representing the ith class result set, and the corresponding cluster-centric PCC, which i ,i=1,2,...,C * Wherein PCC i Representing the ith class center.
Step5: performing mark classification on the clustering result C obtained in Step4, wherein the mark classification aims to distinguish normal clustering and abnormal clustering in the result;
the distinguishing principle is as follows:
setting a proportionality coefficient eta, 0 < eta < 1, if
Figure BDA0002489918320000071
Then it is considered asAnd (4) clustering the result type normally, or else, clustering the result type abnormally.
where Count(C_i) is the number of samples in the i-th class result set of the clustering result C, and Count(TR) is the size of the training sample set.
The final normal clustering result classes are denoted NCRC, with NCRC_i, i = 1, 2, ..., iN, the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, ..., iN, the i-th class center in the normal clustering result.

The abnormal clustering result classes are denoted ACRC, with ACRC_j, j = 1, 2, ..., jN, the j-th class in the abnormal clustering result; their cluster centers are denoted PACRC, with PACRC_j, j = 1, 2, ..., jN, the j-th class center in the abnormal clustering result, where iN + jN = C*.
Step6: and acquiring a normal clustering result type NCRC and an abnormal clustering result type as ACRC from Step6, and a corresponding normal clustering result type center PNCRC and an abnormal clustering result type center PACRC, thereby carrying out abnormal detection on the data set.
For the test sample set TE = {x_1, x_2, ..., x_n} preprocessed according to Step1, let x_i be the data to be detected. Separately calculate the distance measures between x_i and the centers in PNCRC and PACRC obtained in Step5; the subclass whose cluster center has the minimum distance measure to x_i is the cluster to which x_i belongs.
If the data x_i to be detected belongs to a subclass of the normal clustering result class NCRC obtained in Step5, it is normal data.

If the data x_i to be detected belongs to a subclass of the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
The specific steps of Step1 are as follows:
the use of discrete normalization allows for the use of discrete normalization, all data X = { X in all data sets to be provided by the system 1 ,x 2 ,…,x n Mapping to [0,1 ]]In the above-mentioned manner,
Figure BDA0002489918320000081
for each data x i Normalization was performed using the following formula:
Figure BDA0002489918320000082
where min { X } is the minimum value in the system-supplied data set, max { X } is the maximum value in the system-supplied data set, X i ' for each data x i The normalized data values are then randomized, and finally divided into a training sample set TR and a test sample set TE.
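As a minimal sketch (with illustrative helper names, not from the patent), the Step1 preprocessing — min-max normalization, shuffling, and a train/test split — can be written as:

```python
import random

def min_max_normalize(xs):
    """Map every value of X into [0, 1] via (x - min{X}) / (max{X} - min{X})."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0                 # guard against a constant data set
    return [(x - lo) / span for x in xs]

def split_train_test(xs, train_ratio=0.8, seed=42):
    """Shuffle the normalized data, then split it into TR and TE."""
    rng = random.Random(seed)
    shuffled = xs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]   # TR, TE

data = [5.0, 1.0, 3.0, 9.0, 7.0]
norm = min_max_normalize(data)              # 1.0 maps to 0.0, 9.0 maps to 1.0
TR, TE = split_train_test(norm)
```

The split ratio and seed are assumptions; the patent fixes only the record counts of its example (12000 training, 1200 test).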
The specific steps of Step2 are as follows:
data set TR = { x after preprocessing by Step1 1 ,x 2 ,…,x i ,…,x n Is a set of data sets with mixed attributes,
Figure BDA0002489918320000083
wherein x is j =[x j1 ,…,x jl ,...,x jm ] T Mixed property, x, representing the jth sample of the data set TR jl Representing a sample x j M is x j A dimension containing an attribute feature; sample x with mixed properties i And x j The dissimilarity measure may be expressed as follows:
Figure BDA0002489918320000084
Figure BDA0002489918320000085
wherein x is i ,x j For the ith and jth samples in the TR data set, d ij Minkowski distance, representing the ith to jth sample in the TR dataset, maximum class number C obtained from initialization in Step2 max And the minimum classification number C min The set c of (a) needs to continuously transmit each element in the set c to the membership calculation function, so as to search which classification number is the best.
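The exact mixed-attribute dissimilarity formula survives only as images in the source, so the following sketch assumes a common form: a Minkowski distance over the numeric features plus a 0/1 mismatch count over the categorical ones, equally weighted:

```python
def mixed_dissimilarity(xi, xj, numeric_idx, p=2):
    """Dissimilarity between two mixed-attribute samples: a Minkowski
    distance over the numeric features plus a 0/1 mismatch count over
    the categorical ones (equal weighting is an assumption here)."""
    num = sum(abs(xi[l] - xj[l]) ** p for l in numeric_idx) ** (1.0 / p)
    cat = sum(1 for l in range(len(xi))
              if l not in numeric_idx and xi[l] != xj[l])
    return num + cat

a = [0.2, 0.5, "tcp"]
b = [0.6, 0.5, "udp"]
d = mixed_dissimilarity(a, b, numeric_idx={0, 1})   # 0.4 numeric + 1 mismatch
```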
Step2.1: calculating the degree of membership, wherein the function of the degree of membership is as follows:
u_ij = 1 / Σ_{r=1}^{c(k)} ( d(x_j, P_i) / d(x_j, P_r) )^{2/(h−1)}

where u_ij is the membership of sample x_j in the i-th class (the values u_ij form the membership matrix), h is the fuzzy coefficient, and d(x_j, P_i) is the dissimilarity between sample x_j and the i-th cluster center. The indices i and j range over [1, n], where n is the size of the training sample set TR, and a membership set u is finally obtained; c(k) denotes the k-th element of the set c. Let k = 1, so that c(k) is initially the first element of the set c.
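The Step2.1 membership update, in its standard FCM form for one sample against the current cluster centers, can be sketched as:

```python
def memberships(dists, h=2.0):
    """u_ij = 1 / sum_r (d_ij / d_rj)^(2/(h-1)): memberships of one sample
    in each class, given its distances to the c(k) cluster centers."""
    expo = 2.0 / (h - 1.0)
    if 0.0 in dists:                     # sample coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    return [1.0 / sum((d_i / d_r) ** expo for d_r in dists) for d_i in dists]

u = memberships([1.0, 3.0])              # sample is closer to the first center
```

Memberships over the centers always sum to one, which is the usual FCM constraint.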
Step2.2: degree of membership u obtained by Step2.1 ij Its clustering center can be obtained, which is:
Figure BDA0002489918320000091
wherein, P i Is the cluster center of the ith class, i ranges from [1,n]And n is the number of the training sample sets TR, and finally a cluster center set P can be obtained.
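The Step2.2 center update — a membership-weighted mean — can be sketched for scalar samples as:

```python
def cluster_center(samples, u_i, h=2.0):
    """P_i = sum_j u_ij^h * x_j / sum_j u_ij^h: a membership-weighted
    mean of the (scalar) samples for class i."""
    w = [u ** h for u in u_i]
    return sum(wi * x for wi, x in zip(w, samples)) / sum(w)

P = cluster_center([0.0, 1.0], [0.9, 0.1])   # pulled strongly toward x = 0.0
```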
Step2.3: the degree of membership u obtained by separately mixing Step2.1 and Step2.2 ij And a cluster center P i The objective function of the mixed attribute data set is:
Figure BDA0002489918320000092
wherein, J h For the value of the objective function, i has a value in the range of [1,n ]]And n is the number of the training sample sets TR, and finally an objective function set J can be obtained.
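The Step2.3 objective, given the memberships and sample-to-center distances as c × n matrices, can be sketched as:

```python
def objective(u, d, h=2.0):
    """J_h = sum_i sum_j u_ij^h * d(x_j, P_i)^2, with u and d given as
    c x n membership and sample-to-center distance matrices."""
    c, n = len(u), len(u[0])
    return sum(u[i][j] ** h * d[i][j] ** 2
               for i in range(c) for j in range(n))

# Two samples, each fully assigned to its nearest of two centers:
J = objective(u=[[1.0, 0.0], [0.0, 1.0]],
              d=[[0.5, 2.0], [2.0, 0.5]])
```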
Step2.4: by the value of the objective function J h Establishing a fitness function, wherein the fitness function is as follows:
Figure BDA0002489918320000093
wherein, F i The value of i is in the range of [1,n ] for the value returned by the fitness function]N is the number of the training sample set TR, and finally a fitness function value set F can be obtained, wherein epsilon is a small enough positive number and the range is (e) -10 ,e -20 ) In the meantime.
Step2.5: fitness function value F obtained by Step2.4 i Determining the selection operator formula as follows:
Figure BDA0002489918320000094
wherein, the PSE i For selecting an operator, i and k have a value range of [1, n ]]And n is the number of the training sample sets TR, and finally a selection operator set PSE can be obtained.
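Steps 2.4 and 2.5 together amount to fitness-proportional (roulette-wheel) selection; a sketch follows, where the `roulette_pick` helper is illustrative and not named in the patent:

```python
import random

def fitness(J_h, eps=1e-12):
    """F = 1 / (J_h + eps): smaller objective values score higher."""
    return 1.0 / (J_h + eps)

def selection_probs(fits):
    """PSE_i = F_i / sum_k F_k."""
    total = sum(fits)
    return [f / total for f in fits]

def roulette_pick(fits, rng):
    """Pick one individual with probability proportional to its fitness."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(selection_probs(fits)):
        acc += p
        if r <= acc:
            return i
    return len(fits) - 1

F = [fitness(J) for J in (0.5, 1.0, 4.0)]
PSE = selection_probs(F)                 # favours the smallest objective
```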
Step2.6: and coding the clustering center P obtained from Step2.2 according to a multi-parameter binary coding mode.
Step2.7: initialization of crossover operator P c Mutation operator P m And population quantity PQ and a maximum genetic algebra MGA, performing genetic operation on the encoded clustering center P obtained by Step2.6, then substituting the clustering center P subjected to the genetic operation into Step2.3, then recalculating Step2.4 to obtain a new fitness function value set, and then continuously passing through a selection operator PSE and a crossover operator P c Sum mutation operator P m And evolving a new fitness function set value of the next generation.
When the value of each element in the fitness function set is not changed greatly or the genetic algebra reaches MGA, the genetic operation is stopped, the objective function value set J corresponding to the obtained latest fitness function value set and the minimum objective function value in the objective function value set J are added into a set OFV of objective function values, the corresponding classification number k is added into a cluster number set CN, k = k +1, step2 is repeatedly executed until k = C max -C min +1,Is the maximum classification number C max And the minimum classification number C min Stops execution at the last element in set c.
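The genetic operations of Steps 2.6–2.7 can be sketched as follows, assuming centers in [0, 1] encoded as 16-bit binary strings (the patent does not specify the bit width), with one-point crossover and bit-flip mutation; the stagnation test is omitted for brevity:

```python
import random

BITS = 16
SCALE = (1 << BITS) - 1

def encode(center):
    """Real-valued cluster center in [0, 1] -> fixed-width binary string."""
    return format(int(round(center * SCALE)), "0{}b".format(BITS))

def decode(bits):
    """Binary string -> real value in [0, 1]."""
    return int(bits, 2) / SCALE

def crossover(a, b, rng):
    """One-point crossover of two encoded centers."""
    cut = rng.randrange(1, BITS)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return "".join(("1" if c == "0" else "0") if rng.random() < p_m else c
                   for c in bits)

rng = random.Random(1)
c1, c2 = encode(0.25), encode(0.75)
o1, o2 = crossover(c1, c2, rng)
o1 = mutate(o1, p_m=0.1, rng=rng)
```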
The specific steps of Step3 are as follows:
step3.1: a cluster number set SCN in the set CN corresponding to the minimum element set ME = min { OFV } in the set OFV and the minimum element set min { OFV } is obtained.
Step3.2: if the minimum element set ME has only one element, the cluster number in the corresponding set SCN also has only one element, and the element is the optimal cluster number C * (ii) a Otherwise, step3.3 is executed for analysis of variance.
Step3.3: setting the set ME as a control variable and the set SCN as an observation variable, performing variance analysis, analyzing the square difference analysis result, and finding out the set SCN element corresponding to the group with the largest significance difference, namely the optimal clustering number C *
The specific steps of Step4 are as follows:

With the optimal classification number C* obtained in Step3, the algorithm flow returns to Step2 to obtain the memberships, from which the corresponding clustering result is generated, denoted C, with classes C_i, i = 1, 2, ..., C*, where C_i is the i-th class result set, and the corresponding cluster centers PCC_i, i = 1, 2, ..., C*, where PCC_i is the i-th class center.
The specific steps of Step5 are as follows:
step5.1: for the clustering result C obtained from Step4, C i ,i=1,2,3,...,C * In which C is i The ith class result set is represented, and the class result sets are sorted from large to small according to the number of each class result set.
Step5.2: let i =1, K = N × η, where N = Count (TR), i.e., the total number of training sample sets, and K is a threshold value representing a critical point dividing the number of normal clusters and abnormal clusters.
Step5.3: if Count (C) i ) If > K, then C is i Is normalClustering and dividing C i Adding the cluster center to the set NCRC, and adding the corresponding cluster center to the set PNCRC; otherwise C i For abnormal cluster partitioning, divide C i Add to the set ACRC, add its corresponding cluster center to the set PACRC, perform 5.4.
Step5.4: if i = i +1 and i ≦ C * Then 5.3 is executed, otherwise the loop is exited.
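The Step5 labeling loop reduces to a size threshold K = N·η; a sketch with hypothetical class sizes:

```python
def label_clusters(cluster_sizes, eta):
    """Split class ids into normal vs abnormal by the size threshold
    K = N * eta (classes strictly above K are labeled normal)."""
    N = sum(cluster_sizes.values())
    K = N * eta
    order = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    NCRC = [i for i in order if cluster_sizes[i] > K]
    ACRC = [i for i in order if cluster_sizes[i] <= K]
    return NCRC, ACRC

# Hypothetical sizes: 1000 training samples, eta = 0.05 -> K = 50
NCRC, ACRC = label_clusters({0: 900, 1: 80, 2: 20}, eta=0.05)
```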
The specific steps of Step6 are as follows:
step6.1: separately calculate x i Distance measures from each element in the PNCRC and PACRC sets obtained in Step5.
Step6.2: comparing the distance measure with the data x to be detected i The subclass corresponding to the cluster center with the minimum distance measure is the cluster to which the subclass belongs.
Step6.3: thus when data x to be detected i The cluster subclass belongs to the normal clustering result class NCRC obtained in Step5, and the cluster subclass is normal data;
when data x to be detected i And if the cluster subclass belongs to the abnormal clustering result ACRC obtained in Step5, the cluster subclass is abnormal data.
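The Step6 decision rule — the nearest labeled center wins — can be sketched for scalar centers as:

```python
def detect(x, PNCRC, PACRC):
    """Assign x to the nearest labeled cluster center (scalar absolute
    distance here); x is abnormal iff that center is an abnormal one."""
    best_normal = min(abs(x - c) for c in PNCRC)
    best_abnormal = min(abs(x - c) for c in PACRC)
    return "abnormal" if best_abnormal < best_normal else "normal"

# Hypothetical centers, not taken from the patent's experiment:
label = detect(0.92, PNCRC=[0.2, 0.5], PACRC=[0.95])
```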
Further, the steps of the present application are illustrated by the following example:
the data set adopted by the invention is a California hosting dataset provided for a Google platform, which is referred to as CHD data set hereinafter, the CHD data set is randomly ordered firstly, 12000 records are extracted from the data set as a training sample set TR, wherein 11880 records are normal data packets, and 120 records are abnormal data packets. Then, 1200 strips in this data set are extracted as a test sample set TE. And 9 key attributes on the data set were selected for testing.
Before the experiment begins, some initial parameters are set based on experience: P_c is set to 0.7, P_m to 0.1, the population size PQ to 55, and the maximum genetic generation to 180; the clustering number c ranges over [1, 20]. After these preparations, the steps of the system data anomaly detection method based on genetic fuzzy clustering are carried out to obtain the optimal classification number and the clustering result. As can be seen from FIG. 2, the fitness function determines the optimal classification number to be 8.
The algorithm continues to execute, retaining the chromosome with the best fitness, until the maximum genetic generation is reached, and the clustering result is obtained as follows:
Table 1: Clustering results [table content appears only as images in the source]
Further, the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) were obtained. TP means the true condition is abnormal data and the detection also reports abnormal data; FP means the true condition is normal data but the detection reports abnormal data; FN means the true condition is abnormal data but the detection reports normal data; TN means the true condition is normal data and the detection also reports normal data. From TP, FP, FN, and TN, the accuracy (ACC) and the recall (REC) can be measured, where ACC is the percentage of predictions that are correct, and REC is the percentage of all abnormal data that is correctly identified. The calculation formulas are as follows:
ACC = (TP + TN) / (TP + TN + FP + FN)
REC = TP / (TP + FN)
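The two formulas above can be evaluated directly; a small helper (ours, not part of the patent) makes the computation explicit:

```python
def accuracy_recall(tp, fp, fn, tn):
    """Compute ACC = (TP+TN)/(TP+TN+FP+FN) and REC = TP/(TP+FN)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn)
    return acc, rec
```

For instance, with hypothetical counts TP=90, FP=5, FN=10, TN=895 over 1000 test records, ACC is 0.985 and REC is 0.9.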
Finally, a clustering table with accuracy and recall is obtained:
[Table 2 is rendered as an image in the original publication.]
Table 2: clustering table with accuracy and recall
As can be seen from Table 2, both the accuracy and the recall are high, which shows that the method of the invention is highly reliable.
The working principle of the invention is as follows: first, the data set acquired by the system platform is subjected to discrete normalization, the normalized data set is randomized, and it is divided into a training sample set and a test sample set. Fuzzy clustering is performed on the training sample set, and genetic operations are applied to the cluster centers obtained from the fuzzy clustering, yielding the optimal classification number and the corresponding clustering result set. The clustering result set is then class-marked to obtain the cluster centers of the normal data classes and of the abnormal data classes. Finally, the distance from each sample in the test sample set to each class-marked cluster center is calculated; the subclass whose center has the minimum distance measure to a sample is taken as the cluster to which that sample belongs, so that abnormal data in the test sample set can be detected.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A system data anomaly detection method based on genetic fuzzy clustering is characterized in that:
Step1: firstly standardizing all data in the data set provided by the system, then randomizing the standardized data, then dividing the randomized data, and finally obtaining a training sample set TR and a test sample set TE;
Step2: determining the maximum classification number C_max and the minimum classification number C_min of the training sample set TR, forming the set c = {C_min, C_min + 1, …, C_max}; constructing a fuzzy clustering model and a genetic algorithm model of the mixed-attribute data set, and passing the set c to these models to obtain a set OFV of objective function values, each value in OFV corresponding to a cluster number, these cluster numbers forming a set denoted CN;
Step3: obtaining the optimal classification number C* by combining the sets OFV and CN obtained in Step2 with the minimum-element set of OFV and analysis of variance;
Step4: generating, from the optimal classification number C* obtained in Step3, the corresponding clustering result denoted C, with C_i, i = 1, 2, …, C*, where C_i represents the i-th class result set, and the corresponding cluster centers denoted PCC, with PCC_i, i = 1, 2, …, C*, where PCC_i represents the i-th class center;
Step5: marking the clustering result C obtained in Step4, so as to distinguish normal clusters from abnormal clusters in the result;
the principle of distinction is:
setting a proportionality coefficient η, 0 < η < 1; if
Count(C_i) / Count(TR) > η
the class is judged to be a normal clustering result class, otherwise it is judged to be an abnormal clustering result class;
wherein Count(C_i) represents the number of samples in the i-th class result set of the clustering result C, and Count(TR) represents the number of samples in the training sample set;
the final normal clustering result class is denoted NCRC, with NCRC_i, i = 1, 2, …, iN, where NCRC_i denotes the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, …, iN, where PNCRC_i denotes the i-th class center in the normal clustering result;
the abnormal clustering result class is denoted ACRC, with ACRC_j, j = 1, 2, …, jN, where ACRC_j denotes the j-th class in the abnormal clustering result; the corresponding abnormal clustering result class centers are denoted PACRC, with PACRC_j, j = 1, 2, …, jN, where PACRC_j denotes the j-th class center in the abnormal clustering result; and iN + jN = C*;
Step6: using the normal clustering result class NCRC, the abnormal clustering result class ACRC, and the corresponding class centers PNCRC and PACRC obtained in Step5 to perform anomaly detection on the data set;
for the test sample set preprocessed according to Step1, TE = {x′_1, x′_2, …, x′_n}, let x_i be the data to be detected; separately calculate the distance measures from x_i to the PNCRC and PACRC obtained in Step5; the subclass corresponding to the cluster center with the minimum distance measure to the data to be detected x_i is the cluster to which it belongs;
when the data to be detected x_i belongs to a cluster subclass in the normal clustering result class NCRC obtained in Step5, it is normal data;
when the data to be detected x_i belongs to a cluster subclass in the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
2. The genetic fuzzy clustering-based systematic data anomaly detection method according to claim 1, wherein the Step1 comprises the following specific steps:
using discrete normalization, all data X = {x_1, x_2, …, x_n} in the data set provided by the system are mapped into [0,1]; for each data x_i, 1 ≤ i ≤ n, normalization is performed using the following formula:
x_i′ = (x_i − min{X}) / (max{X} − min{X})
where min{X} is the minimum value in the data set provided by the system, max{X} is the maximum value in the data set provided by the system, and x_i′ is the normalized value of each data x_i; the normalized data are then randomized, and finally divided into a training sample set TR and a test sample set TE.
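The discrete normalization of claim 2 is ordinary min-max scaling; a one-function sketch (the function name is ours, not from the patent):

```python
def discrete_normalize(X):
    """Map every value of X into [0, 1] via x' = (x - min{X}) / (max{X} - min{X})."""
    lo, hi = min(X), max(X)
    return [(x - lo) / (hi - lo) for x in X]
```

In practice each of the 9 selected attributes would be normalized independently; a constant attribute (max{X} = min{X}) would need special handling, which the claim does not specify.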
3. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step2 comprises the following steps:
the data set TR = {x_1, x_2, …, x_i, …, x_n} preprocessed by Step1 is a data set with mixed attributes, where x_j = [x_j1, …, x_jl, …, x_jm]^T, 1 ≤ j ≤ n, represents the mixed attributes of the j-th sample of the data set TR, x_jl represents the l-th attribute of sample x_j, and m is the number of attribute-feature dimensions of x_j; the dissimilarity measure between samples x_i and x_j with mixed attributes can be expressed as the Minkowski distance
d_ij = ( Σ_l |x_il − x_jl|^p )^(1/p), l = 1, …, m
wherein x_i, x_j are the i-th and j-th samples in the TR data set, and d_ij represents the Minkowski distance from the i-th to the j-th sample in the TR data set;
Step2.1: calculating the degree of membership, whose function is:
u_ij = 1 / Σ_r (d_ij / d_rj)^(2/(h−1)), r = 1, …, c(k)
wherein u_ij is the element of the membership matrix giving the degree to which sample x_j belongs to the i-th class, and h is the fuzzy coefficient; i and j range over [1, n], n being the number of samples in the training sample set TR; the membership set u is obtained, and c(k) represents the k-th element of the set c;
Step2.2: obtaining the cluster centers of the system from the degrees of membership u_ij of Step2.1:
P_i = Σ_j (u_ij)^h x_j / Σ_j (u_ij)^h, j = 1, …, n
wherein P_i is the cluster center of the i-th class, i ranges over [1, n], n is the number of samples in the training sample set TR, and the cluster center set P is obtained;
Step2.3: combining the degrees of membership u_ij from Step2.1 and the cluster centers P_i from Step2.2, the objective function of the mixed-attribute data set is:
J_h = Σ_i Σ_j (u_ij)^h (d_ij)^2
wherein J_h is the value of the objective function, i ranges over [1, n], n is the number of samples in the training sample set TR, and the objective function set J is obtained;
Step2.4: establishing a fitness function from the objective function value J_h:
F_i = 1 / (J_h + ε)
wherein F_i is the value returned by the fitness function, i ranges over [1, n], n is the number of samples in the training sample set TR, the fitness function value set F is obtained, and ε is a positive number in the range (e^−20, e^−10);
Step2.5: determining the selection operator from the fitness function values F_i obtained in Step2.4:
PSE_i = F_i / Σ_k F_k, k = 1, …, n
wherein PSE_i is the selection operator, i and k range over [1, n], n is the number of samples in the training sample set TR, and the selection operator set PSE is obtained;
step2.6: encoding the clustering center P obtained from Step2.2 according to a multi-parameter binary encoding mode;
Step2.7: initializing the crossover operator P_c, the mutation operator P_m, the population size PQ, and the maximum number of genetic generations MGA; performing genetic operations on the encoded cluster centers P obtained in Step2.6, then substituting the cluster centers after genetic operation into Step2.3, recalculating Step2.4 to obtain a new fitness function value set, and then continuing to evolve the next generation's fitness function value set through the selection operator PSE, the crossover operator P_c, and the mutation operator P_m;
when the values in the fitness function set change little between generations, or the number of generations reaches MGA, the genetic operation stops; the minimum objective function value in the objective function value set J corresponding to the latest fitness function value set is added to the set OFV of objective function values, and the corresponding classification number k is added to the cluster number set CN; let k = k + 1 and repeat Step2 until k = C_max − C_min + 1, i.e., until the last element of the set c formed from the maximum classification number C_max and the minimum classification number C_min has been processed.
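Steps 2.1 and 2.2 above are the familiar fuzzy c-means update. The following sketch shows one such iteration, assuming Euclidean distance in place of the patent's mixed-attribute dissimilarity and a fuzzy coefficient h = 2; the function name and NumPy layout are ours, not from the patent:

```python
import numpy as np

def fcm_step(X, centers, h=2.0):
    """One fuzzy c-means update: memberships (Step2.1), then centers (Step2.2).

    X: (n, m) samples; centers: (c, m) current cluster centers.
    Implements u_ij = 1 / sum_r (d_ij/d_rj)^(2/(h-1)) and
    P_i = sum_j u_ij^h x_j / sum_j u_ij^h.
    """
    # d[i, j]: distance from center i to sample j
    d = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # guard against division by zero
    # ratio[i, r, j] = d_ij / d_rj; summing over r gives the denominator
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (h - 1.0))
    u = 1.0 / ratio.sum(axis=1)           # membership matrix u[i, j]
    uh = u ** h
    new_centers = (uh @ X) / uh.sum(axis=1, keepdims=True)
    return u, new_centers
```

Each column of u sums to 1, as the membership formula requires; the genetic operations of Step2.5-2.7 would then perturb the encoded centers between such updates.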
4. The genetic fuzzy clustering-based systematic data anomaly detection method according to claim 1, wherein the specific steps of Step3 are:
Step3.1: obtaining the minimum element set ME = min{OFV} of the set OFV, and the cluster number set SCN in the set CN corresponding to ME;
Step3.2: if the minimum element set ME has only one element, the corresponding set SCN also has only one element, and that element is the optimal cluster number C*; otherwise, performing Step3.3, analysis of variance;
Step3.3: taking the set ME as the control variable and the set SCN as the observation variable, performing analysis of variance, analyzing the result, and finding the element of the set SCN corresponding to the group with the largest significant difference, which is the optimal cluster number C*.
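When the minimum-element set has a single member, Step3 reduces to an arg-min lookup over the paired sets OFV and CN. A sketch (ours; the ANOVA tie-break of Step3.3 is not reproduced, so the candidate list is returned instead when several cluster numbers tie):

```python
def optimal_cluster_number(ofv, cn):
    """Pick C* per Step3.1-3.2: cluster numbers whose objective value
    equals min(OFV); a single survivor is C*, a tie returns the SCN list."""
    m = min(ofv)
    scn = [c for v, c in zip(ofv, cn) if v == m]
    return scn[0] if len(scn) == 1 else scn
```

With objective values for cluster numbers 6, 8, and 10, the smallest objective value selects the corresponding cluster number directly.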
5. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step5 comprises the following steps:
Step5.1: for the clustering result C obtained in Step4, with C_i, i = 1, 2, 3, …, C*, where C_i represents the i-th class result set, sorting the class result sets from largest to smallest by the number of samples in each;
Step5.2: letting i = 1 and K = N·η, where N = Count(TR), i.e., the total number of samples in the training sample set, and K is a threshold representing the critical point dividing normal clusters from abnormal clusters;
Step5.3: if Count(C_i) > K, then C_i is a normal cluster partition: C_i is added to the set NCRC and the corresponding cluster center is added to the set PNCRC; otherwise C_i is an abnormal cluster partition: C_i is added to the set ACRC and the corresponding cluster center is added to the set PACRC; then Step5.4 is executed;
Step5.4: letting i = i + 1; if i ≤ C*, Step5.3 is executed, otherwise the loop exits.
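The threshold test of Step5 can be sketched as follows. This is an illustration only: representing each cluster as a list of its samples and the function name are our assumptions.

```python
def mark_clusters(clusters, eta):
    """Split clusters into normal/abnormal per Step5: a cluster whose
    size exceeds K = N * eta (N = total sample count) is normal."""
    n_total = sum(len(c) for c in clusters)
    k = n_total * eta
    ncrc = [c for c in clusters if len(c) > k]   # normal clustering result classes
    acrc = [c for c in clusters if len(c) <= k]  # abnormal clustering result classes
    return ncrc, acrc
```

The rule encodes the assumption underlying the method: anomalies are rare, so any cluster holding less than a fraction η of the training data is treated as an abnormal class.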
6. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step6 comprises the following steps:
Step6.1: separately calculating the distance measures from x_i to each element in the PNCRC and PACRC sets obtained in Step5;
Step6.2: comparing the distance measures: the subclass corresponding to the cluster center with the minimum distance measure to the data to be detected x_i is the cluster to which it belongs;
Step6.3: when the data to be detected x_i belongs to a cluster subclass in the normal clustering result class NCRC obtained in Step5, it is normal data;
when the data to be detected x_i belongs to a cluster subclass in the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
CN202010402204.3A 2020-05-13 2020-05-13 System data anomaly detection method based on genetic fuzzy clustering Active CN111666981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402204.3A CN111666981B (en) 2020-05-13 2020-05-13 System data anomaly detection method based on genetic fuzzy clustering


Publications (2)

Publication Number Publication Date
CN111666981A CN111666981A (en) 2020-09-15
CN111666981B true CN111666981B (en) 2023-03-31

Family

ID=72383542


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408370B (en) * 2021-05-31 2023-12-19 西安电子科技大学 Forest change remote sensing detection method based on adaptive parameter genetic algorithm
CN116109176B (en) * 2022-12-21 2024-01-05 成都安讯智服科技有限公司 Alarm abnormity prediction method and system based on collaborative clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239800A (en) * 2017-06-06 2017-10-10 常州工学院 Relaxation fuzzy c-means clustering algorithm
CN108710914A (en) * 2018-05-22 2018-10-26 常州工学院 A kind of unsupervised data classification method based on generalized fuzzy clustering algorithm
CN109446028A (en) * 2018-10-26 2019-03-08 中国人民解放军火箭军工程大学 A kind of cooled dehumidifier unit state monitoring method based on Genetic Algorithm Fuzzy C-Mean cluster
CN109669990A (en) * 2018-11-16 2019-04-23 重庆邮电大学 A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN



Similar Documents

Publication Publication Date Title
CN111626336B (en) Subway fault data classification method based on unbalanced data set
CN109891508B (en) Single cell type detection method, device, apparatus and storage medium
CN111666981B (en) System data anomaly detection method based on genetic fuzzy clustering
CN107292330A (en) A kind of iterative label Noise Identification algorithm based on supervised learning and semi-supervised learning double-point information
CN108647272A (en) A kind of small sample extending method based on data distribution
CN111079836A (en) Process data fault classification method based on pseudo label method and weak supervised learning
CN113259331A (en) Unknown abnormal flow online detection method and system based on incremental learning
CN111191726A (en) Fault classification method based on weak supervised learning multi-layer perceptron
CN111046930A (en) Power supply service satisfaction influence factor identification method based on decision tree algorithm
CN106896219A (en) The identification of transformer sub-health state and average remaining lifetime method of estimation based on Gases Dissolved in Transformer Oil data
CN114114039A (en) Method and device for evaluating consistency of single battery cells of battery system
CN114266289A (en) Complex equipment health state assessment method
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
CN116189909B (en) Clinical medicine discriminating method and system based on lifting algorithm
CN111105041B (en) Machine learning method and device for intelligent data collision
CN116864011A (en) Colorectal cancer molecular marker identification method and system based on multiple sets of chemical data
WO2019149133A1 (en) Resource processing method, storage medium, and computer device
Liu et al. Fuzzy c-mean algorithm based on Mahalanobis distances and better initial values
CN110991517A (en) Classification method and system for unbalanced data set in stroke
CN112102882B (en) Quality control system and method for NGS detection process of tumor sample
CN113392086B (en) Medical database construction method, device and equipment based on Internet of things
CN109492705A (en) Method for diagnosing faults of the one kind based on mahalanobis distance (MD) area measurement
CN114548306A (en) Intelligent monitoring method for early drilling overflow based on misclassification cost
Wen A unified view of false discovery rate control: Reconciliation of bayesian and frequentist approaches
CN113255810A (en) Network model testing method based on key decision logic design test coverage rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant