CN111666981B - System data anomaly detection method based on genetic fuzzy clustering - Google Patents


Info

Publication number
CN111666981B
CN111666981B (application CN202010402204.3A)
Authority
CN
China
Prior art keywords
data
cluster
clustering
center
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010402204.3A
Other languages
Chinese (zh)
Other versions
CN111666981A (en)
Inventor
田园
原野
马文
黄祖源
付谱平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd filed Critical Information Center of Yunnan Power Grid Co Ltd
Priority to CN202010402204.3A priority Critical patent/CN111666981B/en
Publication of CN111666981A publication Critical patent/CN111666981A/en
Application granted granted Critical
Publication of CN111666981B publication Critical patent/CN111666981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/23 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Clustering techniques
    • G06F 18/241 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/126 — Physics; Computing; Computing arrangements based on biological models using genetic models; Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention relates to a system data anomaly detection method based on genetic fuzzy clustering, belonging to the technical field of data anomaly detection. The method first applies discrete normalization to a data set collected by a system platform, randomizes the normalized data, and divides it into a training sample set and a test sample set. Fuzzy clustering is performed on the training sample set, and genetic operations are applied to the cluster centers obtained from the fuzzy processing. An optimal classification number and the corresponding clustering result set are then obtained. The clustering result set is then label-classified to obtain the cluster centers of the normal data classes and of the abnormal data classes. Finally, the distance between each sample in the test sample set and each labeled cluster center is calculated; the subclass whose center has the minimum distance measure to a test sample is taken as the cluster to which that sample belongs, so that abnormal data in the test sample set can be detected.

Description

System data anomaly detection method based on genetic fuzzy clustering
Technical Field
The invention relates to a system data anomaly detection method based on genetic fuzzy clustering, and belongs to the technical field of data anomaly detection.
Background
With the rapid development of information technology, data from service-based system platforms can become anomalous during transmission for various reasons. The FCM fuzzy clustering algorithm is often applied in the field of data anomaly detection, but the traditional FCM algorithm easily falls into local optima. Moreover, anomalous data sets often have mixed attributes, and processing such mixed-attribute data sets is computationally expensive. To solve these problems, the invention applies an anomaly detection method that combines the FCM fuzzy clustering algorithm with a genetic algorithm to system platform data, which overcomes the tendency of the FCM algorithm to fall into local optima.
Disclosure of Invention
The invention provides a system data anomaly detection method based on genetic fuzzy clustering, which first accounts for the mixed-attribute characteristic of data sets provided by a system platform by improving the distance-measure calculation, and then solves the problem that the fuzzy clustering algorithm easily falls into a local optimum by combining it with a genetic algorithm.
The technical scheme of the invention is as follows: a system data anomaly detection method based on genetic fuzzy clustering comprises the following specific steps:
Step1: First, normalize all data in the data set provided by the system; then randomize the normalized data; then divide the randomized data, finally obtaining a training sample set TR and a test sample set TE.
Step2: determining the maximum classification number C of the training sample set TR max And the minimum classification number C min Form a maximum classification number C max And the minimum classification number C min Set of (C) = { C min ,C min +1,...,C max Constructing a fuzzy clustering model and a genetic algorithm model of the mixed attribute data set, and classifying the maximum classification number C max And the minimum classification number C min The set c is transmitted to the models to obtain a set OFV about the objective function value, each value of the OFV in the set corresponds to a cluster number, and the cluster numbers are combined into a set which is set as CN.
Step3: analyzing to obtain the optimal classification number C by combining the set OFV and the set CN obtained by Step2 and the minimum element set and variance analysis in the set OFV *
Step4: the optimal classification number C obtained in Step3 * Generating a corresponding clustering result denoted as C, andC i ,i=1,2,...,C * in which C is i Representing the set of class i results, and the corresponding cluster-centric PCC, which is i ,i=1,2,...,C * Wherein PCC i Representing the i-th class center.
Step5: performing mark classification on the clustering result C obtained in Step4, wherein the mark classification aims to distinguish normal clustering and abnormal clustering in the result;
the distinguishing principle is as follows:
setting a proportionality coefficient eta, 0 < eta < 1, if
Figure BDA0002489918320000021
And judging the cluster result type as a normal cluster result type, otherwise, judging the cluster result type as an abnormal cluster result type.
where Count(C_i) is the number of samples in the i-th class result set of the clustering result C, and Count(TR) is the size of the training sample set.
The final normal clustering result classes are denoted NCRC, with NCRC_i, i = 1, 2, ..., iN, the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, ..., iN, the i-th class center in the normal clustering result.

The abnormal clustering result classes are denoted ACRC, with ACRC_j, j = 1, 2, ..., jN, the j-th class in the abnormal clustering result; their cluster centers are denoted PACRC, with PACRC_j, j = 1, 2, ..., jN, the j-th class center in the abnormal clustering result, where iN + jN = C*.
Step6: and acquiring a normal clustering result type NCRC and an abnormal clustering result type ACRC from Step6, and a corresponding normal clustering result type center PNCRC and an abnormal clustering result type center PACRC to perform abnormal detection on the data set.
For the test sample set TE = {x_1, x_2, ..., x_n} preprocessed according to Step1, let x_i be the data to be detected. Separately calculate the distance measures between x_i and the centers in PNCRC and PACRC obtained in Step5; the subclass whose cluster center has the minimum distance measure to x_i is the cluster to which x_i belongs.
If the data x_i to be detected belongs to a subclass of the normal clustering result class NCRC obtained in Step5, it is normal data.

If the data x_i to be detected belongs to a subclass of the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
The specific steps of Step1 are as follows:
using discrete normalization, all data X = { X in all datasets provided by the system 1 ,x 2 ,…,x n Is mapped to [0,1 ]]In the above-mentioned manner,
Figure BDA0002489918320000031
for each data x i Normalization was performed using the following formula:
Figure BDA0002489918320000032
where min { X } is the minimum value in the system-provided data set, max { X } is the maximum value in the system-provided data set, and X i ' for each data x i The normalized data values are then randomized, and finally divided into a training sample set TR and a test sample set TE.
The specific steps of Step2 are as follows:

The data set TR = {x_1, x_2, ..., x_i, ..., x_n} preprocessed in Step1 is a data set with mixed attributes, where x_j = [x_j1, ..., x_jl, ..., x_jm]^T is the mixed-attribute vector of the j-th sample of TR, x_jl is the l-th attribute feature of sample x_j, and m is the attribute dimension of x_j. The dissimilarity measure between two mixed-attribute samples x_i and x_j can be expressed as the Minkowski distance

d_ij = ( Σ_{l=1}^{m} |x_il − x_jl|^p )^{1/p}

where x_i and x_j are the i-th and j-th samples in the TR data set and d_ij is the Minkowski distance between them. For the set c of classification numbers from C_min to C_max initialized in Step2, each element of c is passed in turn to the membership calculation function in order to search for the best classification number.
Step2.1: calculating the degree of membership, wherein the function of the degree of membership is as follows:
u_ij = 1 / Σ_{r=1}^{c(k)} ( d(x_j, P_i) / d(x_j, P_r) )^{2/(h−1)}

where u_ij is the membership of sample x_j in the i-th class (the values u_ij form the membership matrix), h is the fuzzy coefficient, and d(x_j, P_i) is the dissimilarity between sample x_j and the i-th cluster center. The indices i and j range over [1, n], where n is the size of the training sample set TR, and a membership set u is finally obtained; c(k) denotes the k-th element of the set c.
Step2.2: degree of membership u obtained by Step2.1 ij Its clustering center can be obtained, which is:
Figure BDA0002489918320000041
wherein, P i Is the cluster center of the ith class, i ranges from [1,n]And n is the number of the training sample sets TR, and finally a cluster center set P can be obtained.
Step2.3: the membership u obtained by mixing Step2.1 and Step2.2 ij And in clusteringHeart P i The objective function of the mixed attribute data set is:
Figure BDA0002489918320000042
wherein, J h For the value of the objective function, i has a value in the range of [1,n ]]And n is the number of the training sample sets TR, and finally an objective function set J can be obtained.
Step2.4: by the value of the objective function J h Establishing a fitness function, wherein the fitness function is as follows:
Figure BDA0002489918320000043
wherein, F i The value range of i is [1, n ] for the value returned by the fitness function]N is the number of the training sample set TR, and finally a fitness function value set F can be obtained, wherein epsilon is a small enough positive number and the range is (e) -10 ,e -20 ) In the meantime.
Step2.5: fitness function value F obtained by Step2.4 i Determining the selection operator formula as follows:
Figure BDA0002489918320000044
wherein, the PSE i For selecting an operator, i and k have a value range of [1, n ]]And n is the number of the training sample sets TR, and finally a selection operator set PSE can be obtained.
Step2.6: and coding the clustering center P obtained from Step2.2 according to a multi-parameter binary coding mode.
Step2.7: initialization of crossover operator P c Mutation operator P m And population number PQ and a maximum genetic algebra MGA, performing genetic operation on the encoded clustering center P obtained by Step2.6, then substituting the clustering center P subjected to the genetic operation into Step2.3, and then recalculating Step2.4 to obtain a new fitnessA function value set, and then continuously passing through a selection operator PSE and a crossover operator P c And mutation operator P m And evolving a new fitness function set value of the next generation.
When the change of each element value in the fitness function set is not large, or the genetic algebra reaches MGA, the genetic operation is stopped, the objective function value set J corresponding to the obtained latest fitness function value set and the minimum objective function value in the objective function value set J are added into the set OFV of the objective function values, the corresponding classification number k is added into the clustering number set CN, k = k +1 is made, step2 is repeatedly executed until k = C max -C min +1 is the maximum classification number C max And the minimum classification number C min The last element in set c of (c) stops execution.
The specific steps of Step3 are as follows:
step3.1: obtaining a cluster number set SCN in a set CN corresponding to the minimum element set ME = min { OFV } in the set OFV and the minimum element set min { OFV }.
Step3.2: if the minimum element set ME has only one element, the cluster number in the corresponding set SCN also has only one element, and the element is the optimal cluster number C * (ii) a Otherwise, step3.3 is executed for analysis of variance.
Step3.3: setting the set ME as a control variable and the set SCN as an observation variable, carrying out variance analysis, analyzing the square difference analysis result, and finding out the set SCN element corresponding to the group with the largest significance difference, namely the optimal clustering number C *
The specific steps of Step5 are as follows:
step5.1: for the clustering result C obtained from Step4, C i ,i=1,2,3,...,C * In which C is i The ith class result set is represented, and the class result sets are sorted from large to small according to the number of each class result set.
Step5.2: let i =1, K = N ·, where N = Count (TR), i.e., the total number of training sample sets, and K is a threshold value, which represents a critical point for dividing the number of normal clusters and abnormal clusters.
Step5.3: if Count (C) i ) If > K, then C is i For normal cluster partitioning, divide C i Adding the cluster centers to the set NCRC, and adding the corresponding cluster centers to the set PNCRC; otherwise C i For abnormal cluster partitioning, divide C i Add to the set ACRC, add its corresponding cluster center to the set PACRC, perform 5.4.
Step5.4: if i = i +1 and i ≦ C * Then 5.3 is executed, otherwise the loop is exited.
The specific steps of Step6 are as follows:
step6.1: separately calculate x i Distance measures from each element in the PNCRC and PACRC sets obtained in Step5.
Step6.2: comparing the distance measure with the data x to be detected i The subclass corresponding to the cluster center with the minimum distance measure is the cluster to which the subclass belongs.
Step6.3: thus when data x to be detected is detected i The cluster subclass belongs to the normal clustering result class NCRC obtained in Step5, and the cluster subclass is normal data;
when data x to be detected i And if the cluster subclass belongs to the abnormal clustering result class ACRC obtained at Step5, the cluster subclass is abnormal data.
The beneficial effects of the invention are as follows: the invention classifies system platform data with a fuzzy clustering method and improves the distance-measure calculation for the mixed-attribute characteristic of the system data set; it applies genetic operations to the cluster centers, solving the problem that the fuzzy clustering algorithm easily falls into a local optimum; and the distance-measure-based anomaly detection algorithm is simple, clear, and highly efficient in detection.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a plot of fitness versus classification number according to the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
As shown in fig. 1, the technical solution of the present invention is: a system data anomaly detection method based on genetic fuzzy clustering comprises the following specific steps:
Step1: First, normalize all data in the data set provided by the system; then randomize the normalized data; then divide the randomized data, finally obtaining a training sample set TR and a test sample set TE.
Step2: determining the maximum classification number C of the training sample set TR max And the minimum classification number C min Form a maximum classification number C max And the minimum classification number C min Set of (C) = { C min ,C min +1,...,C max Constructing a fuzzy clustering model and a genetic algorithm model of the mixed attribute data set, and classifying the maximum classification number C max And the minimum classification number C min The set c of (2) is transmitted to the models to obtain a set OFV related to the objective function value, each value of the OFV in the set corresponds to a cluster number, and the cluster numbers are combined into a set which is set as CN.
Step3: analyzing to obtain the optimal classification number C by combining the set OFV and the set CN obtained by Step2 and the minimum element set and variance analysis in the set OFV *
Step4: the optimal classification number C obtained in Step3 * Generating a corresponding clustering result marked as C, and C i ,i=1,2,...,C * In which C is i Representing the ith class result set, and the corresponding cluster-centric PCC, which i ,i=1,2,...,C * Wherein PCC i Representing the ith class center.
Step5: performing mark classification on the clustering result C obtained in Step4, wherein the mark classification aims to distinguish normal clustering and abnormal clustering in the result;
the distinguishing principle is as follows:
setting a proportionality coefficient eta, 0 < eta < 1, if
Figure BDA0002489918320000071
Then it is considered asAnd (4) clustering the result type normally, or else, clustering the result type abnormally.
where Count(C_i) is the number of samples in the i-th class result set of the clustering result C, and Count(TR) is the size of the training sample set.
The final normal clustering result classes are denoted NCRC, with NCRC_i, i = 1, 2, ..., iN, the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, ..., iN, the i-th class center in the normal clustering result.

The abnormal clustering result classes are denoted ACRC, with ACRC_j, j = 1, 2, ..., jN, the j-th class in the abnormal clustering result; their cluster centers are denoted PACRC, with PACRC_j, j = 1, 2, ..., jN, the j-th class center in the abnormal clustering result, where iN + jN = C*.
Step6: and acquiring a normal clustering result type NCRC and an abnormal clustering result type as ACRC from Step6, and a corresponding normal clustering result type center PNCRC and an abnormal clustering result type center PACRC, thereby carrying out abnormal detection on the data set.
For the test sample set TE = {x_1, x_2, ..., x_n} preprocessed according to Step1, let x_i be the data to be detected. Separately calculate the distance measures between x_i and the centers in PNCRC and PACRC obtained in Step5; the subclass whose cluster center has the minimum distance measure to x_i is the cluster to which x_i belongs.
If the data x_i to be detected belongs to a subclass of the normal clustering result class NCRC obtained in Step5, it is normal data.

If the data x_i to be detected belongs to a subclass of the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
The specific steps of Step1 are as follows:
the use of discrete normalization allows for the use of discrete normalization, all data X = { X in all data sets to be provided by the system 1 ,x 2 ,…,x n Mapping to [0,1 ]]In the above-mentioned manner,
Figure BDA0002489918320000081
for each data x i Normalization was performed using the following formula:
Figure BDA0002489918320000082
where min { X } is the minimum value in the system-supplied data set, max { X } is the maximum value in the system-supplied data set, X i ' for each data x i The normalized data values are then randomized, and finally divided into a training sample set TR and a test sample set TE.
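As a minimal sketch (with illustrative helper names, not from the patent), the Step1 preprocessing — min-max normalization, shuffling, and a train/test split — can be written as:

```python
import random

def min_max_normalize(xs):
    """Map every value of X into [0, 1] via (x - min{X}) / (max{X} - min{X})."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0                 # guard against a constant data set
    return [(x - lo) / span for x in xs]

def split_train_test(xs, train_ratio=0.8, seed=42):
    """Shuffle the normalized data, then split it into TR and TE."""
    rng = random.Random(seed)
    shuffled = xs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]   # TR, TE

data = [5.0, 1.0, 3.0, 9.0, 7.0]
norm = min_max_normalize(data)              # 1.0 maps to 0.0, 9.0 maps to 1.0
TR, TE = split_train_test(norm)
```

The split ratio and seed are assumptions; the patent fixes only the record counts of its example (12000 training, 1200 test).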
The specific steps of Step2 are as follows:
data set TR = { x after preprocessing by Step1 1 ,x 2 ,…,x i ,…,x n Is a set of data sets with mixed attributes,
Figure BDA0002489918320000083
wherein x is j =[x j1 ,…,x jl ,...,x jm ] T Mixed property, x, representing the jth sample of the data set TR jl Representing a sample x j M is x j A dimension containing an attribute feature; sample x with mixed properties i And x j The dissimilarity measure may be expressed as follows:
Figure BDA0002489918320000084
Figure BDA0002489918320000085
wherein x is i ,x j For the ith and jth samples in the TR data set, d ij Minkowski distance, representing the ith to jth sample in the TR dataset, maximum class number C obtained from initialization in Step2 max And the minimum classification number C min The set c of (a) needs to continuously transmit each element in the set c to the membership calculation function, so as to search which classification number is the best.
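The exact mixed-attribute dissimilarity formula survives only as images in the source, so the following sketch assumes a common form: a Minkowski distance over the numeric features plus a 0/1 mismatch count over the categorical ones, equally weighted:

```python
def mixed_dissimilarity(xi, xj, numeric_idx, p=2):
    """Dissimilarity between two mixed-attribute samples: a Minkowski
    distance over the numeric features plus a 0/1 mismatch count over
    the categorical ones (equal weighting is an assumption here)."""
    num = sum(abs(xi[l] - xj[l]) ** p for l in numeric_idx) ** (1.0 / p)
    cat = sum(1 for l in range(len(xi))
              if l not in numeric_idx and xi[l] != xj[l])
    return num + cat

a = [0.2, 0.5, "tcp"]
b = [0.6, 0.5, "udp"]
d = mixed_dissimilarity(a, b, numeric_idx={0, 1})   # 0.4 numeric + 1 mismatch
```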
Step2.1: calculating the degree of membership, wherein the function of the degree of membership is as follows:
u_ij = 1 / Σ_{r=1}^{c(k)} ( d(x_j, P_i) / d(x_j, P_r) )^{2/(h−1)}

where u_ij is the membership of sample x_j in the i-th class (the values u_ij form the membership matrix), h is the fuzzy coefficient, and d(x_j, P_i) is the dissimilarity between sample x_j and the i-th cluster center. The indices i and j range over [1, n], where n is the size of the training sample set TR, and a membership set u is finally obtained; c(k) denotes the k-th element of the set c. Let k = 1, so that c(k) is initially the first element of the set c.
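The Step2.1 membership update, in its standard FCM form for one sample against the current cluster centers, can be sketched as:

```python
def memberships(dists, h=2.0):
    """u_ij = 1 / sum_r (d_ij / d_rj)^(2/(h-1)): memberships of one sample
    in each class, given its distances to the c(k) cluster centers."""
    expo = 2.0 / (h - 1.0)
    if 0.0 in dists:                     # sample coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    return [1.0 / sum((d_i / d_r) ** expo for d_r in dists) for d_i in dists]

u = memberships([1.0, 3.0])              # sample is closer to the first center
```

Memberships over the centers always sum to one, which is the usual FCM constraint.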
Step2.2: degree of membership u obtained by Step2.1 ij Its clustering center can be obtained, which is:
Figure BDA0002489918320000091
wherein, P i Is the cluster center of the ith class, i ranges from [1,n]And n is the number of the training sample sets TR, and finally a cluster center set P can be obtained.
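The Step2.2 center update — a membership-weighted mean — can be sketched for scalar samples as:

```python
def cluster_center(samples, u_i, h=2.0):
    """P_i = sum_j u_ij^h * x_j / sum_j u_ij^h: a membership-weighted
    mean of the (scalar) samples for class i."""
    w = [u ** h for u in u_i]
    return sum(wi * x for wi, x in zip(w, samples)) / sum(w)

P = cluster_center([0.0, 1.0], [0.9, 0.1])   # pulled strongly toward x = 0.0
```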
Step2.3: the degree of membership u obtained by separately mixing Step2.1 and Step2.2 ij And a cluster center P i The objective function of the mixed attribute data set is:
Figure BDA0002489918320000092
wherein, J h For the value of the objective function, i has a value in the range of [1,n ]]And n is the number of the training sample sets TR, and finally an objective function set J can be obtained.
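The Step2.3 objective, given the memberships and sample-to-center distances as c × n matrices, can be sketched as:

```python
def objective(u, d, h=2.0):
    """J_h = sum_i sum_j u_ij^h * d(x_j, P_i)^2, with u and d given as
    c x n membership and sample-to-center distance matrices."""
    c, n = len(u), len(u[0])
    return sum(u[i][j] ** h * d[i][j] ** 2
               for i in range(c) for j in range(n))

# Two samples, each fully assigned to its nearest of two centers:
J = objective(u=[[1.0, 0.0], [0.0, 1.0]],
              d=[[0.5, 2.0], [2.0, 0.5]])
```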
Step2.4: by the value of the objective function J h Establishing a fitness function, wherein the fitness function is as follows:
Figure BDA0002489918320000093
wherein, F i The value of i is in the range of [1,n ] for the value returned by the fitness function]N is the number of the training sample set TR, and finally a fitness function value set F can be obtained, wherein epsilon is a small enough positive number and the range is (e) -10 ,e -20 ) In the meantime.
Step2.5: fitness function value F obtained by Step2.4 i Determining the selection operator formula as follows:
Figure BDA0002489918320000094
wherein, the PSE i For selecting an operator, i and k have a value range of [1, n ]]And n is the number of the training sample sets TR, and finally a selection operator set PSE can be obtained.
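Steps 2.4 and 2.5 together amount to fitness-proportional (roulette-wheel) selection; a sketch follows, where the `roulette_pick` helper is illustrative and not named in the patent:

```python
import random

def fitness(J_h, eps=1e-12):
    """F = 1 / (J_h + eps): smaller objective values score higher."""
    return 1.0 / (J_h + eps)

def selection_probs(fits):
    """PSE_i = F_i / sum_k F_k."""
    total = sum(fits)
    return [f / total for f in fits]

def roulette_pick(fits, rng):
    """Pick one individual with probability proportional to its fitness."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(selection_probs(fits)):
        acc += p
        if r <= acc:
            return i
    return len(fits) - 1

F = [fitness(J) for J in (0.5, 1.0, 4.0)]
PSE = selection_probs(F)                 # favours the smallest objective
```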
Step2.6: and coding the clustering center P obtained from Step2.2 according to a multi-parameter binary coding mode.
Step2.7: initialization of crossover operator P c Mutation operator P m And population quantity PQ and a maximum genetic algebra MGA, performing genetic operation on the encoded clustering center P obtained by Step2.6, then substituting the clustering center P subjected to the genetic operation into Step2.3, then recalculating Step2.4 to obtain a new fitness function value set, and then continuously passing through a selection operator PSE and a crossover operator P c Sum mutation operator P m And evolving a new fitness function set value of the next generation.
When the value of each element in the fitness function set is not changed greatly or the genetic algebra reaches MGA, the genetic operation is stopped, the objective function value set J corresponding to the obtained latest fitness function value set and the minimum objective function value in the objective function value set J are added into a set OFV of objective function values, the corresponding classification number k is added into a cluster number set CN, k = k +1, step2 is repeatedly executed until k = C max -C min +1,Is the maximum classification number C max And the minimum classification number C min Stops execution at the last element in set c.
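The genetic operations of Steps 2.6–2.7 can be sketched as follows, assuming centers in [0, 1] encoded as 16-bit binary strings (the patent does not specify the bit width), with one-point crossover and bit-flip mutation; the stagnation test is omitted for brevity:

```python
import random

BITS = 16
SCALE = (1 << BITS) - 1

def encode(center):
    """Real-valued cluster center in [0, 1] -> fixed-width binary string."""
    return format(int(round(center * SCALE)), "0{}b".format(BITS))

def decode(bits):
    """Binary string -> real value in [0, 1]."""
    return int(bits, 2) / SCALE

def crossover(a, b, rng):
    """One-point crossover of two encoded centers."""
    cut = rng.randrange(1, BITS)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return "".join(("1" if c == "0" else "0") if rng.random() < p_m else c
                   for c in bits)

rng = random.Random(1)
c1, c2 = encode(0.25), encode(0.75)
o1, o2 = crossover(c1, c2, rng)
o1 = mutate(o1, p_m=0.1, rng=rng)
```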
The specific steps of Step3 are as follows:
step3.1: a cluster number set SCN in the set CN corresponding to the minimum element set ME = min { OFV } in the set OFV and the minimum element set min { OFV } is obtained.
Step3.2: if the minimum element set ME has only one element, the cluster number in the corresponding set SCN also has only one element, and the element is the optimal cluster number C * (ii) a Otherwise, step3.3 is executed for analysis of variance.
Step3.3: setting the set ME as a control variable and the set SCN as an observation variable, performing variance analysis, analyzing the square difference analysis result, and finding out the set SCN element corresponding to the group with the largest significance difference, namely the optimal clustering number C *
The specific steps of Step4 are as follows:

With the optimal classification number C* obtained in Step3, the algorithm flow returns to Step2 to obtain the memberships, from which the corresponding clustering result is generated, denoted C, with classes C_i, i = 1, 2, ..., C*, where C_i is the i-th class result set, and the corresponding cluster centers PCC_i, i = 1, 2, ..., C*, where PCC_i is the i-th class center.
The specific steps of Step5 are as follows:
step5.1: for the clustering result C obtained from Step4, C i ,i=1,2,3,...,C * In which C is i The ith class result set is represented, and the class result sets are sorted from large to small according to the number of each class result set.
Step5.2: let i =1, K = N × η, where N = Count (TR), i.e., the total number of training sample sets, and K is a threshold value representing a critical point dividing the number of normal clusters and abnormal clusters.
Step5.3: if Count (C) i ) If > K, then C is i Is normalClustering and dividing C i Adding the cluster center to the set NCRC, and adding the corresponding cluster center to the set PNCRC; otherwise C i For abnormal cluster partitioning, divide C i Add to the set ACRC, add its corresponding cluster center to the set PACRC, perform 5.4.
Step5.4: if i = i +1 and i ≦ C * Then 5.3 is executed, otherwise the loop is exited.
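The Step5 labeling loop reduces to a size threshold K = N·η; a sketch with hypothetical class sizes:

```python
def label_clusters(cluster_sizes, eta):
    """Split class ids into normal vs abnormal by the size threshold
    K = N * eta (classes strictly above K are labeled normal)."""
    N = sum(cluster_sizes.values())
    K = N * eta
    order = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    NCRC = [i for i in order if cluster_sizes[i] > K]
    ACRC = [i for i in order if cluster_sizes[i] <= K]
    return NCRC, ACRC

# Hypothetical sizes: 1000 training samples, eta = 0.05 -> K = 50
NCRC, ACRC = label_clusters({0: 900, 1: 80, 2: 20}, eta=0.05)
```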
The specific steps of Step6 are as follows:
step6.1: separately calculate x i Distance measures from each element in the PNCRC and PACRC sets obtained in Step5.
Step6.2: comparing the distance measure with the data x to be detected i The subclass corresponding to the cluster center with the minimum distance measure is the cluster to which the subclass belongs.
Step6.3: thus when data x to be detected i The cluster subclass belongs to the normal clustering result class NCRC obtained in Step5, and the cluster subclass is normal data;
when data x to be detected i And if the cluster subclass belongs to the abnormal clustering result ACRC obtained in Step5, the cluster subclass is abnormal data.
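The Step6 decision rule — the nearest labeled center wins — can be sketched for scalar centers as:

```python
def detect(x, PNCRC, PACRC):
    """Assign x to the nearest labeled cluster center (scalar absolute
    distance here); x is abnormal iff that center is an abnormal one."""
    best_normal = min(abs(x - c) for c in PNCRC)
    best_abnormal = min(abs(x - c) for c in PACRC)
    return "abnormal" if best_abnormal < best_normal else "normal"

# Hypothetical centers, not taken from the patent's experiment:
label = detect(0.92, PNCRC=[0.2, 0.5], PACRC=[0.95])
```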
Further, the steps of the present application are illustrated by the following example:
the data set adopted by the invention is a California hosting dataset provided for a Google platform, which is referred to as CHD data set hereinafter, the CHD data set is randomly ordered firstly, 12000 records are extracted from the data set as a training sample set TR, wherein 11880 records are normal data packets, and 120 records are abnormal data packets. Then, 1200 strips in this data set are extracted as a test sample set TE. And 9 key attributes on the data set were selected for testing.
Before the experiment begins, some initial parameters are set based on experience: P_c is set to 0.7, P_m to 0.1, the population size PQ to 55, and the maximum genetic generation to 180; the clustering number c ranges over [1, 20]. After these preparations, the steps of the system data anomaly detection method based on genetic fuzzy clustering are carried out to obtain the optimal classification number and the clustering result. As can be seen from FIG. 2, the fitness function determines the optimal classification number to be 8.
The algorithm continues to execute, retaining the chromosome with the best fitness, until the maximum genetic generation is reached, and the clustering result is obtained as follows:
Table 1: Clustering results [table content appears only as images in the source]
Further, the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) were obtained. TP means the true condition is abnormal data and the detection also reports abnormal data; FP means the true condition is normal data but the detection reports abnormal data; FN means the true condition is abnormal data but the detection reports normal data; TN means the true condition is normal data and the detection also reports normal data. From TP, FP, FN, and TN, the accuracy (ACC) and the recall (REC) can be measured, where ACC is the percentage of predictions that are correct, and REC is the percentage of all abnormal data that is correctly identified. The calculation formulas are as follows:
ACC = (TP + TN) / (TP + TN + FP + FN)
REC = TP / (TP + FN)
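The two formulas above can be evaluated directly; a small helper (ours, not part of the patent) makes the computation explicit:

```python
def accuracy_recall(tp, fp, fn, tn):
    """Compute ACC = (TP+TN)/(TP+TN+FP+FN) and REC = TP/(TP+FN)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn)
    return acc, rec
```

For instance, with hypothetical counts TP=90, FP=5, FN=10, TN=895 over 1000 test records, ACC is 0.985 and REC is 0.9.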
Finally, a clustering table with accuracy and recall is obtained:
[Table 2 is rendered as an image in the original publication.]
Table 2: clustering table with accuracy and recall
As can be seen from Table 2, both the accuracy and the recall are high, which shows that the method of the invention is highly reliable.
The working principle of the invention is as follows: first, the data set acquired by the system platform is subjected to discrete normalization, the normalized data set is randomized, and it is divided into a training sample set and a test sample set. Fuzzy clustering is performed on the training sample set, and genetic operations are applied to the cluster centers obtained from the fuzzy clustering, yielding the optimal classification number and the corresponding clustering result set. The clustering result set is then class-marked to obtain the cluster centers of the normal data classes and of the abnormal data classes. Finally, the distance from each sample in the test sample set to each class-marked cluster center is calculated; the subclass whose center has the minimum distance measure to a sample is taken as the cluster to which that sample belongs, so that abnormal data in the test sample set can be detected.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A system data anomaly detection method based on genetic fuzzy clustering is characterized in that:
Step1: firstly standardizing all data in the data set provided by the system, then randomizing the standardized data, then dividing the randomized data, and finally obtaining a training sample set TR and a test sample set TE;
Step2: determining the maximum classification number C_max and the minimum classification number C_min of the training sample set TR, forming the set c = {C_min, C_min + 1, …, C_max}; constructing a fuzzy clustering model and a genetic algorithm model of the mixed-attribute data set, and passing the set c to these models to obtain a set OFV of objective function values, each value in OFV corresponding to a cluster number, these cluster numbers forming a set denoted CN;
Step3: obtaining the optimal classification number C* by combining the sets OFV and CN obtained in Step2 with the minimum-element set of OFV and analysis of variance;
Step4: generating, from the optimal classification number C* obtained in Step3, the corresponding clustering result denoted C, with C_i, i = 1, 2, …, C*, where C_i represents the i-th class result set, and the corresponding cluster centers denoted PCC, with PCC_i, i = 1, 2, …, C*, where PCC_i represents the i-th class center;
Step5: marking the clustering result C obtained in Step4, so as to distinguish normal clusters from abnormal clusters in the result;
the principle of distinction is:
setting a proportionality coefficient η, 0 < η < 1; if
Count(C_i) / Count(TR) > η
the class is judged to be a normal clustering result class, otherwise it is judged to be an abnormal clustering result class;
wherein Count(C_i) represents the number of samples in the i-th class result set of the clustering result C, and Count(TR) represents the number of samples in the training sample set;
the final normal clustering result class is denoted NCRC, with NCRC_i, i = 1, 2, …, iN, where NCRC_i denotes the i-th class in the normal clustering result; the corresponding normal clustering result class centers are denoted PNCRC, with PNCRC_i, i = 1, 2, …, iN, where PNCRC_i denotes the i-th class center in the normal clustering result;
the abnormal clustering result class is denoted ACRC, with ACRC_j, j = 1, 2, …, jN, where ACRC_j denotes the j-th class in the abnormal clustering result; the corresponding abnormal clustering result class centers are denoted PACRC, with PACRC_j, j = 1, 2, …, jN, where PACRC_j denotes the j-th class center in the abnormal clustering result; and iN + jN = C*;
Step6: using the normal clustering result class NCRC, the abnormal clustering result class ACRC, and the corresponding class centers PNCRC and PACRC obtained in Step5 to perform anomaly detection on the data set;
for the test sample set preprocessed according to Step1, TE = {x′_1, x′_2, …, x′_n}, let x_i be the data to be detected; separately calculate the distance measures from x_i to the PNCRC and PACRC obtained in Step5; the subclass corresponding to the cluster center with the minimum distance measure to the data to be detected x_i is the cluster to which it belongs;
when the data to be detected x_i belongs to a cluster subclass in the normal clustering result class NCRC obtained in Step5, it is normal data;
when the data to be detected x_i belongs to a cluster subclass in the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
2. The genetic fuzzy clustering-based systematic data anomaly detection method according to claim 1, wherein the Step1 comprises the following specific steps:
using discrete normalization, all data X = {x_1, x_2, …, x_n} in the data set provided by the system are mapped into [0,1]; for each data x_i, 1 ≤ i ≤ n, normalization is performed using the following formula:
x_i′ = (x_i − min{X}) / (max{X} − min{X})
where min{X} is the minimum value in the data set provided by the system, max{X} is the maximum value in the data set provided by the system, and x_i′ is the normalized value of each data x_i; the normalized data are then randomized, and finally divided into a training sample set TR and a test sample set TE.
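The discrete normalization of claim 2 is ordinary min-max scaling; a one-function sketch (the function name is ours, not from the patent):

```python
def discrete_normalize(X):
    """Map every value of X into [0, 1] via x' = (x - min{X}) / (max{X} - min{X})."""
    lo, hi = min(X), max(X)
    return [(x - lo) / (hi - lo) for x in X]
```

In practice each of the 9 selected attributes would be normalized independently; a constant attribute (max{X} = min{X}) would need special handling, which the claim does not specify.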
3. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step2 comprises the following steps:
the data set TR = {x_1, x_2, …, x_i, …, x_n} preprocessed by Step1 is a data set with mixed attributes, where x_j = [x_j1, …, x_jl, …, x_jm]^T, 1 ≤ j ≤ n, represents the mixed attributes of the j-th sample of the data set TR, x_jl represents the l-th attribute of sample x_j, and m is the number of attribute-feature dimensions of x_j; the dissimilarity measure between samples x_i and x_j with mixed attributes can be expressed as the Minkowski distance
d_ij = ( Σ_l |x_il − x_jl|^p )^(1/p), l = 1, …, m
wherein x_i, x_j are the i-th and j-th samples in the TR data set, and d_ij represents the Minkowski distance from the i-th to the j-th sample in the TR data set;
Step2.1: calculating the degree of membership, whose function is:
u_ij = 1 / Σ_r (d_ij / d_rj)^(2/(h−1)), r = 1, …, c(k)
wherein u_ij is the element of the membership matrix giving the degree to which sample x_j belongs to the i-th class, and h is the fuzzy coefficient; i and j range over [1, n], n being the number of samples in the training sample set TR; the membership set u is obtained, and c(k) represents the k-th element of the set c;
Step2.2: obtaining the cluster centers of the system from the degrees of membership u_ij of Step2.1:
P_i = Σ_j (u_ij)^h x_j / Σ_j (u_ij)^h, j = 1, …, n
wherein P_i is the cluster center of the i-th class, i ranges over [1, n], n is the number of samples in the training sample set TR, and the cluster center set P is obtained;
Step2.3: combining the degrees of membership u_ij from Step2.1 and the cluster centers P_i from Step2.2, the objective function of the mixed-attribute data set is:
J_h = Σ_i Σ_j (u_ij)^h (d_ij)^2
wherein J_h is the value of the objective function, i ranges over [1, n], n is the number of samples in the training sample set TR, and the objective function set J is obtained;
Step2.4: establishing a fitness function from the objective function value J_h:
F_i = 1 / (J_h + ε)
wherein F_i is the value returned by the fitness function, i ranges over [1, n], n is the number of samples in the training sample set TR, the fitness function value set F is obtained, and ε is a positive number in the range (e^−20, e^−10);
Step2.5: determining the selection operator from the fitness function values F_i obtained in Step2.4:
PSE_i = F_i / Σ_k F_k, k = 1, …, n
wherein PSE_i is the selection operator, i and k range over [1, n], n is the number of samples in the training sample set TR, and the selection operator set PSE is obtained;
step2.6: encoding the clustering center P obtained from Step2.2 according to a multi-parameter binary encoding mode;
Step2.7: initializing the crossover operator P_c, the mutation operator P_m, the population size PQ, and the maximum number of genetic generations MGA; performing genetic operations on the encoded cluster centers P obtained in Step2.6, then substituting the cluster centers after genetic operation into Step2.3, recalculating Step2.4 to obtain a new fitness function value set, and then continuing to evolve the next generation's fitness function value set through the selection operator PSE, the crossover operator P_c, and the mutation operator P_m;
when the values in the fitness function set change little between generations, or the number of generations reaches MGA, the genetic operation stops; the minimum objective function value in the objective function value set J corresponding to the latest fitness function value set is added to the set OFV of objective function values, and the corresponding classification number k is added to the cluster number set CN; let k = k + 1 and repeat Step2 until k = C_max − C_min + 1, i.e., until the last element of the set c formed from the maximum classification number C_max and the minimum classification number C_min has been processed.
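Steps 2.1 and 2.2 above are the familiar fuzzy c-means update. The following sketch shows one such iteration, assuming Euclidean distance in place of the patent's mixed-attribute dissimilarity and a fuzzy coefficient h = 2; the function name and NumPy layout are ours, not from the patent:

```python
import numpy as np

def fcm_step(X, centers, h=2.0):
    """One fuzzy c-means update: memberships (Step2.1), then centers (Step2.2).

    X: (n, m) samples; centers: (c, m) current cluster centers.
    Implements u_ij = 1 / sum_r (d_ij/d_rj)^(2/(h-1)) and
    P_i = sum_j u_ij^h x_j / sum_j u_ij^h.
    """
    # d[i, j]: distance from center i to sample j
    d = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # guard against division by zero
    # ratio[i, r, j] = d_ij / d_rj; summing over r gives the denominator
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (h - 1.0))
    u = 1.0 / ratio.sum(axis=1)           # membership matrix u[i, j]
    uh = u ** h
    new_centers = (uh @ X) / uh.sum(axis=1, keepdims=True)
    return u, new_centers
```

Each column of u sums to 1, as the membership formula requires; the genetic operations of Step2.5-2.7 would then perturb the encoded centers between such updates.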
4. The genetic fuzzy clustering-based systematic data anomaly detection method according to claim 1, wherein the specific steps of Step3 are:
Step3.1: obtaining the minimum element set ME = min{OFV} of the set OFV, and the cluster number set SCN in the set CN corresponding to ME;
Step3.2: if the minimum element set ME has only one element, the corresponding set SCN also has only one element, and that element is the optimal cluster number C*; otherwise, performing Step3.3, analysis of variance;
Step3.3: taking the set ME as the control variable and the set SCN as the observation variable, performing analysis of variance, analyzing the result, and finding the element of the set SCN corresponding to the group with the largest significant difference, which is the optimal cluster number C*.
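When the minimum-element set has a single member, Step3 reduces to an arg-min lookup over the paired sets OFV and CN. A sketch (ours; the ANOVA tie-break of Step3.3 is not reproduced, so the candidate list is returned instead when several cluster numbers tie):

```python
def optimal_cluster_number(ofv, cn):
    """Pick C* per Step3.1-3.2: cluster numbers whose objective value
    equals min(OFV); a single survivor is C*, a tie returns the SCN list."""
    m = min(ofv)
    scn = [c for v, c in zip(ofv, cn) if v == m]
    return scn[0] if len(scn) == 1 else scn
```

With objective values for cluster numbers 6, 8, and 10, the smallest objective value selects the corresponding cluster number directly.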
5. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step5 comprises the following steps:
Step5.1: for the clustering result C obtained in Step4, with C_i, i = 1, 2, 3, …, C*, where C_i represents the i-th class result set, sorting the class result sets from largest to smallest by the number of samples in each;
Step5.2: letting i = 1 and K = N·η, where N = Count(TR), i.e., the total number of samples in the training sample set, and K is a threshold representing the critical point dividing normal clusters from abnormal clusters;
Step5.3: if Count(C_i) > K, then C_i is a normal cluster partition: C_i is added to the set NCRC and the corresponding cluster center is added to the set PNCRC; otherwise C_i is an abnormal cluster partition: C_i is added to the set ACRC and the corresponding cluster center is added to the set PACRC; then Step5.4 is executed;
Step5.4: letting i = i + 1; if i ≤ C*, Step5.3 is executed, otherwise the loop exits.
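The threshold test of Step5 can be sketched as follows. This is an illustration only: representing each cluster as a list of its samples and the function name are our assumptions.

```python
def mark_clusters(clusters, eta):
    """Split clusters into normal/abnormal per Step5: a cluster whose
    size exceeds K = N * eta (N = total sample count) is normal."""
    n_total = sum(len(c) for c in clusters)
    k = n_total * eta
    ncrc = [c for c in clusters if len(c) > k]   # normal clustering result classes
    acrc = [c for c in clusters if len(c) <= k]  # abnormal clustering result classes
    return ncrc, acrc
```

The rule encodes the assumption underlying the method: anomalies are rare, so any cluster holding less than a fraction η of the training data is treated as an abnormal class.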
6. The method for detecting the abnormal data in the system based on the genetic fuzzy clustering as claimed in claim 1, wherein the Step6 comprises the following steps:
Step6.1: separately calculating the distance measures from x_i to each element in the PNCRC and PACRC sets obtained in Step5;
Step6.2: comparing the distance measures: the subclass corresponding to the cluster center with the minimum distance measure to the data to be detected x_i is the cluster to which it belongs;
Step6.3: when the data to be detected x_i belongs to a cluster subclass in the normal clustering result class NCRC obtained in Step5, it is normal data;
when the data to be detected x_i belongs to a cluster subclass in the abnormal clustering result class ACRC obtained in Step5, it is abnormal data.
CN202010402204.3A 2020-05-13 2020-05-13 System data anomaly detection method based on genetic fuzzy clustering Active CN111666981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402204.3A CN111666981B (en) 2020-05-13 2020-05-13 System data anomaly detection method based on genetic fuzzy clustering


Publications (2)

Publication Number Publication Date
CN111666981A CN111666981A (en) 2020-09-15
CN111666981B true CN111666981B (en) 2023-03-31

Family

ID=72383542


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408370B (en) * 2021-05-31 2023-12-19 西安电子科技大学 Forest change remote sensing detection method based on adaptive parameter genetic algorithm
CN116109176B (en) * 2022-12-21 2024-01-05 成都安讯智服科技有限公司 Alarm abnormity prediction method and system based on collaborative clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239800A (en) * 2017-06-06 2017-10-10 常州工学院 Relaxation fuzzy c-means clustering algorithm
CN108710914A (en) * 2018-05-22 2018-10-26 常州工学院 A kind of unsupervised data classification method based on generalized fuzzy clustering algorithm
CN109446028A (en) * 2018-10-26 2019-03-08 中国人民解放军火箭军工程大学 A kind of cooled dehumidifier unit state monitoring method based on Genetic Algorithm Fuzzy C-Mean cluster
CN109669990A (en) * 2018-11-16 2019-04-23 重庆邮电大学 A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN



Similar Documents

Publication Publication Date Title
CN111626336B (en) Subway fault data classification method based on unbalanced data set
CN109891508B (en) Single cell type detection method, device, apparatus and storage medium
CN111666981B (en) System data anomaly detection method based on genetic fuzzy clustering
CN107292330A (en) A kind of iterative label Noise Identification algorithm based on supervised learning and semi-supervised learning double-point information
CN108647272A (en) A kind of small sample extending method based on data distribution
CN111079836A (en) Process data fault classification method based on pseudo label method and weak supervised learning
CN113259331A (en) Unknown abnormal flow online detection method and system based on incremental learning
CN111191726A (en) Fault classification method based on weak supervised learning multi-layer perceptron
CN111046930A (en) Power supply service satisfaction influence factor identification method based on decision tree algorithm
CN106896219A (en) The identification of transformer sub-health state and average remaining lifetime method of estimation based on Gases Dissolved in Transformer Oil data
CN114114039A (en) Method and device for evaluating consistency of single battery cells of battery system
CN114266289A (en) Complex equipment health state assessment method
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
CN116189909B (en) Clinical medicine discriminating method and system based on lifting algorithm
CN111105041B (en) Machine learning method and device for intelligent data collision
CN116864011A (en) Colorectal cancer molecular marker identification method and system based on multiple sets of chemical data
WO2019149133A1 (en) Resource processing method, storage medium, and computer device
Liu et al. Fuzzy c-mean algorithm based on Mahalanobis distances and better initial values
CN110991517A (en) Classification method and system for unbalanced data set in stroke
CN112102882B (en) Quality control system and method for NGS detection process of tumor sample
CN113392086B (en) Medical database construction method, device and equipment based on Internet of things
CN109492705A (en) Method for diagnosing faults of the one kind based on mahalanobis distance (MD) area measurement
CN114548306A (en) Intelligent monitoring method for early drilling overflow based on misclassification cost
Wen A unified view of false discovery rate control: Reconciliation of bayesian and frequentist approaches
CN113255810A (en) Network model testing method based on key decision logic design test coverage rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant