CN115017988A - Competitive clustering method for state anomaly diagnosis - Google Patents


Info

Publication number: CN115017988A
Application number: CN202210619146.9A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 王培红, 徐璐璐, 汤若鑫, 高俊彦, 陈文菲
Current and original assignee: Southeast University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Southeast University
Priority to CN202210619146.9A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: cluster, sample, class, clustering

Landscapes

  • Image Analysis (AREA)
Abstract

The invention discloses a competitive clustering method for state anomaly diagnosis, relating to the technical field of data mining. It solves the technical problems that conventional clustering methods cannot effectively retain small abnormal sample classes and have poor clustering performance. As iteration proceeds, the cardinality of each cluster is calculated, and spurious clusters smaller than a set threshold are eliminated. Through competition among clusters, the number of clusters gradually decreases until it stabilizes; when the cluster-center positions no longer change or the maximum number of iterations is reached, the algorithm terminates and outputs the result. Clustering of the data set is thus realized, clustering performance is improved, and applications based on the clustering characteristics of data samples are expanded.

Description

Competitive clustering method for state anomaly diagnosis
Technical Field
The application relates to the technical field of data mining, in particular to big data processing technology, and specifically to a competitive clustering method for state anomaly diagnosis.
Background
Clustering is one of the most common techniques in the field of data mining for finding unknown object classes in a data set. Cluster analysis has broad application prospects in customer segmentation, pattern recognition, medical decision-making, anomaly detection, and other fields. Traditional clustering algorithms handle balanced data well, but much real-world data is unbalanced: in fields such as medical diagnosis and fault diagnosis, the amount of data exhibiting normal behavior far exceeds the amount exhibiting abnormal behavior. An unbalanced data set is one in which the number and density of data objects belonging to one category differ greatly from those of the other categories; generally, the class with more data objects is called the large class, and the class with fewer data objects is called the small class. Current clustering methods mainly reflect the clustering characteristics of the balanced sample classes, while the small abnormal (or fault) sample classes are often ignored, or part of the objects of a large class are assigned to the small classes so that the resulting classes have relatively uniform sizes; this limits applications based on the clustering characteristics of data samples.
To solve the clustering problem of unbalanced data, scholars have proposed various methods from different angles, falling into three categories: data preprocessing, multi-center-point methods, and objective-function optimization. The first is data preprocessing, which under-samples or over-samples the data set before clustering. Under-sampling uses only a representative subset of the large class, so a large amount of effective information in the large class is ignored and the clustering effect suffers. Over-sampling increases the number of objects in the small class so that the data set reaches a balanced state, but it may cause overfitting on the one hand and introduce noise into the data set on the other.
The second is the multi-center method, which addresses the "uniform effect" of fuzzy clustering algorithms by using multiple class centers instead of a single center to represent a class. However, for unbalanced clustering problems in which the large class is distributed extremely unevenly, this method cannot fully reflect the data distribution characteristics, which reduces the effectiveness of the algorithm.
The third is objective-function optimization, which proposes new algorithms from the viewpoint of optimizing the objective function and derives the corresponding clusters to alleviate the uniform effect. Compared with earlier clustering algorithms this is a direct new approach with practical value, but it generally involves solving for objective-function parameters, which is a nonlinear optimization problem for which a global optimum is hard to obtain; the clustering result therefore has relatively large randomness, which affects clustering precision.
At present, there is no effective clustering method that can both automatically calculate the number of clusters and effectively retain the small abnormal (or fault) sample classes.
Disclosure of Invention
The application provides a competitive clustering method for state anomaly diagnosis, which aims to effectively retain small abnormal (or fault) sample classes while automatically calculating the number of clusters and improving clustering performance.
The technical purpose of the application is realized by the following technical scheme:
A competitive clustering method for state anomaly diagnosis, comprising:
S1: inputting a data set U; setting the initial number of clusters c = c_max; determining the fuzzy weighting index m, the initial value η_0, the iteration time constant τ, and the cluster cardinality threshold N; randomly generating a first cluster-center set V1; and obtaining the initial sample memberships of the data set U through a fuzzy C-means clustering algorithm; wherein U = {x_j | j = 1, ..., n}, x_j ∈ U denotes a sample in the data set U, and n denotes the total number of samples of U; V1 = {v_i | i = 1, ..., c}, c denotes the total number of cluster centers of the data set U, and v_i denotes the center of the i-th cluster;
S2: calculating the Euclidean distance between a sample x_j and a cluster center v_i; obtaining a scaling coefficient α from the Euclidean distance and the initial sample membership; and constructing the objective function of the competitive clustering algorithm from the Euclidean distance and the scaling coefficient α;
S3: calculating the sample memberships through the objective function;
S4: calculating the cardinality N_i of the i-th cluster; if N_i is less than the cardinality threshold N, eliminating that cluster to obtain the sample memberships and a second cluster-center set V2′ corresponding to the retained clusters;
S5: calculating the compactness C_i of each cluster from the sample memberships and the cluster-center set V2′, and then updating the sample memberships and cluster centers according to the cluster compactness C_i to obtain the final sample memberships and second cluster-center set V2 of this iteration;
S6: when the cluster-center positions no longer change or the maximum number of iterations is reached, outputting the final result to finish clustering; otherwise, repeating steps S2 to S5.
Further, in step S2, the scaling coefficient α is obtained from the Euclidean distance and the initial sample membership, expressed as:

(equation for α rendered as an image in the source; not reproduced)

η(k) = η_0 · exp(−k/τ);

where d_ij denotes the Euclidean distance from sample x_j to cluster center v_i; u_ij denotes the membership of the j-th sample in the i-th cluster; m denotes the fuzzy weighting index, taken as 2; and k denotes the iteration number.

The objective function is then expressed as:

(two equations rendered as images in the source; not reproduced)
Further, in step S3, the sample membership is calculated from the objective function by the Lagrange multiplier method, expressed as:

(three equations rendered as images in the source; not reproduced)
Further, in step S5, the cluster compactness C_i of each cluster is expressed as:

(equation for C_i rendered as an image in the source; not reproduced)

where:

T_i = {x_j | u_ij > u_lj; l = 1, 2, ..., c; l ≠ i};

η_j = ||x_j − v_i||;

(equation for μ_i rendered as an image in the source; not reproduced)

T_i denotes the set of samples assigned to the i-th cluster; |T_i| denotes the number of samples in that set; η_j denotes the filtered value of sample x_j; and μ_i denotes the average distance from the samples of the i-th cluster to the cluster center v_i.
Further, in step S5, the sample memberships and cluster centers are updated according to the cluster compactness C_i, expressed as:

(four equations rendered as images in the source; not reproduced)

where f_i denotes the coefficient assigned to the i-th cluster; S_i is the normalized compactness of the i-th cluster; and S_min is the minimum of S_i.
The beneficial effect of this application lies in the following: on the basis of automatically calculating the number of clusters, the objective function of the competitive clustering algorithm is improved so that sample size plays a role in the clustering cost function, which weakens the interference of sample-size differences on clustering decisions. The resulting new membership-calculation method can adaptively adjust the memberships of the large and small classes, improving the clustering effect of the algorithm on unbalanced data sets. Small abnormal (or fault) sample classes are effectively retained while the number of clusters is calculated automatically, improving clustering performance and expanding applications based on the clustering characteristics of data samples.
Drawings
FIG. 1 is a flow chart of a method described herein;
fig. 2 is a schematic diagram illustrating a comparison between the clustering result and other clustering algorithms in the embodiment of the present application.
Detailed Description
The technical solution of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the method of the present application, a competitive clustering method for state anomaly diagnosis. Three unbalanced classes in the Aggregation data set of the UCI standard data sets are selected as the verification data set U of the present invention. The method includes the following steps:
S1: inputting the data set U; setting the initial number of clusters c = c_max = 10, the fuzzy weighting index m = 2, the initial value η_0 = 1.3, the iteration time constant τ = 10, and the cluster cardinality threshold N = 7; randomly generating c_max cluster centers; and obtaining the initial sample memberships of the data set U through a fuzzy C-means clustering algorithm.
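Step S1 initialises the memberships with the fuzzy C-means algorithm. The sketch below illustrates that initialisation for a fixed set of centers, assuming the standard FCM membership formula (the patent does not spell it out); `init_fcm_membership` and `eps` are illustrative names, not from the source.

```python
import numpy as np

def init_fcm_membership(X, V, m=2.0, eps=1e-12):
    """Sketch of the S1 membership initialisation for fixed centers V.

    Assumes the standard FCM rule u_ij = d_ij^(-2/(m-1)) / sum_l d_lj^(-2/(m-1)).
    Rows of the returned matrix index the c clusters, columns the n samples;
    eps guards against division by zero when a sample coincides with a center.
    """
    # squared Euclidean distances d_ij^2, shape (c, n)
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    # normalise over clusters so each sample's memberships sum to 1
    return inv / inv.sum(axis=0, keepdims=True)
```

In a full run these memberships would be refined by the competitive updates of steps S2–S5.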
S2: calculating a sample x j And cluster heart v i And obtaining a proportional coefficient alpha according to the Euclidean distance and the initial sample membership degree, and constructing a target function of a competitive clustering algorithm according to the Euclidean distance and the proportional coefficient alpha.
The Euclidean distance d_ij is calculated as:

d_ij = ||x_j − v_i|| = ( Σ_{p=1}^{P} (x_jp − v_ip)² )^(1/2);

where d_ij denotes the distance from sample x_j to cluster center v_i, and P denotes the dimension of x_j.
From the obtained d_ij and u_ij, the scaling coefficient α is calculated as:

(equation for α rendered as an image in the source; not reproduced)

η(k) = η_0 · exp(−k/τ).
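The distance computation and the decay schedule η(k) = η_0·exp(−k/τ) of step S2 can be sketched as follows. `pairwise_dist2` and `eta` are illustrative names, and the parameter defaults follow the embodiment (η_0 = 1.3, τ = 10); the α formula itself is an equation image in the source and is not reproduced here.

```python
import numpy as np

def pairwise_dist2(X, V):
    """Squared Euclidean distances d_ij^2 = ||x_j - v_i||^2 of step S2.

    Rows of the result index the c cluster centers, columns the n samples.
    """
    return ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

def eta(k, eta0=1.3, tau=10.0):
    """Decay schedule eta(k) = eta0 * exp(-k / tau).

    The competition weight shrinks with the iteration count k, so early
    iterations compete aggressively and later ones refine the partition.
    """
    return eta0 * np.exp(-k / tau)
```

At k = τ the weight has decayed to η_0/e, roughly a third of its initial value.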
The final objective function is expressed as:

(two equations rendered as images in the source; not reproduced)

where u_ij denotes the membership of the j-th sample in the i-th cluster; m denotes the fuzzy weighting index, taken as 2; and k denotes the iteration number.
S3: and calculating the sample membership degree through the target function.
Specifically, the sample membership is calculated as:

(equation rendered as an image in the source; not reproduced)

where N_i denotes the cardinality of the i-th cluster (its defining equation is rendered as an image in the source).
S4: calculating the cardinality N of each class cluster i If N is present i And if the sample membership degree is less than the radix threshold value 7, eliminating the class cluster to obtain a sample membership degree and a second cluster center set V2' corresponding to the reserved class cluster.
S5: besides considering the influence of the class size on the objective function, the influence of the sample distribution of each class on the clustering result must be noted. The present application presents a cluster compactness C i The calculation formula of (2) is used for measuring the distribution state of the samples in the class, so as to obtain the final sample membership and the second cluster center set V2, C of the current iteration i Is expressed as:
Figure BDA0003674412930000054
where:

T_i = {x_j | u_ij > u_lj; l = 1, 2, ..., c; l ≠ i};

η_j = ||x_j − v_i||;

(equation for μ_i rendered as an image in the source; not reproduced)

T_i denotes the set of samples assigned to the i-th cluster; |T_i| denotes the number of samples in that set; η_j denotes the filtered value of sample x_j; and μ_i denotes the average distance from the samples of the i-th cluster to the cluster center v_i.
From the cluster compactness formula it can be seen that the smaller the value of C_i, the more concentrated the class and the higher its compactness; conversely, the larger the value, the more dispersed the class and the lower its compactness.
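As an illustration of the "smaller C_i means tighter cluster" property, the sketch below uses the variance of the sample-to-center distances as a stand-in compactness. The patent's exact C_i formula is an equation image in the source, so this is not the patented formula, only a quantity built from the same ingredients (T_i, η_j = ||x_j − v_i||, and the mean distance μ_i) with the same qualitative behaviour.

```python
import numpy as np

def cluster_compactness(X, V, U):
    """Illustrative stand-in for the compactness C_i of step S5.

    T_i is taken as the samples whose membership is highest for cluster i;
    C_i is the variance of the distances eta_j = ||x_j - v_i|| around their
    mean mu_i, so smaller C_i means a tighter cluster. Empty clusters get inf.
    """
    labels = U.argmax(axis=0)                    # hard assignment T_i
    C = np.full(len(V), np.inf)
    for i in range(len(V)):
        pts = X[labels == i]
        if len(pts) == 0:
            continue
        d = np.linalg.norm(pts - V[i], axis=1)   # eta_j for samples in T_i
        C[i] = ((d - d.mean()) ** 2).mean()      # spread around mu_i
    return C
```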
The sample memberships and cluster centers are updated according to the cluster compactness C_i, expressed as:

(four equations rendered as images in the source; not reproduced)

where f_i denotes the coefficient assigned to the i-th cluster; S_i is the normalized compactness of the i-th cluster; and S_min is the minimum of S_i.
S6: competition among clusters, the number of cluster cores is gradually reduced to be stable, when the position of the cluster core is not changed any more or the number of iterations is reached, a final result is output, and clustering is completed; otherwise, steps S2 to S5 are repeated.
The fuzzy C-means clustering algorithm and the competitive clustering algorithm are selected as comparison algorithms. The competitive clustering algorithm evolved from the fuzzy C-means clustering algorithm; its advantage is that it can automatically calculate the number of clusters, whereas the fuzzy C-means clustering algorithm requires the number of clusters to be set in advance. For fairness, the number of clusters obtained by the proposed competitive clustering method for state anomaly (small sample) diagnosis is used as the preset cluster number for the fuzzy C-means clustering algorithm. For the competitive clustering algorithm, η_0 is set accordingly, and the other parameter settings are the same as those of the proposed method.
Fig. 2 compares the clustering results of the three algorithms on the same data set; the cluster-center positions are shown as "+" symbols superimposed on the data, and the final clusters are circled. Fig. 2(a) is the data set used for verification. As seen in Fig. 2(b), the fuzzy C-means clustering algorithm divides the data into 3 classes given an initial setting of 3 clusters, but cannot effectively identify the differences between the large and small classes. As seen in Fig. 2(c), the competitive clustering algorithm still cannot overcome this drawback of the fuzzy C-means clustering algorithm; moreover, its competition mechanism automatically ignores the small class on the right, so the 3 classes are wrongly divided into 2. Since fault points generally resemble a small class, this indicates that the algorithm cannot effectively identify the fault class in some cases. Fig. 2(d) shows the clustering result of the proposed competitive clustering method for state anomaly (small sample) diagnosis on the same data set: the three classes with large differences in number and density are correctly separated, indicating that the algorithm can effectively identify the fault class while automatically calculating the number of clusters.
According to the method, the traditional membership-calculation method is improved so that the memberships of the large and small classes can be adaptively adjusted; the small abnormal (or fault) sample classes are thereby effectively retained, and the clustering effect of the algorithm on unbalanced data sets is improved.
The above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and are not limited thereto; the technical solutions described in the foregoing embodiments of the present invention can be modified or equivalent replaced by those skilled in the art, without departing from the structure of the present invention or exceeding the scope defined by the claims.

Claims (5)

1. A competitive clustering method for state anomaly diagnosis, comprising:
S1: inputting a data set U; setting the initial number of clusters c = c_max; determining the fuzzy weighting index m, the initial value η_0, the iteration time constant τ, and the cluster cardinality threshold N; randomly generating a first cluster-center set V1; and obtaining the initial sample memberships of the data set U through a fuzzy C-means clustering algorithm; wherein U = {x_j | j = 1, ..., n}, x_j ∈ U denotes a sample in the data set U, and n denotes the total number of samples of U; V1 = {v_i | i = 1, ..., c}, c denotes the total number of cluster centers of the data set U, and v_i denotes the center of the i-th cluster;
S2: calculating the Euclidean distance between a sample x_j and a cluster center v_i; obtaining a scaling coefficient α from the Euclidean distance and the initial sample membership; and constructing the objective function of the competitive clustering algorithm from the Euclidean distance and the scaling coefficient α;
S3: calculating the sample memberships through the objective function;
S4: calculating the cardinality N_i of the i-th cluster; if N_i is less than the cardinality threshold N, eliminating that cluster to obtain the sample memberships and a second cluster-center set V2′ corresponding to the retained clusters;
S5: calculating the compactness C_i of each cluster from the sample memberships and the cluster-center set V2′, and then updating the sample memberships and cluster centers according to the cluster compactness C_i to obtain the final sample memberships and second cluster-center set V2 of this iteration;
S6: when the cluster-center positions no longer change or the maximum number of iterations is reached, outputting the final result to finish clustering; otherwise, repeating steps S2 to S5.
2. The competitive clustering method for state anomaly diagnosis according to claim 1, wherein in step S2 the scaling coefficient α is obtained from the Euclidean distance and the initial sample membership, expressed as:

(equation for α rendered as an image in the source; not reproduced)

η(k) = η_0 · exp(−k/τ);

where d_ij denotes the Euclidean distance from sample x_j to cluster center v_i; u_ij denotes the membership of the j-th sample in the i-th cluster; m denotes the fuzzy weighting index, taken as 2; and k denotes the iteration number;

the objective function is then expressed as:

(two equations rendered as images in the source; not reproduced)
3. The competitive clustering method for state anomaly diagnosis according to claim 1, wherein in step S3 the sample membership is calculated from the objective function by the Lagrange multiplier method, expressed as:

(three equations rendered as images in the source; not reproduced)
4. The competitive clustering method for state anomaly diagnosis according to claim 1, wherein in step S5 the cluster compactness C_i of each cluster is expressed as:

(equation for C_i rendered as an image in the source; not reproduced)

where:

T_i = {x_j | u_ij > u_lj; l = 1, 2, ..., c; l ≠ i};

η_j = ||x_j − v_i||;

(equation for μ_i rendered as an image in the source; not reproduced)

T_i denotes the set of samples assigned to the i-th cluster; |T_i| denotes the number of samples in that set; η_j denotes the filtered value of sample x_j; and μ_i denotes the average distance from the samples of the i-th cluster to the cluster center v_i.
5. The competitive clustering method according to claim 1, wherein in step S5 the sample memberships and cluster centers are updated according to the cluster compactness C_i, expressed as:

(four equations rendered as images in the source; not reproduced)

where f_i denotes the coefficient assigned to the i-th cluster; S_i is the normalized compactness of the i-th cluster; and S_min is the minimum of S_i.
CN202210619146.9A 2022-06-01 2022-06-01 Competitive clustering method for state anomaly diagnosis Pending CN115017988A (en)

Priority Applications (1)

Application Number: CN202210619146.9A — Priority Date: 2022-06-01 — Filing Date: 2022-06-01 — Title: Competitive clustering method for state anomaly diagnosis

Publications (1)

Publication Number: CN115017988A — Publication Date: 2022-09-06

Family ID: 83072562

Country Status (1)

CN (1): CN115017988A (en)
Cited By (2)

* Cited by examiner, † Cited by third party

CN116975672A * — Priority date 2023-09-22 — Publication date 2023-10-31 — Assignee: 山东乐普矿用设备股份有限公司 — Temperature monitoring method and system for coal mine belt conveying motor
CN116975672B * — Priority date 2023-09-22 — Publication date 2023-12-15 — Assignee: 山东乐普矿用设备股份有限公司 — Temperature monitoring method and system for coal mine belt conveying motor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination