CN104102726A

CN104102726A - Modified K-means clustering algorithm based on hierarchical clustering

Info

Publication number: CN104102726A
Application number: CN201410350480.4A
Authority: CN
Inventors: 刘晓波; 张明明; 袁光前; 陈鹏
Original assignee: Nanchang Hangkong University
Current assignee: Nanchang Hangkong University
Priority date: 2014-07-22
Filing date: 2014-07-22
Publication date: 2014-10-15

Abstract

The invention discloses a modified K-means clustering algorithm based on hierarchical clustering. The algorithm comprises the steps of: calculating the pairwise distances between n objects; constructing n single-member clusters; obtaining a hierarchical clustering of the data; when the data are divided into K classes, calculating the cluster center of each class; and, taking the obtained cluster centers as the initial cluster centers of K-means and the K value as the number of K-means clusters, performing K-means clustering to obtain a clustering result. A comparison of the within-group variance and the between-group variance of the traditional K-means clustering result and the modified K-means clustering result shows that the modified K-means clustering is more accurate and more reasonable. The respective advantages of hierarchical clustering and K-means clustering are exploited as far as possible, the respective shortcomings of the two methods are avoided, and clustering quality is greatly improved.

Description

Improved K-means clustering algorithm based on hierarchical clustering

Technical field

The invention belongs to the field of aero-engine technology, and specifically relates to an improved K-means clustering algorithm based on hierarchical clustering.

Technical background

The aero-engine, as the core power unit of an aircraft, directly affects the safety and performance of aircraft operation. Engine reliability bears on service life, economic benefit, and the safety of passengers' lives and property. Because the engine is an internal component of the aircraft, technicians can generally only judge whether it has a problem after a flight, relying on their own experience and on simple equipment and instruments, which cannot guarantee diagnostic accuracy. Moreover, as aero-engine faults become more complex and diverse, technicians' ability to recognize them is correspondingly limited. Condition monitoring and fault diagnosis technology therefore plays an important role in aero-engine maintenance: it can detect fault signals without disassembling the engine and use the monitored fault data to judge the working state and development trend of components, so that the location and nature of a fault can be diagnosed quickly and accurately.

To find engine faults accurately and quickly, sensors are generally installed at multiple positions on the engine to collect engine signals. Different sensors respond to signals from different parts, and under some interference conditions a given sensor may fail to detect a fault signal. By exploiting the redundancy, complementarity, and cooperation of the measurements from sensors at different positions, data on the fault signal can still be obtained. Fusing the signals collected by these sensors reveals the fault features and thereby enables engine fault diagnosis. How to improve the accuracy and efficiency of aero-engine data fusion has therefore become a pressing problem.

Clustering, as an unsupervised learning method, is widely applied to data processing in all trades and professions. Traditional clustering algorithms have a number of shortcomings. Hierarchical clustering, for example, works well largely only on small databases; if the database contains too many data units, its scalability deteriorates. In addition, a merge made by hierarchical clustering cannot be undone, and data units cannot subsequently be exchanged between the classes already formed. The K-means algorithm, as another example, depends heavily on its initial values; if k is chosen poorly, the final result may be unsatisfactory.
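The dependence of K-means on its initial values can be illustrated with a small sketch. The data values, the function names `kmeans_1d` and `sse`, and the two sets of initial centers are hypothetical and are not taken from the patent; the routine is the standard Lloyd iteration on scalar data, assumed here only for illustration.

```python
def kmeans_1d(data, centers, max_iter=100):
    """Lloyd's algorithm on scalar data, starting from the given centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, groups


def sse(centers, groups):
    """Sum of squared distances from each point to its group's center."""
    return sum((x - c) ** 2 for c, g in zip(centers, groups) for x in g)


data = [0, 2, 10, 12, 100, 102]        # hypothetical values with three obvious groups
good = kmeans_1d(data, [1, 11, 101])   # one initial center per group
bad = kmeans_1d(data, [0, 100, 102])   # two initial centers inside the same group

print(sse(*good), sse(*bad))
```

With the second set of initial centers the iteration settles on a partition that lumps four points together and leaves two single-member groups, so its squared error is much larger even though the data and k are unchanged.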

Summary of the invention

The object of the present invention is to provide an improved K-means clustering algorithm for aero-engine data fusion. Vibration displacement signals are measured with sensors, and the improved K-means clustering algorithm based on hierarchical clustering is used to fuse the aero-engine data, improving the accuracy and efficiency of data fusion and providing a basis for aero-engine fault diagnosis.

The present invention achieves the above object through the following technical solution. The steps of the improved K-means clustering algorithm based on hierarchical clustering are:

1) Calculate the pairwise distances between the n objects. The distance used is the Euclidean distance, and the calculation yields a distance matrix;

2) Construct n single-member clusters, find the two nearest clusters, and merge them into one class, so that the number of clusters decreases by one; continue in the same way, calculating the distances between each newly generated cluster and the other clusters;

3) Obtain the hierarchical clustering of the data according to step 2). If the merging is left unconstrained, the clusters eventually merge into a single class. When the data are divided into k classes and the function

$$S(k) = \frac{\sum_{i=1}^{k} \sum_{x \in c_{i}} |x - \bar{x}_{i}|^{2}}{\min_{i, j < k\ (i \neq j)} |\bar{x}_{i} - \bar{x}_{j}|^{2}}$$

(where c_i denotes the i-th class and \bar{x}_i its mean) attains its minimum, the data are divided into k classes;

4) When the data are divided into K classes, calculate the cluster center of each class, C₁, C₂, ..., C_k;

5) Take the cluster centers obtained in step 4) as the initial cluster centers of K-means and the K value as the number of K-means clusters, and perform K-means clustering to obtain the clustering result.
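Steps 1) to 5) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the samples are treated as scalars, the distance between two clusters is taken as the distance between their means (the source does not name a linkage rule), the minimum in the denominator of S(k) is taken over all pairs of distinct centers, and the function names and the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def agglomerate(data, k):
    """Steps 1-2: merge the two nearest clusters until k clusters remain.

    Assumption: cluster distance = distance between cluster means.
    """
    clusters = [[x] for x in data]
    while len(clusters) > k:
        # All pairwise cluster distances (Euclidean distance on scalars).
        pairs = [(abs(mean(clusters[a]) - mean(clusters[b])), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]          # b > a, so index a is unaffected
    return clusters


def s_index(clusters):
    """S(k): within-cluster squared error over the smallest squared center gap."""
    centers = [mean(c) for c in clusters]
    within = sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)
    gap = min((centers[a] - centers[b]) ** 2
              for a in range(len(centers))
              for b in range(a + 1, len(centers)))
    return within / gap


def kmeans_1d(data, centers, max_iter=100):
    """Step 5: Lloyd's algorithm started from the supplied centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, groups


def improved_kmeans(data, candidate_ks):
    # Step 3: keep the hierarchical partition whose S(k) is smallest.
    best = min((agglomerate(data, k) for k in candidate_ks), key=s_index)
    # Step 4: the cluster means become the initial K-means centers.
    seeds = [mean(c) for c in best]
    return kmeans_1d(data, seeds)


data = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]  # hypothetical samples
centers, groups = improved_kmeans(data, candidate_ks=(2, 3, 4))
print(centers, groups)
```

For simplicity the sketch reruns the agglomeration for every candidate k (each at least 2); a practical version would build the merge history once and cut it at each k.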

On the basis of the above steps, the present invention also clusters the data with the traditional K-means clustering method and compares the result with the improved K-means clustering result. Comparing the within-group variance and the between-group variance of the two shows that the improved K-means clustering is more accurate and reasonable.

The present invention extracts only a portion of the data in a large database as a representative sample, which overcomes the poor scalability of hierarchical clustering on large numbers of data units. The hierarchical algorithm supplies the initial values and determines the k value, which resolves the K-means algorithm's dependence on its initial values and thus reduces the probability that K-means produces an unsatisfactory result. Comparing the within-group and between-group variances of the traditional clustering result and the improved K-means clustering result shows that the improved K-means clustering is more accurate and reasonable. The respective advantages of hierarchical clustering and K-means clustering are exploited as far as possible, the shortcomings of each method are avoided, and clustering quality is improved as far as possible.

Brief description of the drawings

Fig. 1 is the dendrogram for the misalignment condition in the present invention.

Fig. 2 is the dendrogram for the dynamic unbalance condition in the present invention.

Fig. 3 is the dendrogram for the rubbing condition in the present invention.

Fig. 4 is the dendrogram for the fault-free condition in the present invention.

Embodiment

The present invention is described in detail below through data fusion for several typical aero-engine faults and for the fault-free state:

The sampled vibration displacement signal data measured by the sensors are given in Table 1:

Table 1: Sampled data

1. State mode: misalignment

The calculated Euclidean distance matrix is as follows:

The Euclidean distances between the data points are as follows (for example, the distance between the first point and the second point is the entry in the first row, second column of the Euclidean distance matrix):
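The distance matrix itself is not reproduced in this text. As a hedged sketch of how such a matrix can be built, the snippet below uses `math.dist` from the Python standard library; the sample values and the name `distance_matrix` are hypothetical, and each sample is written as a one-element tuple on the assumption that the measurements are scalar.

```python
import math


def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]


# Hypothetical displacement samples, one coordinate per sample.
samples = [(8.0,), (8.5,), (23.0,), (35.5,)]
matrix = distance_matrix(samples)

# Row 1, column 2 holds the distance between the first and second points.
print(matrix[0][1])
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be inspected when searching for the nearest pair.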

The 24 original data points under the misalignment condition are numbered 1-24. At the start of clustering, the 24 data objects are treated as 24 initial classes. The two classes with the smallest Euclidean distance between them are then merged into one class, and so on until all the data are gathered into a single class. Fig. 1 shows this merging process.

As can be seen from Fig. 1, if the merging is left unconstrained, the clusters eventually merge into a single class. The present invention therefore introduces the following constraint. The distances between each newly generated cluster and the other clusters are calculated, and when the function

$$S(k) = \frac{\sum_{i=1}^{k} \sum_{x \in c_{i}} |x - \bar{x}_{i}|^{2}}{\min_{i, j < k\ (i \neq j)} |\bar{x}_{i} - \bar{x}_{j}|^{2}}$$

attains its minimum value, the algorithm ends.
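As a worked illustration of S(k), consider a hypothetical partition (not the patent's data) into k = 3 classes c_1 = {1, 2}, c_2 = {5, 6}, c_3 = {10, 11}, taking the minimum in the denominator over all pairs of distinct class means, which is an assumption about how the index condition is meant:

```latex
\begin{aligned}
\bar{x}_1 &= 1.5, \quad \bar{x}_2 = 5.5, \quad \bar{x}_3 = 10.5 \\
\sum_{i=1}^{3} \sum_{x \in c_i} |x - \bar{x}_i|^2 &= 3 \times (0.5^2 + 0.5^2) = 1.5 \\
\min_{i \neq j} |\bar{x}_i - \bar{x}_j|^2 &= (5.5 - 1.5)^2 = 16 \\
S(3) &= \frac{1.5}{16} \approx 0.094
\end{aligned}
```

A small S(k) therefore corresponds to classes that are tight internally (small numerator) and whose two closest centers are still far apart (large denominator).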

Fig. 1 clearly shows that clustering into three classes or two classes works best. Furthermore, S(3) = 1.8936 and S(2) = 2.2507.

Hierarchical clustering therefore yields three classes, and three is used as the K value for the subsequent K-means clustering.

The mean of all data objects in each cluster is calculated and used as an initial cluster center for K-means clustering. The cluster centers when the data are clustered into three groups are 8.125, 23.1175, and 35.579.

With the cluster centers obtained above as the initial cluster centers of K-means and the K value as the number of K-means clusters, K-means clustering is performed to obtain the clustering result.

The data are also clustered with the traditional K-means clustering method, and the result is compared with the improved K-means clustering result.

In traditional K-means clustering the number of clusters is chosen arbitrarily; here 2 clusters are selected, and the initial cluster centers are drawn at random from the data objects to be clustered.

Comparing the within-group variance and the between-group variance of the two results shows that the improved K-means clustering is more accurate and reasonable.

The within-group variances and the between-group variance of the traditional K-means clustering result are calculated, where S1 denotes the within-group variance of the first group, S2 the within-group variance of the second group, and S the between-group variance:

S1 = 26.83, S2 = 56.107, S = 101.8

The within-group variances and the between-group variance of the improved K-means clustering result are:

S1 = 3.33, S2 = 6.98, S3 = 29.83, S = 132.57
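The text does not state the formulas behind S1, S2, S3, and S. The sketch below therefore assumes one common pair of definitions: the population variance of each group as the within-group variance, and the size-weighted variance of the group means around the grand mean as the between-group variance. The function names and the data are hypothetical, and the numbers it produces are not expected to reproduce the values reported above.

```python
def mean(values):
    return sum(values) / len(values)


def within_group_variance(group):
    """Population variance of one group (assumed definition)."""
    m = mean(group)
    return sum((x - m) ** 2 for x in group) / len(group)


def between_group_variance(groups):
    """Size-weighted variance of group means around the grand mean (assumed definition)."""
    total = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / total


# Two hypothetical partitions of the same nine samples.
two_groups = [[1, 2, 3, 11, 12, 13], [21, 22, 23]]
three_groups = [[1, 2, 3], [11, 12, 13], [21, 22, 23]]

for name, partition in (("two groups", two_groups), ("three groups", three_groups)):
    within = [within_group_variance(g) for g in partition]
    print(name, within, between_group_variance(partition))
```

On this toy data the three-group partition has smaller within-group variances and a larger between-group variance than the two-group partition, which is the pattern the comparison in the text looks for.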

The dynamic unbalance, rubbing, and fault-free conditions are clustered following the same steps.

2. State mode: dynamic unbalance

The calculated Euclidean distance matrix is as follows:

The Euclidean distances between the data points are as follows:

As shown in Fig. 2, the 18 original data points under the dynamic unbalance condition are numbered 1-18. At the start of clustering, the 18 data objects are treated as 18 initial classes. The two classes with the smallest Euclidean distance between them are then merged into one class, and so on until all the data are gathered into a single class. Fig. 2 shows this merging process.

S(3) = 2.08, S(4) = 2.04.

The cluster centers for K = 4 are 17.32, 23.91, 33.2, and 41.58.

Variances of the traditional result: S1 = 4.98, S2 = 8.05, S3 = 45.99, S = 117.58.

Variances of the improved result: S1 = 26.50, S2 = 12.30, S3 = 2.43, S4 = 4.08, S = 124.22.

3. State mode: rubbing

The calculated Euclidean distance matrix is as follows:

The Euclidean distances between the data points are as follows:

As shown in Fig. 3, the 24 original data points under the rubbing condition are numbered 1-24. At the start of clustering, the 24 data objects are treated as 24 initial classes. The two classes with the smallest Euclidean distance between them are then merged into one class, and so on until all the data are gathered into a single class. Fig. 3 shows this merging process.

S(3) = 1.4446, S(2) = 2.8598.

The cluster centers for K = 3 are 12.14, 23.97, and 35.68.

Variances of the traditional result: S1 = 20.27, S2 = 33.57, S = 32.59.

Variances of the improved result: S1 = 9.77, S2 = 8.42, S3 = 13.64, S = 53.926.

4. State mode: fault-free

The calculated Euclidean distance matrix is as follows:

The Euclidean distances between the data points are as follows:

As shown in Fig. 4, the 24 original data points under the fault-free condition are numbered 1-24. At the start of clustering, the 24 data objects are treated as 24 initial classes. The two classes with the smallest Euclidean distance between them are then merged into one class, and so on until all the data are gathered into a single class. Fig. 4 shows this merging process.

S(3) = 1.3994, S(2) = 1.8921.

The three cluster centers for K = 3 are 13.568, 23.728, and 32.77.
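The choice of K in each of the four conditions follows from picking the candidate with the smallest reported S(k). The short check below uses only the S values quoted in the text; the dictionary layout and the condition labels are illustrative choices.

```python
# S(k) values as reported in the text for each operating condition.
reported_s = {
    "misalignment": {2: 2.2507, 3: 1.8936},
    "dynamic unbalance": {3: 2.08, 4: 2.04},
    "rubbing": {2: 2.8598, 3: 1.4446},
    "fault-free": {2: 1.8921, 3: 1.3994},
}

# For each condition, keep the candidate k with the smallest S(k).
chosen_k = {name: min(values, key=values.get) for name, values in reported_s.items()}

for name, k in chosen_k.items():
    print(f"{name}: K = {k}")
```

The selected values (3, 4, 3, and 3) agree with the K used for the K-means step in each of the four conditions.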

Variances of the traditional result: S1 = 11.67, S2 = 21.17, S = 53.68.

Variances of the improved result: S1 = 8.67, S2 = 3.03, S3 = 3.01, S = 66.39.

A comparison across the four operating conditions above shows that the within-group variances of the improved K-means clustering results are all smaller than those of the traditional K-means clustering results, indicating that the data objects within each group are more similar to one another under the improved K-means clustering. In addition, the between-group variance of the improved K-means clustering result is larger than that of the traditional K-means clustering result, indicating that the improved K-means clustering separates the groups more clearly.
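The between-group comparison can be tabulated directly from the figures reported for the four conditions. The snippet only restates those reported values; the dictionary layout and the condition labels are illustrative.

```python
# Reported between-group variance S: (traditional K-means, improved K-means).
reported_between = {
    "misalignment": (101.8, 132.57),
    "dynamic unbalance": (117.58, 124.22),
    "rubbing": (32.59, 53.926),
    "fault-free": (53.68, 66.39),
}

# Increase in between-group variance achieved by the improved method.
gains = {name: improved - traditional
         for name, (traditional, improved) in reported_between.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

In every condition the reported between-group variance of the improved result exceeds that of the traditional result.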

The clustering result obtained with the improved K-means clustering is therefore more accurate and reasonable than that obtained with traditional K-means clustering.

Claims

1. An improved K-means clustering algorithm based on hierarchical clustering, characterized in that the steps comprise:

$$S(k) = \frac{\sum_{i=1}^{k} \sum_{x \in c_{i}} |x - \bar{x}_{i}|^{2}}{\min_{i, j < k\ (i \neq j)} |\bar{x}_{i} - \bar{x}_{j}|^{2}}$$

attains its minimum, the data are divided into k classes;