CN111428768A - Hellinger distance-Gaussian mixture model-based clustering method - Google Patents
- Publication number
- CN111428768A (application CN202010190288.9A)
- Authority
- CN
- China
- Prior art keywords
- mixture model
- gaussian mixture
- gaussian
- sample
- parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
Abstract
The invention discloses a clustering method based on a Hellinger distance-Gaussian mixture model, applied to the fields of mechanical fault diagnosis and cluster analysis. To solve the problem of low recognition accuracy on unlabeled data in the prior art, the method improves the clustering capability of the Gaussian mixture model: it introduces a minimized regularization term based on the Hellinger distance, which measures the distance between probability distributions of the Gaussian mixture model in the data manifold space and constrains the update process of the posterior probability, and it gradually updates the Gaussian mixture model parameters in combination with a generalized expectation-maximization algorithm, so that the probability that the resulting mixture model's distribution generates the given data is maximized. The method thereby realizes automatic learning and clustering of data, accurately judges the category information of unlabeled data, and provides a feasible approach for intelligent learning on big data.
Description
Technical Field
The invention belongs to the field of mechanical fault diagnosis and cluster analysis, and particularly relates to a data clustering technology based on a Hellinger distance-Gaussian mixture model.
Background
The growing scale and complexity of modern mechanical equipment make it highly susceptible to various faults during continuous operation. To ensure safe and reliable operation, potential fault risks must therefore be detected early through monitoring and diagnosis, avoiding possible accidents and the corresponding maintenance costs and maximizing the equipment's utilization. In recent years, with the emergence of the concept of "big data", data-driven intelligent fault diagnosis methods have been widely applied. Such methods do not require investigating the physical failure mechanism of the equipment; by learning the statistical rules and internal characteristics of large amounts of data, they can automatically judge the health state of the monitored equipment, providing a useful tool for online monitoring and fault diagnosis of industrial equipment.
Intelligent fault diagnosis provides an important means for online monitoring and health prediction of large-scale and rotating equipment by automatically extracting the fault information implicit in large-scale monitoring data. However, most existing diagnosis methods assume that the monitoring data contain complete examples of typical faults with clear state labels, an assumption that is difficult to satisfy in engineering practice. During actual operation, equipment cannot be frequently shut down to detect faults and label its state, so the monitored data carry little or even no label information, and the equipment state corresponding to the data is unknown. There is therefore a need for unsupervised learning methods that realize accurate intelligent diagnosis of equipment.
Unsupervised learning is a machine learning approach that discovers intrinsic regularities or structures in unlabeled data; within it, clustering groups a large amount of given data into several categories according to the similarity or distance of their features. Existing clustering methods fall into two categories: hard clustering (e.g., K-means), which assigns each sample to exactly one class, and soft clustering (e.g., Gaussian mixture models), which assigns class memberships probabilistically and can simultaneously mine the longitudinal structure (similarity) and transverse structure (dimensionality reduction) of the data, yielding more accurate clustering results. A Gaussian mixture model can fit the distribution of arbitrary data by linearly combining several Gaussian distribution functions, but existing Gaussian mixture models are affected by factors such as parameter initialization and computational complexity, and research on the algorithm and its clustering applications remains limited. The question, therefore, is how to improve the Gaussian mixture model, optimize its parameters in combination with the expectation-maximization algorithm, and improve its ability to recognize sample classes.
Disclosure of Invention
To solve the above technical problems, the invention provides a clustering method based on a Hellinger distance-Gaussian mixture model, which introduces a regularization term based on the Hellinger distance on top of maximizing the probability of the samples, constructs the internal manifold structure of the samples, and optimizes the model through a generalized expectation-maximization algorithm, thereby realizing automatic judgment of data categories. The steps are described as follows:
The data features to be classified form the sample set X = {x_i, i = 1, …, n}, which contains n samples; each sample x_i includes d-dimensional features.
S1, parameter setting and initialization.
1) Set K components in the Gaussian mixture model and initialize the Gaussian model parameters with the K-means algorithm.
2) Set the regularization coefficient λ.
3) Set the update coefficient γ, with initial value 0.9.
4) Set the number l of nearest neighbors.
5) Initialize the iteration number t to 1, i.e., t = 1.
6) Set the iteration termination value to a small value.
S2, constructing the model optimization objective function: define the objective function for Gaussian mixture model parameter optimization. In the parameter optimization process, the Hellinger distance is introduced to measure the closeness between two distributions.
The constructed Gaussian mixture model is composed of K Gaussian distributions:

P(x_i | Θ) = Σ_{k=1}^{K} π_k N_k(x_i | μ_k, Σ_k),

where Θ = (π_1, μ_1, Σ_1, …, π_K, μ_K, Σ_K) are the Gaussian mixture model parameters, μ_k and Σ_k (k = 1, …, K) are the mean and covariance of the kth Gaussian distribution, N_k(x_i | μ_k, Σ_k) is the Gaussian density of the kth component model, and π_k is the corresponding mixing coefficient, satisfying π_k ≥ 0 and Σ_{k=1}^{K} π_k = 1.
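The mixture density defined above can be evaluated directly. The following sketch is illustrative only and is not part of the patent: the two-component parameters are arbitrary toy values, and the function name `gmm_density` is our own.

```python
# Illustrative sketch: evaluate P(x | Theta) = sum_k pi_k * N_k(x | mu_k, Sigma_k)
# for made-up two-component parameters (not from the patent).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """Mixture density: weighted sum of K Gaussian component densities."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

# Two 1-D components with arbitrary toy parameters.
pis = [0.4, 0.6]
mus = [np.array([0.0]), np.array([3.0])]
covs = [np.array([[1.0]]), np.array([[1.0]])]
p = gmm_density(np.array([0.0]), pis, mus, covs)
```

At x = 0 the first component dominates, since the second component's mean lies three standard deviations away.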
To realize data clustering, the Gaussian mixture model parameters Θ are updated by iterative operations. Thus X is defined as the observed sample set and Z = {z_i, i = 1, …, n} as the set of latent variables, so that X and Z form the complete sample set. On the basis of maximizing the complete-data log-likelihood function, a regularization term is introduced to form the optimization objective function, defined as:

max_Θ { log P(X, Z | Θ) − λR }
where λ is the regularization coefficient and R is the regularization term, into which the Hellinger distance is incorporated. The Hellinger distance is typically used in order statistics and asymptotic statistics. The square of the Hellinger distance between probability distributions P_i and P_j is:

H²(P_i, P_j) = (1/2) Σ_{k=1}^{K} (√P(k | x_i) − √P(k | x_j))²,

where P(k | x_i) and P(k | x_j) are the posterior probabilities of samples x_i and x_j being generated by the kth Gaussian component, respectively. The Laplacian matrix L is L = D − W, where D is the diagonal matrix with entries D_ii = Σ_j W_ij, and T denotes transposition.
For sample x_i, its l nearest neighbors (l ≤ n − 1) can be determined from the Hellinger distance. In the nearest-neighbor graph, the weight W_ij between sample x_i and an adjacent sample x_j is defined as:

W_ij = 1 if x_i ∈ N_l(x_j) or x_j ∈ N_l(x_i), and W_ij = 0 otherwise,

where N_l(x_j) represents the set of l nearest-neighbor samples of x_j.
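The squared Hellinger distance between posterior distributions, the resulting l-nearest-neighbor weights, and the Laplacian L = D − W can be sketched as follows. This is illustrative Python: the 0/1 indicator form of W_ij is an assumed common choice (the patent's weight formula is reproduced here only implicitly), and the function names are our own.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance between two discrete distributions:
    H^2(P, Q) = 1/2 * sum_k (sqrt(p_k) - sqrt(q_k))^2, in [0, 1]."""
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def neighbor_graph(post, l):
    """0/1 l-nearest-neighbor weights from pairwise Hellinger distances,
    plus the graph Laplacian L = D - W with D_ii = sum_j W_ij.
    (The indicator form of W_ij is an assumption.)"""
    n = post.shape[0]
    dist = np.array([[hellinger_sq(post[i], post[j]) for j in range(n)]
                     for i in range(n)])
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist[i])
        W[i, order[1:l + 1]] = 1.0        # order[0] is the sample itself
    W = np.maximum(W, W.T)                # symmetrize the neighbor relation
    D = np.diag(W.sum(axis=1))
    return W, D - W
```

Each row of `post` holds one sample's posterior probabilities P(k | x_i) over the K components; each Laplacian row sums to zero by construction.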
S3, calculating the posterior probability of the samples: according to the mixture model parameters Θ^(t-1), calculate the posterior probability using the generalized expectation-maximization algorithm.

From the Gaussian mixture model parameters Θ^(t-1) obtained in the (t-1)th iteration, the posterior probability is calculated as:

P(k | x_i) = π_k^(t-1) N_k(x_i | μ_k^(t-1), Σ_k^(t-1)) / Σ_{j=1}^{K} π_j^(t-1) N_j(x_i | μ_j^(t-1), Σ_j^(t-1))
On this basis, the generalized expectation-maximization algorithm defines a Q function for the iterative computation of the model parameters, expressed as:

Q(Θ, Θ^(t-1)) = Σ_{i=1}^{n} Σ_{k=1}^{K} P(k | x_i) [log π_k + log N_k(x_i | μ_k, Σ_k)]

The objective of the iterative optimization is, separately, to maximize Q(Θ, Θ^(t-1)) and to minimize the regularization term R.
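The posterior calculation of S3 corresponds to the standard (unregularized) E-step responsibility. A minimal sketch, assuming SciPy for the Gaussian density; the demo parameters are made-up toy values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, covs):
    """E-step of S3: posterior P(k | x_i) proportional to
    pi_k * N_k(x_i | mu_k, Sigma_k), normalized over k."""
    n, K = X.shape[0], len(pis)
    R = np.empty((n, K))
    for k in range(K):
        R[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
    return R / R.sum(axis=1, keepdims=True)

# Toy demo: two well-separated components, one point at each mean.
X = np.array([[0.0, 0.0], [4.0, 4.0]])
R = e_step(X,
           pis=[0.5, 0.5],
           mus=[np.zeros(2), np.full(2, 4.0)],
           covs=[np.eye(2), np.eye(2)])
```

Each row of `R` sums to 1, and a point sitting on a component mean is assigned to that component almost certainly.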
S4, updating the model parameters: update the posterior probability and the Gaussian mixture model parameters using the generalized expectation-maximization algorithm.
First, the regularization term R is minimized; applying the Newton-Raphson method yields the updated posterior probability.
Second, maximize Q(Θ, Θ^(t-1)) to obtain the updated Gaussian mixture model parameters Θ^t:

μ_k^t = Σ_{i=1}^{n} P(k | x_i) x_i / Σ_{i=1}^{n} P(k | x_i)

Σ_k^t = Σ_{i=1}^{n} P(k | x_i)(x_i − μ_k^t)(x_i − μ_k^t)^T / Σ_{i=1}^{n} P(k | x_i)

π_k^t = Σ_{i=1}^{n} P(k | x_i) / n
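Maximizing Q(Θ, Θ^(t-1)) gives closed-form M-step updates. The sketch below implements the standard GMM M-step only (without the patent's Hellinger-regularized posterior), with made-up demo data and a hard 0/1 responsibility matrix for clarity.

```python
import numpy as np

def m_step(X, R):
    """Closed-form M-step: update (pi_k, mu_k, Sigma_k) from the posterior
    matrix R, where R[i, k] = P(k | x_i). Standard GMM updates."""
    n, d = X.shape
    Nk = R.sum(axis=0)                       # effective samples per component
    pis = Nk / n                             # pi_k^t
    mus = (R.T @ X) / Nk[:, None]            # mu_k^t
    covs = []
    for k in range(R.shape[1]):
        Xc = X - mus[k]
        covs.append((R[:, k, None] * Xc).T @ Xc / Nk[k]
                    + 1e-6 * np.eye(d))      # small jitter for invertibility
    return pis, mus, covs

# Toy demo: two points per component, hard responsibilities.
X = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
R = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pis, mus, covs = m_step(X, R)
```

With hard responsibilities the updated means reduce to per-cluster sample averages.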
S5, calculating the regularized likelihood function value.
S6, judging iteration termination:
1) If the regularized likelihood function value decreases after the update, set the update coefficient 0.9γ → γ (i.e., the current update coefficient γ is multiplied by 0.9 to give the update coefficient for the next iteration) and return to S4.
2) If the change in the regularized likelihood function value is smaller than the iteration termination value, stop the iteration, take the Gaussian mixture model parameters Θ^t, and output the posterior probabilities P(k | x_i) (i = 1, …, n; k = 1, …, K); otherwise increment the iteration number, i.e., t ← t + 1, and return to S3.
S7, data category judgment: for each sample, the label k (k = 1, …, K) of the Gaussian component with the maximum posterior probability is taken as the clustering result of that sample.
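The steps S1-S7 can be sketched end-to-end as a plain EM loop on synthetic data. This is an assumption-laden simplification: the Hellinger regularization term, the γ-damped posterior update, and the K-means initialization are all omitted, so it is ordinary maximum-likelihood GMM clustering rather than the patented HGMM.

```python
# Simplified S1-S7 loop: ordinary EM for a GMM (no Hellinger regularization).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)),   # synthetic blob near (0, 0)
               rng.normal(4.0, 0.5, size=(60, 2))])  # synthetic blob near (4, 4)

K = 2
n, d = X.shape
pis = np.full(K, 1.0 / K)
mus = X[[0, 60]].copy()          # crude init: one seed point from each blob
covs = [np.eye(d) for _ in range(K)]

prev_ll = -np.inf
for t in range(200):
    # E-step (S3): posterior P(k | x_i)
    dens = np.column_stack(
        [pis[k] * multivariate_normal.pdf(X, mus[k], covs[k]) for k in range(K)])
    ll = np.log(dens.sum(axis=1)).sum()      # log-likelihood (S5 analogue)
    R = dens / dens.sum(axis=1, keepdims=True)
    # M-step (S4): closed-form parameter updates
    Nk = R.sum(axis=0)
    pis = Nk / n
    mus = (R.T @ X) / Nk[:, None]
    covs = [((R[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
            + 1e-6 * np.eye(d) for k in range(K)]
    if ll - prev_ll < 1e-5:                  # termination check (S6)
        break
    prev_ll = ll

labels = R.argmax(axis=1)                    # S7: argmax-posterior assignment
```

On well-separated blobs the loop converges in a handful of iterations and recovers the two generating clusters.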
The beneficial effects of the invention are as follows: the invention provides a clustering method based on a Hellinger distance-Gaussian mixture model. A data clustering algorithm is constructed using the unsupervised learning capability of the Gaussian mixture model, in which each cluster is determined by one Gaussian distribution and each data point arises from the combined action of several probabilistic clusters. Neighboring samples on the data manifold structure are defined through the Hellinger distance and the regularization term, while the Gaussian distribution parameters and coefficients of the mixture model are gradually updated in combination with the generalized expectation-maximization algorithm, so that the probability that the distribution determined by the mixture model generates the given data is maximized. This realizes automatic learning and clustering of data and accurately judges the category information of unlabeled data. The method not only extends probabilistic clustering algorithms but also improves the ability to mine the latent structure of unlabeled data, and can be applied to intelligent diagnosis of industrial data.
Drawings
FIG. 1 is a flow chart of the clustering method based on Hellinger distance-Gaussian mixture model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a flow chart of the clustering method based on the Hellinger distance-gaussian mixture model according to the present invention.
The Iris data set is used in this embodiment to illustrate the implementation results. The data set, a classic data set for multivariate analysis, contains 150 samples, each with 4 attributes (features): sepal length and width and petal length and width. The data belong to 3 categories, Iris setosa, Iris versicolor, and Iris virginica, with 50 samples per category.
The parameter settings and initial value settings of the gaussian mixture model are as follows:
1) The number of Gaussian components is K = 3, and the model parameters are initialized with the K-means algorithm.
2) The regularization coefficient is λ = 0.1.
3) The initial value of the update coefficient is γ = 0.9.
4) The number of nearest neighbors is l = 2.
5) The iteration termination value is 10⁻⁵.
According to the method, the samples are input into the Gaussian mixture model, and the model parameter values are updated by iterative operations until the termination condition is met. For each sample to be classified, the model outputs the posterior probability values computed by the 1st, 2nd, and 3rd Gaussian components at iteration termination, and the label of the Gaussian component with the maximum value is taken as the sample's category information. For example, for sample x_1, the posterior probability values output by the 1st, 2nd, and 3rd Gaussian components of the mixture model are (2.66×10⁻⁴⁰, 7.98×10⁻²⁸, 1), so its cluster label is (0, 0, 1); the true class of this sample is Iris setosa, with true label (1, 0, 0). The correspondence between cluster labels and true labels over all samples can be determined with the Kuhn-Munkres algorithm; the cluster obtained by the 3rd Gaussian component corresponds to the 1st class (Iris setosa), so the classification result for sample x_1 is correct.
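For comparison, the conventional-GMM baseline of this embodiment can be approximated with scikit-learn's `GaussianMixture` on the same Iris data. This is a baseline sketch only: it implements the standard model without the patent's Hellinger regularization, so its results need not match the patented method.

```python
# Conventional-GMM baseline on Iris (not the patented HGMM).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)              # 150 samples, 4 features, 3 classes
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      init_params='kmeans',    # K-means initialization, as in S1
                      n_init=5, random_state=0)
labels = gmm.fit_predict(X)                    # argmax_k P(k | x_i), as in S7
post = gmm.predict_proba(X)                    # posterior P(k | x_i) per sample
```

The `post` matrix plays the role of the per-sample posterior probabilities described above; each of its rows sums to 1.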
With the above method, the accuracy of the cluster analysis can be checked against the class information given by the Iris data set, i.e., the recognition accuracy, computed as:

Accuracy = (number of correctly clustered samples / total number of samples) × 100%
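The Kuhn-Munkres matching of cluster labels to true labels, followed by the accuracy computation, can be sketched as follows. This is illustrative Python using SciPy's Hungarian-algorithm implementation; the tiny label arrays are made-up examples, and the function name is our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true, pred):
    """Best label-permutation accuracy via the Kuhn-Munkres (Hungarian)
    algorithm: match each cluster to a true class, then count hits."""
    K = max(true.max(), pred.max()) + 1
    C = np.zeros((K, K), dtype=int)         # C[c, t]: cluster c hits true class t
    for t, p in zip(true, pred):
        C[p, t] += 1
    rows, cols = linear_sum_assignment(-C)  # negate to maximize matched counts
    return C[rows, cols].sum() / true.size

true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([2, 2, 0, 0, 1, 1])         # same partition, permuted labels
acc = clustering_accuracy(true, pred)       # a perfect (relabelled) clustering
```

A clustering that is correct up to label permutation, as in this toy example, scores 100%.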
the results of the recognition accuracy of the two clustering models with different sample characteristics are compared in table 1. One of the models is a clustering method based on a Hellinger distance-Gaussian mixture model, which is provided by the invention and is abbreviated as HGMM; another approach is the conventional gaussian mixture model (no regularization term based on Hellinger distance is introduced). Meanwhile, considering the influence of the sensitivity and the correlation of the sample features on the cluster analysis, the second row in table 1 lists the cluster analysis results of all the sample features (4 features) selected, and the third row and the fourth row respectively list the highest and lowest analysis results of some features (3 features selected).
TABLE 1 comparison of classification correctness of two clustering models using different sample characteristics
Sample features | HGMM | GMM
---|---|---
1,2,3,4 | 98% | 77%
1,3,4 | 97% | 61%
1,2,4 | 88% | 61%
Note: HGMM: Hellinger distance-Gaussian mixture model
GMM: conventional Gaussian mixture model
"1": attribute "sepal length" (unit: cm)
"2": attribute "sepal width" (unit: cm)
"3": attribute "petal length" (unit: cm)
"4": attribute "petal width" (unit: cm)
The table shows that the improved method proposed by the invention significantly enhances the clustering capability of the Gaussian mixture model, with recognition accuracy on the unlabeled data of at most 98% and at least 88%. Moreover, the proposed method achieves high recognition accuracy without reducing the multidimensional features, yields an intelligent classification model through unsupervised learning, and can be further extended to unsupervised learning on other data.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.
Claims (6)
1. The data clustering method based on the Hellinger distance-Gaussian mixture model is characterized by comprising the following steps of:
s1, parameter setting and initialization: set the initial values of the Gaussian mixture model parameters and the initial and set values of the related parameters, wherein the initial Gaussian mixture model parameters comprise: the number K of Gaussian distributions in the mixture model and the initial value of each Gaussian distribution's parameters, namely the mean μ_k^0 and covariance Σ_k^0 and the corresponding mixing coefficient π_k^0, satisfying Σ_{k=1}^{K} π_k^0 = 1, so that the initial Gaussian mixture model parameter is Θ^0 = (π_1^0, μ_1^0, Σ_1^0, …, π_K^0, μ_K^0, Σ_K^0); set the other initial and set values, namely the regularization coefficient λ, the initial value of the update coefficient γ, the neighbor number l, and the iteration termination value; and initialize the iteration number t to 1, i.e., t = 1;
s2, constructing a model optimization objective function: defining an objective function for optimizing parameters of the Gaussian mixture model, and introducing a regularization term to update the parameters of the Gaussian mixture model, wherein the approximation degree between two Gaussian distributions is calculated by using a Hellinger distance;
s3, calculating the posterior probability of the sample: calculating the posterior probability of the sample according to the Gaussian mixture model parameters obtained in the previous iteration;
s4, updating Gaussian mixture model parameters: updating the posterior probability and the Gaussian mixture model parameters by adopting a generalized expectation-maximization algorithm;
s5, calculating a regularization likelihood function value;
s6, judging iteration termination: comparing regularization likelihood function values before and after updating of the Gaussian mixture model parameters, and continuing the iteration process of the steps S3-S5 until an iteration termination condition is met;
s7, data type judgment: and for each sample, taking a Gaussian component label corresponding to the maximum posterior probability as the clustering result of the sample.
2. The Hellinger distance-gaussian mixture model-based clustering method according to claim 1, wherein the step S2 is implemented by:
the Gaussian mixture model to be optimized is composed of K Gaussian distributions:

P(x_i | Θ) = Σ_{k=1}^{K} π_k N_k(x_i | μ_k, Σ_k),

where Θ = (π_1, μ_1, Σ_1, …, π_K, μ_K, Σ_K) are the Gaussian mixture model parameters, μ_k and Σ_k are the mean and covariance of the kth Gaussian distribution, N_k(x_i | μ_k, Σ_k) is the corresponding Gaussian density, π_k is the corresponding mixing coefficient satisfying Σ_{k=1}^{K} π_k = 1, and x_i represents one sample in the sample set X, i = 1, …, n, each sample x_i including d-dimensional features;
to achieve data clustering, the Gaussian mixture model parameters Θ are updated by iterative operations, so X is defined as the observed sample set and Z = {z_i, i = 1, …, n} as the set of latent variables, X and Z forming the complete sample set; on the basis of maximizing the log-likelihood function of the complete sample set, a regularization term is introduced to form the optimization objective function, defined as:

max_Θ { log P(X, Z | Θ) − λR }
where λ is the regularization coefficient and R is the regularization term, into which the Hellinger distance is introduced; the square of the Hellinger distance H(P_i, P_j) between probability distributions P_i and P_j is:

H²(P_i, P_j) = (1/2) Σ_{k=1}^{K} (√P(k | x_i) − √P(k | x_j))²,

where P(k | x_i) and P(k | x_j) are the posterior probabilities of samples x_i and x_j being generated by the kth Gaussian component, respectively; the Laplacian matrix L may be represented as L = D − W, where D is the diagonal matrix with entries D_ii = Σ_j W_ij, and T represents transposition;
for sample x_i, its l nearest neighbors (l ≤ n − 1) can be determined from the Hellinger distance; in the nearest-neighbor graph, the weight W_ij between sample x_i and an adjacent sample x_j is defined as:

W_ij = 1 if x_i ∈ N_l(x_j) or x_j ∈ N_l(x_i), and W_ij = 0 otherwise, where N_l(x_j) represents the set of l nearest-neighbor samples of x_j.
3. The Hellinger distance-gaussian mixture model-based clustering method according to claim 1, wherein the step S3 is implemented by:
from the Gaussian mixture model parameters Θ^(t-1) obtained in the (t-1)th iteration, the posterior probability is calculated as:

P(k | x_i) = π_k^(t-1) N_k(x_i | μ_k^(t-1), Σ_k^(t-1)) / Σ_{j=1}^{K} π_j^(t-1) N_j(x_i | μ_j^(t-1), Σ_j^(t-1))

on this basis, the generalized expectation-maximization algorithm defines a Q function for the iterative computation of the model parameters, expressed as:

Q(Θ, Θ^(t-1)) = Σ_{i=1}^{n} Σ_{k=1}^{K} P(k | x_i) [log π_k + log N_k(x_i | μ_k, Σ_k)]

the objective of the iterative optimization is, separately, to maximize Q(Θ, Θ^(t-1)) and to minimize the regularization term R.
4. The Hellinger distance-Gaussian mixture model-based clustering method according to claim 1, wherein in step S4 the posterior probability and the Gaussian mixture model parameters are respectively updated using the generalized expectation-maximization algorithm, implemented in two steps:
first, the regularization term R is minimized, and the Newton-Raphson method is applied to obtain the updated posterior probability;
second, Q(Θ, Θ^(t-1)) is maximized to obtain the updated Gaussian mixture model parameters Θ^t:

μ_k^t = Σ_{i=1}^{n} P(k | x_i) x_i / Σ_{i=1}^{n} P(k | x_i), Σ_k^t = Σ_{i=1}^{n} P(k | x_i)(x_i − μ_k^t)(x_i − μ_k^t)^T / Σ_{i=1}^{n} P(k | x_i), π_k^t = Σ_{i=1}^{n} P(k | x_i) / n.
6. The Hellinger distance-Gaussian mixture model-based clustering method according to claim 1, wherein step S6 specifically comprises: comparing the regularized likelihood function values calculated in step S5 to determine the flow of the procedure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190288.9A CN111428768A (en) | 2020-03-18 | 2020-03-18 | Hellinger distance-Gaussian mixture model-based clustering method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190288.9A CN111428768A (en) | 2020-03-18 | 2020-03-18 | Hellinger distance-Gaussian mixture model-based clustering method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111428768A true CN111428768A (en) | 2020-07-17 |
Family
ID=71546416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010190288.9A Pending CN111428768A (en) | 2020-03-18 | 2020-03-18 | Hellinger distance-Gaussian mixture model-based clustering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111428768A (en) |
- 2020-03-18: application CN202010190288.9A filed in CN; patent CN111428768A, status Pending
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111898954B (en) * | 2020-07-31 | 2024-01-12 | 沙师弟(重庆)网络科技有限公司 | Vehicle matching method based on improved Gaussian mixture model clustering |
CN111898954A (en) * | 2020-07-31 | 2020-11-06 | 沙师弟(重庆)网络科技有限公司 | Vehicle matching method based on improved Gaussian mixture model clustering |
CN112428263A (en) * | 2020-10-16 | 2021-03-02 | 北京理工大学 | Mechanical arm control method and device and cluster model training method |
CN113243891A (en) * | 2021-05-31 | 2021-08-13 | 平安科技(深圳)有限公司 | Mild cognitive impairment recognition method and device, computer equipment and storage medium |
CN113312851A (en) * | 2021-06-16 | 2021-08-27 | 华电山东新能源有限公司 | Early warning method for temperature abnormity of main bearing of wind driven generator |
CN113569910A (en) * | 2021-06-25 | 2021-10-29 | 石化盈科信息技术有限责任公司 | Account type identification method and device, computer equipment and storage medium |
CN113889192A (en) * | 2021-09-29 | 2022-01-04 | 西安热工研究院有限公司 | Single cell RNA-seq data clustering method based on deep noise reduction self-encoder |
CN113889192B (en) * | 2021-09-29 | 2024-02-27 | 西安热工研究院有限公司 | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder |
CN114139621A (en) * | 2021-11-29 | 2022-03-04 | 国家电网有限公司大数据中心 | Method, device, equipment and storage medium for determining model classification performance identification |
CN116400426B (en) * | 2023-06-06 | 2023-08-29 | 山东省煤田地质局第三勘探队 | Electromagnetic method-based data survey system |
CN116400426A (en) * | 2023-06-06 | 2023-07-07 | 山东省煤田地质局第三勘探队 | Electromagnetic method-based data survey system |
CN117077535A (en) * | 2023-08-31 | 2023-11-17 | 广东电白建设集团有限公司 | High formwork construction monitoring method based on Gaussian mixture clustering algorithm |
CN117727373A (en) * | 2023-12-01 | 2024-03-19 | 海南大学 | Sample and feature double weighting-based intelligent C-means clustering method for feature reduction |
CN117727373B (en) * | 2023-12-01 | 2024-05-31 | 海南大学 | Sample and feature double weighting-based intelligent C-means clustering method for feature reduction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111428768A (en) | Hellinger distance-Gaussian mixture model-based clustering method | |
Yu et al. | Online fault diagnosis in industrial processes using multimodel exponential discriminant analysis algorithm | |
CN111368920B (en) | Quantum twin neural network-based classification method and face recognition method thereof | |
CN107203785A (en) | Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm | |
CN113326731A (en) | Cross-domain pedestrian re-identification algorithm based on momentum network guidance | |
CN111027636B (en) | Unsupervised feature selection method and system based on multi-label learning | |
CN115131618A (en) | Semi-supervised image classification method based on causal reasoning | |
Ververidis et al. | Information loss of the mahalanobis distance in high dimensions: Application to feature selection | |
CN105160598B (en) | Power grid service classification method based on improved EM algorithm | |
CN113222072A (en) | Lung X-ray image classification method based on K-means clustering and GAN | |
CN108921853A (en) | Image partition method based on super-pixel and clustering of immunity sparse spectrums | |
CN117131449A (en) | Data management-oriented anomaly identification method and system with propagation learning capability | |
CN105893956A (en) | Online target matching method based on multi-feature adaptive measure learning | |
CN115393631A (en) | Hyperspectral image classification method based on Bayesian layer graph convolution neural network | |
CN110175631A (en) | A kind of multiple view clustering method based on common Learning Subspaces structure and cluster oriental matrix | |
Sadeghi et al. | Deep clustering with self-supervision using pairwise data similarities | |
CN112465016A (en) | Partial multi-mark learning method based on optimal distance between two adjacent marks | |
CN115344693B (en) | Clustering method based on fusion of traditional algorithm and neural network algorithm | |
Nguyen et al. | Robust product classification with instance-dependent noise | |
CN115510964B (en) | Computer calculation method for liquid chromatograph scientific instrument | |
CN112257787B (en) | Image semi-supervised classification method based on generation type dual-condition confrontation network structure | |
CN114881172A (en) | Software vulnerability automatic classification method based on weighted word vector and neural network | |
WO2021128521A1 (en) | Automatic industry classification method and system | |
Li et al. | A BYY scale-incremental EM algorithm for Gaussian mixture learning | |
CN112906751A (en) | Method for identifying abnormal value through unsupervised learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |