CN107563399A - A feature-weighted spectral clustering method and system based on knowledge entropy - Google Patents
A feature-weighted spectral clustering method and system based on knowledge entropy
- Publication number
- CN107563399A (application CN201610514901.1A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- matrix
- data
- knowledge
- entropy
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of a data set is evaluated using the definition of knowledge entropy in rough set theory; the importance values are then assigned to the corresponding attribute features as weights, and the data points are clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the method makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data well, and has strong robustness and generalization ability.
Description
Technical field
The present invention relates to the fields of pattern recognition and machine learning, and in particular to a feature-weighted spectral clustering method and system based on knowledge entropy.
Background art
Cluster analysis is an important research topic in multivariate statistical analysis and pattern recognition. The purpose of clustering is to divide data points into several classes according to their intrinsic relationships, so that the similarity between data points within the same class is large while the similarity between data points in different classes is small [1]. Traditional clustering methods, such as the k-means algorithm and the FCM algorithm, achieve good clustering results on data sets with a convex spherical structure, but when the sample space is non-convex these algorithms easily fall into local optima.
In recent years, spectral clustering has attracted more and more attention in academia because of its good clustering performance and ease of implementation. Spectral clustering makes no assumption about the global structure of the data; it can cluster on a sample space of arbitrary shape and converges to a global optimum, which makes it particularly suitable for non-convex data sets. The idea of spectral clustering originates from spectral graph theory: the clustering problem is treated as a graph partitioning problem. Each point in the data set is taken as a vertex of a graph, and the similarity between any two points is taken as the weight of the edge connecting the two vertices, yielding an undirected weighted graph. There are many traditional graph partitioning criteria, such as minimum cut, ratio cut, normalized cut, and min-max cut. Minimizing the objective function of a graph partitioning criterion yields an optimal clustering, but the solution process usually requires an eigenvector decomposition of the graph Laplacian matrix, which gives the globally optimal solution of the objective function on a relaxed continuous domain. At present, spectral clustering has been successfully applied in fields such as speech separation, video indexing, text recognition, and image processing. Spectral clustering provides a new approach to clustering problems and can effectively handle many practical problems; its study has great scientific value and application potential.
However, when spectral clustering measures the similarity of data points, it generally assumes that every attribute of the data is equally important, with a default weight of 1. In fact, the attributes contain different amounts of information, and their contributions to clustering also differ. Moreover, real data sets often contain noise and irrelevant features, which easily cause the "curse of dimensionality", interfere with the clustering process, and degrade the accuracy of the clustering result. For example, suppose the data to be clustered have 20 attributes, of which only 2 are strongly related to the clustering, and the potential distance along these 2 attributes is the largest in the whole attribute space. In this case, if the attributes are not distinguished when computing similarity, the result is easily misleading. One effective way to overcome this problem is to add a weight parameter to each attribute so that different attributes play different roles in clustering; in terms of Euclidean space, this stretches the axes corresponding to relevant attributes and shrinks the axes corresponding to irrelevant ones.
For these reasons, various feature weighting methods have been proposed. They fall broadly into two categories: subjective weighting methods, in which the weights are obtained mainly from the empirical judgment of experts, and objective weighting methods, in which the weights are derived from the actual data of each index in the evaluated units.
Summary of the invention
To solve the above problems, the present invention proposes a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute is evaluated with concepts related to knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and a spectral clustering method is then applied to cluster the data points. The method makes full use of the information contained in each feature, eliminates the influence of redundant attributes on the clustering result, and has strong robustness and generalization ability.
The present invention is achieved by the following technical scheme.
The present invention relates to a feature-weighted spectral clustering method based on knowledge entropy, in which a weight parameter is added for each attribute so that different attributes play different roles in clustering. The weight parameters are obtained from the knowledge entropy in rough set theory and objectively reflect the ability of each attribute to partition the sample data; the weighted data are then processed with a spectral clustering method, and the clustering result is output.
The related definitions of the present invention are as follows:
Definition 1 (Knowledge). Let U be a nonempty finite set of objects, called a universe. Any subset X ⊆ U of the universe U is called a concept or category of U. Any family of subsets of U is called abstract knowledge about U, or knowledge for short.
Rough set theory mainly discusses the knowledge that forms partitions and coverings of the universe U; equivalence relations are generally used to represent classifications and knowledge.
Definition 2 (Knowledge base). Given a universe U and a family S of equivalence relations on U, the pair K = (U, S) is called a knowledge base, or approximation space, on U.
An equivalence relation on a universe represents a partition, and knowledge is what is derived from equivalence relations; a knowledge base is the collection of all knowledge derived from its equivalence relations. It embodies the classification ability of the equivalence relations over the universe and implicitly contains the various relationships among the pieces of knowledge in the knowledge base.
Definition 3 (Indiscernibility relation). Given a universe U and a family S of equivalence relations on U, if P ⊆ S and P ≠ ∅, then the intersection of all equivalence relations in P, denoted ∩P, is still an equivalence relation on U, called the indiscernibility relation on P and denoted IND(P); where no confusion arises, it is often abbreviated as P.
Here [x]_R (R ∈ P) is the set of all objects that satisfy the indiscernibility relation with object x, called the equivalence class of x determined by the equivalence relation R.
The set of all equivalence classes derived from IND(P) is denoted U/IND(P); it constitutes a partition of U, called the P-basic knowledge of the universe U. U/IND(P) may be abbreviated as U/P.
Definition 4 (Knowledge representation system). The quadruple KRS = (U, A, V, f) is called a knowledge representation system, where U is a nonempty finite set of objects, called the universe; A is a nonempty finite set of attributes comprising condition attributes C and decision attributes D, with A = C ∪ D and C ∩ D = ∅; V is the value domain of all attributes, V = ∪_{a∈A} V_a, where V_a denotes the value domain of attribute a ∈ A; and f denotes a mapping U × A → V, called the information function.
A knowledge representation system can be divided into two types: an information system (information table), in which the decision attribute set D = ∅, i.e. a knowledge representation system without decision attributes; and a decision system (decision table), in which D ≠ ∅, i.e. a knowledge representation system containing decision attributes.
Definition 5 (Information entropy). Let KRS = (U, A, V, f) be an information system and let P ⊆ A be an equivalence relation set on U whose derived partition of U is U/P = U/IND(P) = {X1, X2, …, Xn}. The information entropy of the knowledge U/P is then defined as
H(U/P) = −Σ_{i=1..n} p(Xi) log2 p(Xi),
where p(Xi) = |Xi| / |U| denotes the probability of the equivalence class Xi in U.
Theorem 1. Let KRS = (U, A, V, f) be an information system and let P, Q ⊆ A be equivalence relation sets. If the partition U/Q refines U/P (i.e. IND(Q) ⊆ IND(P)), then H(U/P) ≤ H(U/Q).
Theorem 1 shows that the finer the universe is partitioned, the larger the information entropy of the knowledge becomes. The information entropy obtained in this way is unbounded above, so it is not suitable for direct use as a weight. The present invention therefore proposes another measure of knowledge entropy; see Definition 6.
Definition 6 (Knowledge entropy). Let Z = {X1, X2, …, Xn} be a partition of the universe U. The knowledge entropy of Z is defined as
H(Z) = Σ_{i=1..n} p(Xi)(1 − p(Xi)),
where p(Xi) = |Xi| / |U|.
H(Z) describes the amount of information contained in the knowledge Z and has the following properties:
(1) 0 ≤ H(Z) < 1;
(2) H(Z) = 0 if and only if Z = {U};
(3) H(Z) attains its maximum 1 − 1/n if and only if |Xi| = |Xj| for all i, j = 1, 2, …, n.
This measure of knowledge entropy is easy to understand and cheap to compute. By property (1), if |Z| = n then H(Z) ≤ 1 − 1/n < 1, which better suits the value range required of a weight. Property (2) states that without a partition there is no uncertainty, so the knowledge entropy is 0. Property (3) shows that the knowledge entropy is largest when the partition is uniform, i.e. when the uncertainty is greatest; in that case a choice is hardest to make, and more knowledge is needed to remove the uncertainty.
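The definition and its properties can be checked numerically. The sketch below computes H(Z) = Σ p(Xi)(1 − p(Xi)) for the partition induced by a list of discrete attribute values; the function name is an assumption, not part of the patent.

```python
from collections import Counter

def knowledge_entropy(labels):
    """Knowledge entropy H(Z) = sum_i p_i * (1 - p_i) of the partition
    induced by a list of discrete attribute values (Definition 6)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Property (2): a single block ({U}) has zero entropy.
h_single = knowledge_entropy(['a'] * 8)
# Property (3): a uniform partition into n = 4 blocks attains 1 - 1/4 = 0.75.
h_uniform = knowledge_entropy(['a', 'b', 'c', 'd'] * 2)
# A skewed partition stays strictly below that maximum.
h_skewed = knowledge_entropy(['a'] * 6 + ['b', 'c'])
```

All values stay in [0, 1), which is what makes this measure convenient as a weight.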
The specific steps of the present invention are as follows:
Step 1. Based on knowledge entropy, compute the attribute importance of the input data set χ = {x1, x2, …, xn} (xi ∈ R^l), and weight each attribute of the data.
Step 1.1: Data preprocessing. Rough set methods can only handle discrete data. If the data are continuous, they must first be discretized: appropriate cut points are selected to divide the value domain of each continuous attribute into several discrete subintervals, and the attribute values in each subinterval are then represented by distinct integer values.
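As a sketch of this preprocessing step, the snippet below discretizes one continuous attribute by equal-width binning. The patent does not fix a particular discretization scheme, so the cut-point choice, bin count, and function name here are assumptions.

```python
import numpy as np

def discretize(column, n_bins=4):
    """Equal-width discretization of one continuous attribute: split its
    value range into n_bins subintervals and label each value with the
    integer index of the subinterval it falls into."""
    column = np.asarray(column, dtype=float)
    edges = np.linspace(column.min(), column.max(), n_bins + 1)
    # np.digitize against the interior cut points maps values to 0 .. n_bins-1.
    return np.digitize(column, edges[1:-1])

codes = discretize([0.1, 0.2, 0.9, 1.5, 3.8, 4.0], n_bins=4)
```

After this step every attribute is an integer code, which is all the knowledge-entropy computation needs.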
Step 1.2: Compute the knowledge entropy of each attribute. On the basis of the discretized attribute values, determine the partition of the sample data induced by the current attribute; the partition induced by attribute a_j is U/a_j = {X1, X2, …, Xm}. Its knowledge entropy is then computed according to Definition 6:
H(a_j) = Σ_{i=1..m} p(Xi)(1 − p(Xi)). (5)
Step 1.3: Determine the importance of each attribute. The knowledge entropies of all attributes are normalized, and the proportion of each attribute's knowledge entropy is taken as its importance:
w_j = H(a_j) / Σ_{k=1..l} H(a_k). (6)
Step 1.4: Weight the attributes of the input data. The attribute importances are assigned to the corresponding attribute features as weights, yielding the feature-weighted data set χ'.
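Steps 1.2 to 1.4 can be sketched as follows: compute the knowledge entropy of each (discretized) attribute, normalize as in formula (6) to obtain the importance weights, and multiply the features by those weights. The function names and the toy data are assumptions.

```python
import numpy as np
from collections import Counter

def knowledge_entropy(labels):
    """H(Z) = sum_i p_i * (1 - p_i) of the partition induced by `labels`."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def attribute_weights(discrete_data):
    """Per-attribute weights: knowledge entropy of each column, normalized
    so the weights sum to 1 (Steps 1.2-1.3, formulas (5) and (6))."""
    data = np.asarray(discrete_data)
    h = np.array([knowledge_entropy(data[:, j].tolist())
                  for j in range(data.shape[1])])
    return h / h.sum()

# Column 0 splits the samples into two even groups; column 1 is constant,
# so it carries no partitioning information and should get weight 0.
X_disc = np.array([[0, 5], [0, 5], [1, 5], [1, 5]])
w = attribute_weights(X_disc)
# Step 1.4: apply the weights to the (original, continuous) features.
X_weighted = np.array([[0.0, 5.0], [0.1, 5.0], [1.0, 5.0], [1.1, 5.0]]) * w
```

Note that the weights are estimated from the discretized copy of the data but applied to the original features.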
Step 2. Construct the similarity matrix and the Laplacian matrix, perform an eigendecomposition of the Laplacian matrix, and compute its eigenvalues and eigenvectors.
Step 2.1: On the basis of χ', establish the similarity matrix W ∈ R^{n×n} of the data points using a Gaussian kernel function:
w_ij = exp(−d(x_i, x_j)² / (2σ²)), (7)
where d(x_i, x_j) is the Euclidean distance between points x_i and x_j, and σ is a scale parameter controlling how fast the similarity w_ij decays with d(x_i, x_j).
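A minimal sketch of Step 2.1, taking formula (7) as written; the function name is an assumption.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Similarity matrix w_ij = exp(-d(x_i, x_j)^2 / (2 sigma^2)), formula (7)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared Euclidean distances
    d2 = np.maximum(d2, 0.0)                      # guard against tiny negatives
    return np.exp(-d2 / (2 * sigma ** 2))

W = gaussian_similarity([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], sigma=1.0)
```

Nearby points get similarity close to 1, distant points similarity close to 0, and the matrix is symmetric with unit diagonal.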
Step 2.2: Compute the degree matrix D ∈ R^{n×n} from the similarities of the data points. Spectral clustering regards the clustering problem as a graph partitioning problem: each point x_i ∈ χ is a vertex of the graph, and the similarity w_ij between any two points (x_i, x_j) is the weight of the edge connecting the two vertices i and j. The sum of the weights of the edges incident to vertex i is defined as the degree of vertex i, denoted d_i:
d_i = Σ_{j=1..n} w_ij. (8)
The degrees of the n data points form the matrix D ∈ R^{n×n}, a diagonal matrix whose diagonal elements are d_i and whose off-diagonal elements are 0.
Step 2.3: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2}.
Step 2.4: Compute the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, then arrange these eigenvectors as columns to form the matrix U = [u1, …, uk] ∈ R^{n×k}.
Step 2.5: Normalize each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y: y_ij = u_ij / (Σ_{r=1..k} u_ir²)^{1/2}.
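Steps 2.2 to 2.5 can be sketched with NumPy as below. Note that the standard Ng-Jordan-Weiss construction takes the eigenvectors for the k smallest eigenvalues of L_sym (equivalently, the k largest eigenvalues of D^{−1/2}WD^{−1/2}); the sketch follows that convention. Function names are assumptions.

```python
import numpy as np

def spectral_embedding(W, k):
    """Degree matrix, normalized Laplacian L_sym = D^{-1/2}(D - W)D^{-1/2},
    eigendecomposition, and row normalization (Steps 2.2-2.5). Uses the k
    eigenvectors for the smallest eigenvalues of L_sym."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                                 # degrees, formula (8)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                    # eigh: ascending eigenvalues
    U = vecs[:, :k]                                   # columns u_1, ..., u_k
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-length rows
    return Y

# Two obvious clusters: {0, 1} strongly connected, {2, 3} strongly connected.
W = np.array([[0.0, 1.0, 0.01, 0.01],
              [1.0, 0.0, 0.01, 0.01],
              [0.01, 0.01, 0.0, 1.0],
              [0.01, 0.01, 1.0, 0.0]])
Y = spectral_embedding(W, k=2)
```

In the embedded space the rows of Y for points in the same cluster coincide, while rows from different clusters are far apart, which is what makes the final k-means step easy.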
Step 3. According to the eigenvectors of the Laplacian matrix, map each data point to a low-dimensional representative point, and cluster these representative points.
Step 3.1: Regard each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm.
Step 3.2: If the i-th row of the matrix Y is assigned to the j-th class, the original data point x_i is assigned to the j-th class.
In summary, the present application provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of the data set is evaluated using the definition of knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the application makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data well, and has strong robustness and generalization ability.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature-weighted spectral clustering method based on knowledge entropy provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Embodiment 1
As shown in Fig. 1, the present embodiment comprises the following steps:
Input: data set χ = {x1, x2, …, xn} (xi ∈ R^l) and number of clusters k.
Output: a partition of the data into k classes.
Step 1: Preprocess the sample data. According to the specific problem under study, transform the attribute values of the samples into a data format suitable for rough set methods, e.g. discretize continuous attributes.
Step 2: Compute the importance of the sample attributes. Compute the knowledge entropy of each attribute using formula (5), then determine the importance of each attribute according to formula (6).
Step 3: Weight the sample attributes. Assign the attribute importances to the corresponding attribute features as weights to obtain the feature-weighted data set χ'.
Step 4: On the basis of χ', establish the similarity matrix W ∈ R^{n×n} of the data points using formula (7), and establish the degree matrix D ∈ R^{n×n} of the graph using formula (8).
Step 5: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2}.
Step 6: Compute the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, then arrange these eigenvectors as columns to form the matrix U.
Step 7: Normalize each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y.
Step 8: Regard each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm.
Step 9: If the i-th row of the matrix Y is assigned to the j-th class, assign the original data point x_i to the j-th class.
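Under simplifying assumptions (equal-width discretization, a deterministic farthest-first k-means initialization, and the k smallest eigenvectors of L_sym), Embodiment 1 can be assembled end to end as follows. This is an illustrative sketch, not the patent's reference implementation; all names are assumptions.

```python
import numpy as np

def kmeans(Y, k, iters=50):
    """Tiny k-means for Step 8 with deterministic farthest-first seeding
    (any standard k-means implementation would do)."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

def entropy_weighted_spectral_clustering(X, k, sigma=1.0, n_bins=4):
    """End-to-end sketch of Embodiment 1, Steps 1-9."""
    X = np.asarray(X, dtype=float)
    n, l = X.shape
    # Steps 1-3: discretize, weight attributes by normalized knowledge entropy.
    w = np.empty(l)
    for j in range(l):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        codes = np.digitize(X[:, j], edges[1:-1])
        _, counts = np.unique(codes, return_counts=True)
        p = counts / n
        w[j] = np.sum(p * (1 - p))          # formula (5)
    w = w / w.sum()                          # formula (6)
    Xw = X * w
    # Step 4: similarity matrix (formula (7)) and degrees (formula (8)).
    sq = (Xw ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Xw @ Xw.T, 0.0)
    W = np.exp(-d2 / (2 * sigma ** 2))
    d = W.sum(axis=1)
    # Steps 5-7: normalized Laplacian, eigenvectors, row normalization.
    dis = 1.0 / np.sqrt(d)
    L = np.eye(n) - dis[:, None] * W * dis[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Steps 8-9: cluster the rows of Y and carry the labels back to the x_i.
    return kmeans(Y, k)

labels = entropy_weighted_spectral_clustering(
    np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]]), k=2, sigma=1.0)
```

On the toy data the two groups of points end up in two different classes, as expected.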
Claims (9)
1. A feature-weighted spectral clustering method based on knowledge entropy, characterized in that the importance of each attribute of a data set is evaluated using the definition of knowledge entropy in rough set theory, the importance values are then assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method on the weighted data.
2. The method according to claim 1, characterized in that the data set is an n × l matrix in which each row represents a data point and each column represents an attribute; the matrix thus contains n data points, each with l attributes, and can be expressed as χ = {x1, x2, …, xn} (xi ∈ R^l).
3. The method according to claim 1, characterized in that the knowledge entropy refers to
H(Z) = Σ_{i=1..n} p(Xi)(1 − p(Xi)),
where Z = {X1, X2, …, Xn} is a partition of the universe U and p(Xi) = |Xi| / |U| denotes the probability of the equivalence class Xi in U.
4. The method according to claim 1, characterized in that the weighting refers to: computing the knowledge entropy of each attribute of the data set, normalizing the obtained knowledge entropies, taking the proportion of each attribute's knowledge entropy as its importance, and then multiplying each attribute of the data set by its importance to complete the weighting of the data set.
5. The method according to claim 1 or 4, characterized in that the weighting comprises:
1) data preprocessing: discretizing the continuous input data by selecting appropriate cut points to divide the value domain of each continuous attribute into several discrete subintervals, and then representing the attribute values in each subinterval by distinct integer values;
2) computing the knowledge entropy of each attribute: on the basis of the discretized attribute values, determining the partition of the sample data induced by the current attribute, the partition induced by attribute a_j being U/a_j = {X1, X2, …, Xm}, and computing its knowledge entropy H(a_j);
3) determining the importance of each attribute: normalizing the knowledge entropies of all attributes and taking the proportion of each attribute's knowledge entropy as its importance, the importance of any attribute a_j being w_j = H(a_j) / Σ_k H(a_k);
4) weighting the attributes of the input data: assigning the attribute importances to the corresponding attributes of the data set as weights, yielding the weighted data set χ'.
6. The method according to claim 1, characterized in that the spectral clustering comprises:
1) constructing the similarity matrix and the Laplacian matrix, performing an eigendecomposition of the Laplacian matrix, and computing its eigenvalues and eigenvectors;
2) mapping each data point to a low-dimensional representative point according to the eigenvectors of the Laplacian matrix, and clustering these representative points.
7. The method according to claim 1 or 6, characterized in that the eigendecomposition comprises:
1) on the basis of χ', establishing the similarity matrix W ∈ R^{n×n} of the data points using a Gaussian kernel function, computing the degree d_i = Σ_j w_ij of each data point x_i from the matrix W, and forming the diagonal degree matrix D ∈ R^{n×n} from the degrees of the n data points;
2) constructing the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2} from the similarity matrix W and the degree matrix D;
3) computing the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, and arranging these eigenvectors as columns to form the matrix U = [u1, …, uk] ∈ R^{n×k};
4) normalizing each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y.
8. The method according to claim 1 or 6, characterized in that the clustering comprises:
1) regarding each row of the matrix Y as a point in the space R^k and dividing these points into k classes using k-means or another algorithm;
2) if the i-th row of the matrix Y is assigned to the j-th class, assigning the original data point x_i to the j-th class.
9. A system for implementing the method of any one of the preceding claims, characterized by comprising an attribute weighting module, an eigendecomposition module, and a clustering module, wherein the attribute weighting module computes the knowledge entropy and importance of each attribute of the input data set and then assigns the importances to the corresponding attribute features as weights; the eigendecomposition module constructs the similarity matrix and the Laplacian matrix from the weighted data set and performs an eigendecomposition of the Laplacian matrix; and the clustering module maps each data point into the feature space formed by the eigenvectors of the Laplacian matrix, clusters the representative points in that feature space, and outputs the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610514901.1A | 2016-06-30 | 2016-06-30 | A feature-weighted spectral clustering method and system based on knowledge entropy
Publications (1)
Publication Number | Publication Date
---|---
CN107563399A | 2018-01-09
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110347930A * | 2019-07-18 | 2019-10-18 | 杭州连银科技有限公司 | Automatic processing method for high-dimensional data based on statistical analysis
CN110706092A * | 2019-09-23 | 2020-01-17 | 深圳中兴飞贷金融科技有限公司 | Risk user identification method and device, storage medium and electronic equipment
CN110706092B * | 2019-09-23 | 2021-05-18 | 前海飞算科技(深圳)有限公司 | Risk user identification method and device, storage medium and electronic equipment
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication

Application publication date: 2018-01-09