CN107563399A - Feature-weighted spectral clustering method and system based on knowledge entropy - Google Patents

Feature-weighted spectral clustering method and system based on knowledge entropy

Info

Publication number
CN107563399A
Authority
CN
China
Prior art keywords
attribute
matrix
data
knowledge
entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610514901.1A
Other languages
Chinese (zh)
Inventor
丁世飞 (Ding Shifei)
贾洪杰 (Jia Hongjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN201610514901.1A priority Critical patent/CN107563399A/en
Publication of CN107563399A publication Critical patent/CN107563399A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of a data set is assessed using the definition of knowledge entropy in rough set theory, the importance values are then assigned to the corresponding attribute features as weights, and the data points are clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the method makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data better, and has stronger robustness and generalization ability.

Description

Feature-weighted spectral clustering method and system based on knowledge entropy
Technical field
The present invention relates to the fields of pattern recognition and machine learning, and in particular to a feature-weighted spectral clustering method and system based on knowledge entropy.
Background Art
Cluster analysis is an important research topic in multivariate statistical analysis and pattern recognition. The purpose of clustering is to divide data points into several classes according to their intrinsic relationships, so that the similarity between data points within the same class is large while the similarity between data points of different classes is small [1]. Traditional clustering methods, such as the k-means algorithm and the FCM algorithm, achieve good clustering results on data sets with convex spherical structure, but when the sample space is non-convex these algorithms easily fall into local optima.
In recent years, spectral clustering methods have attracted more and more attention in the scientific community owing to their good clustering performance and ease of implementation. Spectral clustering makes no assumption about the global structure of the data; it can cluster on sample spaces of arbitrary shape and converges to the global optimum, which makes it particularly suitable for non-convex data sets. The idea of spectral clustering originates from spectral graph theory, which treats the data clustering problem as a graph partition problem: each point in the data set is a vertex of a graph, and the similarity between any two points is the weight of the edge connecting the two vertices, thus constructing an undirected weighted graph. There are many traditional graph partition criteria, such as minimum cut, ratio cut, normalized cut and min-max cut, and minimizing the objective function of a graph partition criterion yields the optimal clustering. In the solution process, however, an eigenvector decomposition of the Laplacian matrix of the graph is usually required in order to obtain the globally optimal solution of the objective function on a relaxed continuous domain. At present, spectral clustering has been successfully applied in fields such as speech separation, video indexing, text recognition and image processing. Spectral clustering provides a new way of thinking about clustering problems and can effectively handle many practical problems; its study has great scientific value and application potential.
However, when spectral clustering measures the similarity of data points, it usually assumes that every attribute of the data is equally important, with a default weight of 1. In fact, the information contained in each attribute differs, and so does its contribution to clustering. Moreover, real data sets often contain noise and irrelevant features, which easily cause the "curse of dimensionality", interfere with the clustering process, and affect the accuracy of the clustering results. For example, suppose the data to be clustered have 20 attributes, of which only 2 are most relevant to clustering, and these 2 attributes are potentially the farthest apart in the whole attribute space; in this case, treating all attributes alike when computing similarity is easily misleading. One effective way to overcome this problem is to add a weight parameter for each attribute so that different attributes play different roles in clustering; in terms of Euclidean space, this amounts to stretching the axes corresponding to relevant attributes and shrinking the axes corresponding to irrelevant attributes, as illustrated by the sketch below.
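As a minimal illustration of this idea (not taken from the patent text; the function name and numbers are hypothetical), per-attribute weights simply rescale the axes before the Euclidean distance is computed:

import numpy as np

def weighted_euclidean(x, y, w):
    # Euclidean distance after each attribute has been scaled by its weight.
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    return float(np.sqrt(np.sum((w * (x - y)) ** 2)))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(weighted_euclidean(x, y, [1.0, 1.0]))  # equal weights: ~1.414
print(weighted_euclidean(x, y, [1.0, 0.1]))  # irrelevant 2nd attribute shrunk: ~1.005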
For these reasons, various feature weighting methods have been proposed. They fall broadly into two categories: subjective weighting methods, in which the weights are obtained mainly from experts' subjective judgement based on experience; and objective weighting methods, in which the weights are derived from the actual data of each index in the evaluated units.
Summary of the Invention
In order to solve the above problems, the present invention proposes a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute is assessed using concepts related to knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and a spectral clustering method is then applied to cluster the data points. The method can make full use of the information contained in each feature, eliminates the influence of redundant attributes on the clustering results, and has strong robustness and generalization ability.
The present invention is achieved through the following technical scheme:
The present invention relates to a feature-weighted spectral clustering method based on knowledge entropy, which adds a weight parameter for each attribute so that different attributes play different roles in clustering. The weight parameters are obtained from the knowledge entropy in rough set theory and objectively reflect the ability of each attribute to partition the sample data; the weighted data are then processed with a spectral clustering method and the clustering result is output.
The related definitions of the present invention are as follows:
Definition 1 (knowledge): Let U be a non-empty finite set of objects, called a domain (universe). Any subset X ⊆ U of the domain U is called a concept or category of U. Any cluster of subsets of the domain U is called abstract knowledge about U, or knowledge for short.
Rough set theory mainly discusses the knowledge that can form partitions and coverings of the domain U; equivalence relations are generally used to represent classifications and knowledge.
Definition 2 (knowledge base): Given a domain U and a cluster S of equivalence relations on U, the two-tuple K = (U, S) is called a knowledge base or approximation space on the domain U.
An equivalence relation on the domain represents a partition, and knowledge is precisely the various knowledge derived from equivalence relations. The knowledge base embodies the classification ability of the equivalence relations over the domain, and it also implicitly contains the various relationships that exist among the pieces of knowledge in the knowledge base.
Definition 3 (indiscernibility relation): Given a domain U and a cluster S of equivalence relations on U, if P ⊆ S and P ≠ Ø, then the intersection of all equivalence relations in P (∩P) is still an equivalence relation on the domain U, called the indiscernibility relation on P and denoted IND(P); where no confusion arises it is often abbreviated as P. It can be written as IND(P) = {(x, y) ∈ U × U | ∀R ∈ P, [x]_R = [y]_R}, where [x]_R (R ∈ P) is the set of all objects that are indiscernible from object x, i.e. the equivalence class of object x determined by the equivalence relation R.
The set of all equivalence classes derived from IND(P) is denoted U/IND(P); it constitutes a partition of U, called the P-basic knowledge of the domain U. U/IND(P) can be abbreviated as U/P.
Definition 4 (knowledge representation system): The four-tuple KRS = (U, A, V, f) is called a knowledge representation system, where U is a non-empty finite set of objects, called the domain; A is a non-empty finite set of attributes, consisting of condition attributes C and decision attributes D with A = C ∪ D and C ∩ D = Ø; V is the value domain of all attributes, V = ∪_{a∈A} V_a, where V_a denotes the value domain of attribute a ∈ A; and f denotes a mapping U × A → V, called the information function.
Knowledge representation systems can be divided into two types: information systems (information tables), in which the decision attribute set D = Ø, i.e. knowledge representation systems containing no decision attributes; and decision systems (decision tables), in which D ≠ Ø, i.e. knowledge representation systems containing decision attributes.
Definition 5 (information entropy): Let KRS = (U, A, V, f) be an information system and P a set of equivalence relations on U whose induced partition of U is U/P = U/IND(P) = {X_1, X_2, …, X_n}. The information entropy of the knowledge U/P is defined as
H(U/P) = −Σ_{i=1}^{n} p(X_i) log_2 p(X_i),
where p(X_i) = |X_i|/|U| denotes the probability of the equivalence class X_i in U.
Theorem 1: Let KRS = (U, A, V, f) be an information system and P, Q two sets of equivalence relations on U. If P ⊆ Q, then H(U/P) ≤ H(U/Q).
Theorem 1 shows that the finer the domain is partitioned, the larger the information entropy of the knowledge. The information entropy obtained in this way has a value range of (1, +∞) and is therefore not suitable for use as a weight. The present invention accordingly proposes another measure of knowledge entropy, given in Definition 6.
Definition 6 (knowledge entropy): Let Z = {X_1, X_2, …, X_n} be a partition of the domain U. The knowledge entropy of Z is defined as
H(Z) = Σ_{i=1}^{n} p(X_i)(1 − p(X_i)),
where p(X_i) = |X_i|/|U|.
H(Z) describes the amount of information contained in the knowledge Z, and has the following properties:
(1) 0 ≤ H(Z) ≤ 1 − 1/n;
(2) H(Z) = 0 if and only if Z = {U};
(3) H(Z) reaches its maximum value 1 − 1/n if and only if |X_i| = |X_j| (i, j = 1, 2, …, n).
The above measure of knowledge entropy is easy to understand and simple to compute. From property (1), if |Z| = n then 0 ≤ H(Z) ≤ 1 − 1/n < 1, which better satisfies the value requirement for a weight. Property (2) states that if the domain is not partitioned there is no uncertainty, so the knowledge entropy is 0. Property (3) shows that the knowledge entropy is largest when the partition is uniform, i.e. the uncertainty is greatest; in this case it is hardest to make a choice and more knowledge is needed to remove the uncertainty.
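By way of illustration only, a minimal Python sketch of this knowledge entropy measure (as reconstructed above) follows; the function name knowledge_entropy and the toy partitions are illustrative assumptions, not part of the patent.

import numpy as np

def knowledge_entropy(labels):
    # H(Z) = sum_i p(X_i) * (1 - p(X_i)), with p(X_i) = |X_i| / |U|,
    # for the partition Z induced by the discrete labels.
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(np.sum(p * (1.0 - p)))

# Property checks on a toy domain U of 6 objects:
print(knowledge_entropy([0, 0, 0, 0, 0, 0]))  # Z = {U}        -> 0.0         (property 2)
print(knowledge_entropy([0, 0, 1, 1, 2, 2]))  # uniform, n = 3 -> 1 - 1/3     (property 3)
print(knowledge_entropy([0, 0, 0, 0, 1, 2]))  # non-uniform    -> 0.5 < 2/3   (property 1)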
The specific steps of the present invention are as follows:
Step 1: Based on knowledge entropy, compute the importance of each attribute of the input data set χ = {x_1, x_2, …, x_n} (x_i ∈ R^l), and weight each attribute of the data accordingly.
Step 1.1: Data pre-processing. Rough set methods can only handle discrete data, so if the data are continuous they must first be discretized: appropriate cut points are selected to divide the value domain of each continuous attribute into several discrete sub-intervals, and the attribute values in each sub-interval are then represented by different integer values.
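A minimal Python sketch of such a discretization, assuming equal-width binning (the patent does not fix a particular discretization scheme; the helper name discretize_equal_width and the default bin count are illustrative):

import numpy as np

def discretize_equal_width(X, n_bins=5):
    # Map each continuous attribute (column of X) to integer labels 0..n_bins-1
    # by splitting its value range into n_bins equal-width sub-intervals.
    # Assumes every column has a non-degenerate value range.
    X = np.asarray(X, dtype=float)
    X_disc = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        X_disc[:, j] = np.digitize(X[:, j], edges[1:-1])  # interior cut points only
    return X_disc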
Step 1.2: Compute the knowledge entropy of each attribute. On the basis of the discretized attribute values, determine the partition of the sample data induced by the current attribute; the partition induced by attribute a_j is denoted U/a_j = {X_1, X_2, …, X_m}. Its knowledge entropy is then computed according to Definition 6:
H(a_j) = Σ_{i=1}^{m} p(X_i)(1 − p(X_i)), with p(X_i) = |X_i|/|U|   (5)
Step 1.3: Determine the importance of each attribute. The knowledge entropies of all attributes are normalized, and the proportion accounted for by each knowledge entropy is taken as the importance of the corresponding attribute:
w_j = H(a_j) / Σ_{t=1}^{l} H(a_t)   (6)
Step 1.4: Weight the attributes of the input data. The attribute importances are assigned to the corresponding attribute features as weights, i.e. each attribute value is multiplied by the importance of that attribute, yielding the feature-weighted data set χ'.
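Steps 1.2 to 1.4 could be sketched as follows, reusing the hypothetical knowledge_entropy and discretize_equal_width helpers from the sketches above; the function names are assumptions and the formulas follow the reconstructions of formulas (5) and (6):

import numpy as np

def attribute_weights(X_disc):
    # Knowledge entropy of every attribute (formula (5)), normalised so that the
    # proportions serve as attribute importances (formula (6)).
    entropies = np.array([knowledge_entropy(X_disc[:, j]) for j in range(X_disc.shape[1])])
    return entropies / entropies.sum()

def weight_features(X, weights):
    # Multiply each attribute by its importance to obtain the weighted data set χ'.
    return np.asarray(X, dtype=float) * weights

# Hypothetical usage:
# X_disc = discretize_equal_width(X)
# w = attribute_weights(X_disc)
# X_weighted = weight_features(X, w)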
Step 2: Construct the similarity matrix and the Laplacian matrix, perform an eigendecomposition of the Laplacian matrix, and compute its eigenvalues and eigenvectors.
Step 2.1: On the basis of χ', establish the similarity matrix W ∈ R^(n×n) of the data points using a Gaussian kernel function, as shown in formula (7):
w_ij = exp(−d²(x_i, x_j) / (2σ²))   (7)
where d(x_i, x_j) is the Euclidean distance between points x_i and x_j, and σ is a scale parameter controlling how fast the similarity w_ij decays with d(x_i, x_j).
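A sketch of step 2.1 under the same assumptions; σ is left as a free parameter whose default value here is only illustrative:

import numpy as np

def similarity_matrix(X_weighted, sigma=1.0):
    # W_ij = exp(-d(x_i, x_j)^2 / (2 * sigma^2)) on the weighted data (formula (7)).
    diff = X_weighted[:, None, :] - X_weighted[None, :, :]
    dist_sq = np.sum(diff ** 2, axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))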
Step 2.2: Compute the degree matrix D ∈ R^(n×n) from the similarities of the data points. Spectral clustering treats the data clustering problem as a graph partition problem: each point x_i ∈ χ is a vertex of the graph, and the similarity w_ij between any two points (x_i, x_j) is the weight of the edge connecting the two vertices (i, j). The degree of vertex i is defined as the sum of the weights of the edges connected to it, denoted d_i:
d_i = Σ_{j=1}^{n} w_ij   (8)
The degrees of the n data points form the matrix D ∈ R^(n×n), which is a diagonal matrix: the elements on the diagonal are d_i and the off-diagonal elements are 0.
Step 2.3: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^(-1/2)(D − W)D^(-1/2).
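Steps 2.2 and 2.3 in the same sketch (degrees from formula (8), then L_sym as given above); the function name is an assumption:

import numpy as np

def normalized_laplacian(W):
    # d_i = sum_j w_ij (formula (8)); L_sym = D^(-1/2) (D - W) D^(-1/2).
    d = W.sum(axis=1)
    D = np.diag(d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # assumes every degree is positive
    return D_inv_sqrt @ (D - W) @ D_inv_sqrt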
Step 2.4: Compute the eigenvectors u_1, …, u_k corresponding to the k largest eigenvalues of the matrix L_sym, and arrange these eigenvectors as columns to form the matrix U = [u_1, u_2, …, u_k] ∈ R^(n×k).
Step 2.5: Normalize each row of U so that every row vector becomes a unit vector, obtaining the matrix Y with Y_ij = U_ij / (Σ_{t} U_it²)^(1/2).
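Steps 2.4 and 2.5 as a sketch; following the patent text, the eigenvectors associated with the k largest eigenvalues of L_sym are selected (numpy's eigh returns eigenvalues in ascending order):

import numpy as np

def spectral_embedding(L_sym, k):
    # Columns of U are the eigenvectors of the k largest eigenvalues of L_sym;
    # each row is then normalised to unit length to give Y.
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # ascending order
    U = eigvecs[:, -k:]
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.where(norms == 0, 1.0, norms)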
Step 3: According to the eigenvectors of the Laplacian matrix, map each data point to a low-dimensional representative point, and cluster these representative points.
Step 3.1: Treat each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm;
Step 3.2: If the i-th row of the matrix Y is assigned to the j-th class, the original data point x_i is assigned to the j-th class.
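Steps 3.1 and 3.2 as a sketch using scikit-learn's KMeans (any other algorithm that clusters the rows of Y could be substituted); the random_state is only for reproducibility:

from sklearn.cluster import KMeans

def cluster_embedding(Y, k, random_state=0):
    # Cluster the rows of Y into k classes; the label of row i is the label of x_i.
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Y)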
In summary, the present application provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of the data set is assessed using the definition of knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the present application makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data better, and has stronger robustness and generalization ability.
Brief description of the drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a feature-weighted spectral clustering method based on knowledge entropy provided by an embodiment of the present application.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Embodiment 1
As shown in Fig. 1, the present embodiment comprises the following steps:
Input: data set χ = {x_1, x_2, …, x_n} (x_i ∈ R^l), number of clusters k.
Output: k partitioned data classes.
Step 1: Pre-process the sample data. According to the specific problem under study, transform the attribute values of the samples into a data format suitable for rough set methods, e.g. discretize continuous attributes.
Step 2: Compute the importance of the sample attributes. Compute the knowledge entropy of each attribute using formula (5), and then determine the importance of each attribute according to formula (6).
Step 3: Weight the sample attributes. Assign the attribute importances to the corresponding attribute features as weights to obtain the feature-weighted data set χ'.
Step 4: On the basis of χ', establish the similarity matrix W ∈ R^(n×n) of the data points using formula (7), and establish the degree matrix D ∈ R^(n×n) of the graph using formula (8).
Step 5: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^(-1/2)(D − W)D^(-1/2).
Step 6: Compute the eigenvectors u_1, …, u_k corresponding to the k largest eigenvalues of the matrix L_sym, and arrange these eigenvectors as columns to form the matrix U = [u_1, …, u_k].
Step 7: Normalize each row of U so that every row vector becomes a unit vector, obtaining the matrix Y.
Step 8: Treat each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm.
Step 9: If the i-th row of the matrix Y is assigned to the j-th class, assign the original data point x_i to the j-th class.
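Putting the sketches from the description together, Embodiment 1 (steps 1 to 9) could be exercised roughly as follows. This assumes the hypothetical helpers defined in the earlier sketches (discretize_equal_width, attribute_weights, weight_features, similarity_matrix, normalized_laplacian, spectral_embedding, cluster_embedding); the bin count and σ are illustrative choices, not values fixed by the patent.

import numpy as np

def feature_weighted_spectral_clustering(X, k, n_bins=5, sigma=1.0):
    X = np.asarray(X, dtype=float)
    X_disc = discretize_equal_width(X, n_bins)    # step 1: pre-processing
    w = attribute_weights(X_disc)                 # step 2: attribute importance
    X_weighted = weight_features(X, w)            # step 3: feature weighting
    W = similarity_matrix(X_weighted, sigma)      # step 4: similarity matrix
    L_sym = normalized_laplacian(W)               # step 5: Laplacian matrix
    Y = spectral_embedding(L_sym, k)              # steps 6-7: embedding matrix Y
    return cluster_embedding(Y, k)                # steps 8-9: cluster labels

# Hypothetical usage:
# labels = feature_weighted_spectral_clustering(X, k=3)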

Claims (9)

1. A feature-weighted spectral clustering method based on knowledge entropy, characterized in that the importance of each attribute of a data set is assessed using the definition of knowledge entropy in rough set theory, the importance values are then assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method based on the weighted data.
2. The method according to claim 1, characterized in that the data set is an n × l matrix in which each row represents a data point and each column represents an attribute, so that the matrix contains n data points each having l attributes, and can be expressed as χ = {x_1, x_2, …, x_n} (x_i ∈ R^l).
3. The method according to claim 1, characterized in that the knowledge entropy refers to H(Z) = Σ_{i=1}^{n} p(X_i)(1 − p(X_i)), where Z = {X_1, X_2, …, X_n} is a partition of the domain U and p(X_i) = |X_i|/|U| represents the probability of the equivalence class X_i in U.
4. The method according to claim 1, characterized in that the weighting refers to: separately computing the knowledge entropy of each attribute of the data set, normalizing the resulting knowledge entropies, taking the proportion accounted for by each knowledge entropy as the importance of the corresponding attribute, and then multiplying each attribute of the data set by its corresponding importance to complete the weighting of the data set.
5. The method according to claim 1 or 4, characterized in that the weighting comprises:
1: Data pre-processing: discretizing the continuous input data by selecting appropriate cut points to divide the value domain of each continuous attribute into several discrete sub-intervals, and then representing the attribute values in each sub-interval by different integer values.
2: Computing the knowledge entropy of each attribute: on the basis of the discretized attribute values, determining the partition of the sample data induced by the current attribute, the partition induced by attribute a_j being denoted U/a_j = {X_1, X_2, …, X_m}.
3: Determining the importance of each attribute: normalizing the knowledge entropies of all attributes and taking the proportion accounted for by each knowledge entropy as the importance of the corresponding attribute, the importance of any attribute a_j being w_j = H(a_j) / Σ_{t=1}^{l} H(a_t).
4: Weighting the attributes of the input data: assigning the attribute importances to the corresponding attributes of the data set as weights to obtain the weighted data set χ'.
6. The method according to claim 1, characterized in that the spectral clustering comprises:
1: Constructing a similarity matrix and a Laplacian matrix, performing an eigendecomposition of the Laplacian matrix, and computing its eigenvalues and eigenvectors.
2: Mapping each data point to a low-dimensional representative point according to the eigenvectors of the Laplacian matrix, and clustering these representative points.
7. The method according to claim 1 or 6, characterized in that the eigendecomposition comprises:
1: On the basis of χ', establishing the similarity matrix W ∈ R^(n×n) of the data points using a Gaussian kernel function, computing the degree d_i = Σ_{j=1}^{n} w_ij of each data point x_i from the matrix W, and forming the diagonal degree matrix D ∈ R^(n×n) from the degrees of the n data points.
2: From the similarity matrix W and the degree matrix D, constructing the Laplacian matrix L_sym = D^(-1/2)(D − W)D^(-1/2).
3: Computing the eigenvectors u_1, …, u_k corresponding to the k largest eigenvalues of the matrix L_sym, and arranging these eigenvectors as columns to form the matrix U = [u_1, u_2, …, u_k].
4: Normalizing each row of U so that every row vector becomes a unit vector, obtaining the matrix Y.
8. The method according to claim 1 or 6, characterized in that the clustering comprises:
1: Treating each row of the matrix Y as a point in the space R^k, and dividing these points into k classes using k-means or another algorithm.
2: If the i-th row of the matrix Y is assigned to the j-th class, assigning the original data point x_i to the j-th class.
9. A system for implementing the method of any one of the preceding claims, characterized by comprising an attribute weighting module, an eigendecomposition module and a clustering module, wherein the attribute weighting module computes the knowledge entropy and the importance of each attribute of the input data set and then assigns the importances to the corresponding attribute features as weights; the eigendecomposition module constructs the similarity matrix and the Laplacian matrix based on the weighted data set and performs an eigendecomposition of the Laplacian matrix; and the clustering module maps each data point into the feature space formed by the eigenvectors of the Laplacian matrix, clusters the representative points in the feature space, and outputs the clustering result.
CN201610514901.1A 2016-06-30 2016-06-30 Feature-weighted spectral clustering method and system based on knowledge entropy Pending CN107563399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610514901.1A CN107563399A (en) 2016-06-30 2016-06-30 Feature-weighted spectral clustering method and system based on knowledge entropy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610514901.1A CN107563399A (en) 2016-06-30 2016-06-30 Feature-weighted spectral clustering method and system based on knowledge entropy

Publications (1)

Publication Number Publication Date
CN107563399A true CN107563399A (en) 2018-01-09

Family

ID=60969619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610514901.1A Pending CN107563399A (en) 2016-06-30 2016-06-30 Feature-weighted spectral clustering method and system based on knowledge entropy

Country Status (1)

Country Link
CN (1) CN107563399A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347930A (en) * 2019-07-18 2019-10-18 杭州连银科技有限公司 A kind of high dimensional data based on statistical analysis technique is processed automatically and processing method
CN110706092A (en) * 2019-09-23 2020-01-17 深圳中兴飞贷金融科技有限公司 Risk user identification method and device, storage medium and electronic equipment
CN110706092B (en) * 2019-09-23 2021-05-18 前海飞算科技(深圳)有限公司 Risk user identification method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Yin et al. Incomplete multi-view clustering via subspace learning
US10019442B2 (en) Method and system for peer detection
Shen et al. Non-negative matrix factorization clustering on multiple manifolds
CN109242002A (en) High dimensional data classification method, device and terminal device
CN109886334B (en) Shared neighbor density peak clustering method for privacy protection
CN107341505B (en) Scene classification method based on image significance and Object Bank
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN111062428A (en) Hyperspectral image clustering method, system and equipment
Tang et al. Research on weeds identification based on K-means feature learning
Jia et al. A Feature Weighted Spectral Clustering Algorithm Based on Knowledge Entropy.
He et al. An effective information detection method for social big data
Lin et al. A tensor approach for uncoupled multiview clustering
CN115761275A (en) Unsupervised community discovery method and system based on graph neural network
Wang et al. Multi-manifold clustering
Liu et al. A high-order proximity-incorporated nonnegative matrix factorization-based community detector
Hou et al. Robust clustering of multi-type relational data via a heterogeneous manifold ensemble
Huang et al. Graph convolutional sparse subspace coclustering with nonnegative orthogonal factorization for large hyperspectral images
CN107563399A (en) The characteristic weighing Spectral Clustering and system of a kind of knowledge based entropy
Baswade et al. A comparative study of k-means and weighted k-means for clustering
CN109858543B (en) Image memorability prediction method based on low-rank sparse representation and relationship inference
Zhu et al. An improved fuzzy C-means clustering algorithm using Euclidean distance function
CN111401440A (en) Target classification recognition method and device, computer equipment and storage medium
Xuan et al. Subclass representation‐based face‐recognition algorithm derived from the structure scatter of training samples
Huang et al. A bipartite graph partition-based coclustering approach with graph nonnegative matrix factorization for large hyperspectral images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180109