CN107563399A - A feature-weighted spectral clustering method and system based on knowledge entropy - Google Patents
A feature-weighted spectral clustering method and system based on knowledge entropy
- Publication number
- CN107563399A (application CN201610514901.1A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- matrix
- data
- knowledge
- entropy
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of a data set is evaluated using the definition of knowledge entropy in rough set theory; the importance values are then assigned to the corresponding attribute features as weights, and the data points are clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the method makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data well, and has strong robustness and generalization ability.
Description
Technical field
The present invention relates to the fields of pattern recognition and machine learning, and in particular to a feature-weighted spectral clustering method and system based on knowledge entropy.
Background art
Cluster analysis is an important research topic in multivariate statistical analysis and pattern recognition. The purpose of clustering is to divide data points into several classes according to their intrinsic relationships, so that the similarity between data points within the same class is large while the similarity between data points in different classes is small [1]. Traditional clustering methods, such as the k-means algorithm and the FCM algorithm, achieve good clustering results on data sets with a convex spherical structure, but when the sample space is non-convex these algorithms easily fall into local optima.
In recent years, spectral clustering has attracted more and more attention in academia because of its good clustering performance and ease of implementation. Spectral clustering makes no assumption about the global structure of the data; it can cluster on a sample space of arbitrary shape and converges to a global optimum, which makes it particularly suitable for non-convex data sets. The idea of spectral clustering originates from spectral graph theory: the clustering problem is treated as a graph partitioning problem. Each point in the data set is taken as a vertex of a graph, and the similarity between any two points is taken as the weight of the edge connecting the two vertices, yielding an undirected weighted graph. There are many traditional graph partitioning criteria, such as minimum cut, ratio cut, normalized cut, and min-max cut. Minimizing the objective function of a graph partitioning criterion yields an optimal clustering, but the solution process usually requires an eigenvector decomposition of the graph Laplacian matrix, which gives the globally optimal solution of the objective function on a relaxed continuous domain. At present, spectral clustering has been successfully applied in fields such as speech separation, video indexing, text recognition, and image processing. Spectral clustering provides a new approach to clustering problems and can effectively handle many practical problems; its study has great scientific value and application potential.
However, when spectral clustering measures the similarity of data points, it generally assumes that every attribute of the data is equally important, with a default weight of 1. In fact, the attributes contain different amounts of information, and their contributions to clustering also differ. Moreover, real data sets often contain noise and irrelevant features, which easily cause the "curse of dimensionality", interfere with the clustering process, and degrade the accuracy of the clustering result. For example, suppose the data to be clustered have 20 attributes, of which only 2 are strongly related to the clustering, and the potential distance along these 2 attributes is the largest in the whole attribute space. In this case, if the attributes are not distinguished when computing similarity, the result is easily misleading. One effective way to overcome this problem is to add a weight parameter to each attribute so that different attributes play different roles in clustering; in terms of Euclidean space, this stretches the axes corresponding to relevant attributes and shrinks the axes corresponding to irrelevant ones.
For these reasons, various feature weighting methods have been proposed. They fall broadly into two categories: subjective weighting methods, in which the weights are obtained mainly from the empirical judgment of experts, and objective weighting methods, in which the weights are derived from the actual data of each index in the evaluated units.
Summary of the invention
To solve the above problems, the present invention proposes a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute is evaluated with concepts related to knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and a spectral clustering method is then applied to cluster the data points. The method makes full use of the information contained in each feature, eliminates the influence of redundant attributes on the clustering result, and has strong robustness and generalization ability.
The present invention is achieved by the following technical scheme.
The present invention relates to a feature-weighted spectral clustering method based on knowledge entropy, in which a weight parameter is added for each attribute so that different attributes play different roles in clustering. The weight parameters are obtained from the knowledge entropy in rough set theory and objectively reflect the ability of each attribute to partition the sample data; the weighted data are then processed with a spectral clustering method, and the clustering result is output.
The related definitions of the present invention are as follows:
Definition 1 (Knowledge). Let U be a nonempty finite set of objects, called a universe. Any subset X ⊆ U of the universe U is called a concept or category of U. Any family of subsets of U is called abstract knowledge about U, or knowledge for short.
Rough set theory mainly discusses the knowledge that forms partitions and coverings of the universe U; equivalence relations are generally used to represent classifications and knowledge.
Definition 2 (Knowledge base). Given a universe U and a family S of equivalence relations on U, the pair K = (U, S) is called a knowledge base, or approximation space, on U.
An equivalence relation on a universe represents a partition, and knowledge is what is derived from equivalence relations; a knowledge base is the collection of all knowledge derived from its equivalence relations. It embodies the classification ability of the equivalence relations over the universe and implicitly contains the various relationships among the pieces of knowledge in the knowledge base.
Definition 3 (Indiscernibility relation). Given a universe U and a family S of equivalence relations on U, if P ⊆ S and P ≠ ∅, then the intersection of all equivalence relations in P, denoted ∩P, is still an equivalence relation on U, called the indiscernibility relation on P and denoted IND(P); where no confusion arises, it is often abbreviated as P.
Here [x]_R (R ∈ P) is the set of all objects that satisfy the indiscernibility relation with object x, called the equivalence class of x determined by the equivalence relation R.
The set of all equivalence classes derived from IND(P) is denoted U/IND(P); it constitutes a partition of U, called the P-basic knowledge of the universe U. U/IND(P) may be abbreviated as U/P.
Definition 4 (Knowledge representation system). The quadruple KRS = (U, A, V, f) is called a knowledge representation system, where U is a nonempty finite set of objects, called the universe; A is a nonempty finite set of attributes comprising condition attributes C and decision attributes D, with A = C ∪ D and C ∩ D = ∅; V is the value domain of all attributes, V = ∪_{a∈A} V_a, where V_a denotes the value domain of attribute a ∈ A; and f denotes a mapping U × A → V, called the information function.
A knowledge representation system can be divided into two types: an information system (information table), in which the decision attribute set D = ∅, i.e. a knowledge representation system without decision attributes; and a decision system (decision table), in which D ≠ ∅, i.e. a knowledge representation system containing decision attributes.
Definition 5 (Information entropy). Let KRS = (U, A, V, f) be an information system and let P ⊆ A be an equivalence relation set on U whose derived partition of U is U/P = U/IND(P) = {X1, X2, …, Xn}. The information entropy of the knowledge U/P is then defined as
H(U/P) = −Σ_{i=1..n} p(Xi) log2 p(Xi),
where p(Xi) = |Xi| / |U| denotes the probability of the equivalence class Xi in U.
Theorem 1. Let KRS = (U, A, V, f) be an information system and let P, Q ⊆ A be equivalence relation sets. If the partition U/Q refines U/P (i.e. IND(Q) ⊆ IND(P)), then H(U/P) ≤ H(U/Q).
Theorem 1 shows that the finer the universe is partitioned, the larger the information entropy of the knowledge becomes. The information entropy obtained in this way is unbounded above, so it is not suitable for direct use as a weight. The present invention therefore proposes another measure of knowledge entropy; see Definition 6.
Definition 6 (Knowledge entropy). Let Z = {X1, X2, …, Xn} be a partition of the universe U. The knowledge entropy of Z is defined as
H(Z) = Σ_{i=1..n} p(Xi)(1 − p(Xi)),
where p(Xi) = |Xi| / |U|.
H(Z) describes the amount of information contained in the knowledge Z and has the following properties:
(1) 0 ≤ H(Z) < 1;
(2) H(Z) = 0 if and only if Z = {U};
(3) H(Z) attains its maximum 1 − 1/n if and only if |Xi| = |Xj| for all i, j = 1, 2, …, n.
This measure of knowledge entropy is easy to understand and cheap to compute. By property (1), if |Z| = n then H(Z) ≤ 1 − 1/n < 1, which better suits the value range required of a weight. Property (2) states that without a partition there is no uncertainty, so the knowledge entropy is 0. Property (3) shows that the knowledge entropy is largest when the partition is uniform, i.e. when the uncertainty is greatest; in that case a choice is hardest to make, and more knowledge is needed to remove the uncertainty.
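The definition and its properties can be checked numerically. The sketch below computes H(Z) = Σ p(Xi)(1 − p(Xi)) for the partition induced by a list of discrete attribute values; the function name is an assumption, not part of the patent.

```python
from collections import Counter

def knowledge_entropy(labels):
    """Knowledge entropy H(Z) = sum_i p_i * (1 - p_i) of the partition
    induced by a list of discrete attribute values (Definition 6)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Property (2): a single block ({U}) has zero entropy.
h_single = knowledge_entropy(['a'] * 8)
# Property (3): a uniform partition into n = 4 blocks attains 1 - 1/4 = 0.75.
h_uniform = knowledge_entropy(['a', 'b', 'c', 'd'] * 2)
# A skewed partition stays strictly below that maximum.
h_skewed = knowledge_entropy(['a'] * 6 + ['b', 'c'])
```

All values stay in [0, 1), which is what makes this measure convenient as a weight.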
The specific steps of the present invention are as follows:
Step 1. Based on knowledge entropy, compute the attribute importance of the input data set χ = {x1, x2, …, xn} (xi ∈ R^l), and weight each attribute of the data.
Step 1.1: Data preprocessing. Rough set methods can only handle discrete data. If the data are continuous, they must first be discretized: appropriate cut points are selected to divide the value domain of each continuous attribute into several discrete subintervals, and the attribute values in each subinterval are then represented by distinct integer values.
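As a sketch of this preprocessing step, the snippet below discretizes one continuous attribute by equal-width binning. The patent does not fix a particular discretization scheme, so the cut-point choice, bin count, and function name here are assumptions.

```python
import numpy as np

def discretize(column, n_bins=4):
    """Equal-width discretization of one continuous attribute: split its
    value range into n_bins subintervals and label each value with the
    integer index of the subinterval it falls into."""
    column = np.asarray(column, dtype=float)
    edges = np.linspace(column.min(), column.max(), n_bins + 1)
    # np.digitize against the interior cut points maps values to 0 .. n_bins-1.
    return np.digitize(column, edges[1:-1])

codes = discretize([0.1, 0.2, 0.9, 1.5, 3.8, 4.0], n_bins=4)
```

After this step every attribute is an integer code, which is all the knowledge-entropy computation needs.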
Step 1.2: Compute the knowledge entropy of each attribute. On the basis of the discretized attribute values, determine the partition of the sample data induced by the current attribute; the partition induced by attribute a_j is U/a_j = {X1, X2, …, Xm}. Its knowledge entropy is then computed according to Definition 6:
H(a_j) = Σ_{i=1..m} p(Xi)(1 − p(Xi)). (5)
Step 1.3: Determine the importance of each attribute. The knowledge entropies of all attributes are normalized, and the proportion of each attribute's knowledge entropy is taken as its importance:
w_j = H(a_j) / Σ_{k=1..l} H(a_k). (6)
Step 1.4: Weight the attributes of the input data. The attribute importances are assigned to the corresponding attribute features as weights, yielding the feature-weighted data set χ'.
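Steps 1.2 to 1.4 can be sketched as follows: compute the knowledge entropy of each (discretized) attribute, normalize as in formula (6) to obtain the importance weights, and multiply the features by those weights. The function names and the toy data are assumptions.

```python
import numpy as np
from collections import Counter

def knowledge_entropy(labels):
    """H(Z) = sum_i p_i * (1 - p_i) of the partition induced by `labels`."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def attribute_weights(discrete_data):
    """Per-attribute weights: knowledge entropy of each column, normalized
    so the weights sum to 1 (Steps 1.2-1.3, formulas (5) and (6))."""
    data = np.asarray(discrete_data)
    h = np.array([knowledge_entropy(data[:, j].tolist())
                  for j in range(data.shape[1])])
    return h / h.sum()

# Column 0 splits the samples into two even groups; column 1 is constant,
# so it carries no partitioning information and should get weight 0.
X_disc = np.array([[0, 5], [0, 5], [1, 5], [1, 5]])
w = attribute_weights(X_disc)
# Step 1.4: apply the weights to the (original, continuous) features.
X_weighted = np.array([[0.0, 5.0], [0.1, 5.0], [1.0, 5.0], [1.1, 5.0]]) * w
```

Note that the weights are estimated from the discretized copy of the data but applied to the original features.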
Step 2. Construct the similarity matrix and the Laplacian matrix, perform an eigendecomposition of the Laplacian matrix, and compute its eigenvalues and eigenvectors.
Step 2.1: On the basis of χ', establish the similarity matrix W ∈ R^{n×n} of the data points using a Gaussian kernel function:
w_ij = exp(−d(x_i, x_j)² / (2σ²)), (7)
where d(x_i, x_j) is the Euclidean distance between points x_i and x_j, and σ is a scale parameter controlling how fast the similarity w_ij decays with d(x_i, x_j).
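A minimal sketch of Step 2.1, taking formula (7) as written; the function name is an assumption.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Similarity matrix w_ij = exp(-d(x_i, x_j)^2 / (2 sigma^2)), formula (7)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared Euclidean distances
    d2 = np.maximum(d2, 0.0)                      # guard against tiny negatives
    return np.exp(-d2 / (2 * sigma ** 2))

W = gaussian_similarity([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], sigma=1.0)
```

Nearby points get similarity close to 1, distant points similarity close to 0, and the matrix is symmetric with unit diagonal.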
Step 2.2: Compute the degree matrix D ∈ R^{n×n} from the similarities of the data points. Spectral clustering regards the clustering problem as a graph partitioning problem: each point x_i ∈ χ is a vertex of the graph, and the similarity w_ij between any two points (x_i, x_j) is the weight of the edge connecting the two vertices i and j. The sum of the weights of the edges incident to vertex i is defined as the degree of vertex i, denoted d_i:
d_i = Σ_{j=1..n} w_ij. (8)
The degrees of the n data points form the matrix D ∈ R^{n×n}, a diagonal matrix whose diagonal elements are d_i and whose off-diagonal elements are 0.
Step 2.3: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2}.
Step 2.4: Compute the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, then arrange these eigenvectors as columns to form the matrix U = [u1, …, uk] ∈ R^{n×k}.
Step 2.5: Normalize each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y: y_ij = u_ij / (Σ_{r=1..k} u_ir²)^{1/2}.
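Steps 2.2 to 2.5 can be sketched with NumPy as below. Note that the standard Ng-Jordan-Weiss construction takes the eigenvectors for the k smallest eigenvalues of L_sym (equivalently, the k largest eigenvalues of D^{−1/2}WD^{−1/2}); the sketch follows that convention. Function names are assumptions.

```python
import numpy as np

def spectral_embedding(W, k):
    """Degree matrix, normalized Laplacian L_sym = D^{-1/2}(D - W)D^{-1/2},
    eigendecomposition, and row normalization (Steps 2.2-2.5). Uses the k
    eigenvectors for the smallest eigenvalues of L_sym."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                                 # degrees, formula (8)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                    # eigh: ascending eigenvalues
    U = vecs[:, :k]                                   # columns u_1, ..., u_k
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-length rows
    return Y

# Two obvious clusters: {0, 1} strongly connected, {2, 3} strongly connected.
W = np.array([[0.0, 1.0, 0.01, 0.01],
              [1.0, 0.0, 0.01, 0.01],
              [0.01, 0.01, 0.0, 1.0],
              [0.01, 0.01, 1.0, 0.0]])
Y = spectral_embedding(W, k=2)
```

In the embedded space the rows of Y for points in the same cluster coincide, while rows from different clusters are far apart, which is what makes the final k-means step easy.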
Step 3. According to the eigenvectors of the Laplacian matrix, map each data point to a low-dimensional representative point, and cluster these representative points.
Step 3.1: Regard each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm.
Step 3.2: If the i-th row of the matrix Y is assigned to the j-th class, the original data point x_i is assigned to the j-th class.
In summary, the present application provides a feature-weighted spectral clustering method and system based on knowledge entropy. The importance of each attribute of the data set is evaluated using the definition of knowledge entropy in rough set theory, the importance values are assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method on the weighted data. By weighting the attributes of the data through knowledge entropy, the application makes full use of the information contained in each attribute, weakens the interference of noisy data and redundant attributes on clustering, handles high-dimensional data well, and has strong robustness and generalization ability.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature-weighted spectral clustering method based on knowledge entropy provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Embodiment 1
As shown in Fig. 1, the present embodiment comprises the following steps:
Input: data set χ = {x1, x2, …, xn} (xi ∈ R^l) and number of clusters k.
Output: a partition of the data into k classes.
Step 1: Preprocess the sample data. According to the specific problem under study, transform the attribute values of the samples into a data format suitable for rough set methods, e.g. discretize continuous attributes.
Step 2: Compute the importance of the sample attributes. Compute the knowledge entropy of each attribute using formula (5), then determine the importance of each attribute according to formula (6).
Step 3: Weight the sample attributes. Assign the attribute importances to the corresponding attribute features as weights to obtain the feature-weighted data set χ'.
Step 4: On the basis of χ', establish the similarity matrix W ∈ R^{n×n} of the data points using formula (7), and establish the degree matrix D ∈ R^{n×n} of the graph using formula (8).
Step 5: From the similarity matrix W and the degree matrix D, construct the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2}.
Step 6: Compute the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, then arrange these eigenvectors as columns to form the matrix U.
Step 7: Normalize each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y.
Step 8: Regard each row of the matrix Y as a point in the space R^k, and divide these points into k classes using k-means or another algorithm.
Step 9: If the i-th row of the matrix Y is assigned to the j-th class, assign the original data point x_i to the j-th class.
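Under simplifying assumptions (equal-width discretization, a deterministic farthest-first k-means initialization, and the k smallest eigenvectors of L_sym), Embodiment 1 can be assembled end to end as follows. This is an illustrative sketch, not the patent's reference implementation; all names are assumptions.

```python
import numpy as np

def kmeans(Y, k, iters=50):
    """Tiny k-means for Step 8 with deterministic farthest-first seeding
    (any standard k-means implementation would do)."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

def entropy_weighted_spectral_clustering(X, k, sigma=1.0, n_bins=4):
    """End-to-end sketch of Embodiment 1, Steps 1-9."""
    X = np.asarray(X, dtype=float)
    n, l = X.shape
    # Steps 1-3: discretize, weight attributes by normalized knowledge entropy.
    w = np.empty(l)
    for j in range(l):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        codes = np.digitize(X[:, j], edges[1:-1])
        _, counts = np.unique(codes, return_counts=True)
        p = counts / n
        w[j] = np.sum(p * (1 - p))          # formula (5)
    w = w / w.sum()                          # formula (6)
    Xw = X * w
    # Step 4: similarity matrix (formula (7)) and degrees (formula (8)).
    sq = (Xw ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Xw @ Xw.T, 0.0)
    W = np.exp(-d2 / (2 * sigma ** 2))
    d = W.sum(axis=1)
    # Steps 5-7: normalized Laplacian, eigenvectors, row normalization.
    dis = 1.0 / np.sqrt(d)
    L = np.eye(n) - dis[:, None] * W * dis[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Steps 8-9: cluster the rows of Y and carry the labels back to the x_i.
    return kmeans(Y, k)

labels = entropy_weighted_spectral_clustering(
    np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]]), k=2, sigma=1.0)
```

On the toy data the two groups of points end up in two different classes, as expected.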
Claims (9)
1. A feature-weighted spectral clustering method based on knowledge entropy, characterized in that the importance of each attribute of a data set is evaluated using the definition of knowledge entropy in rough set theory, the importance values are then assigned to the corresponding attribute features as weights, and the data points are then clustered with a spectral clustering method on the weighted data.
2. The method according to claim 1, characterized in that the data set is an n × l matrix in which each row represents a data point and each column represents an attribute; the matrix thus contains n data points, each with l attributes, and can be expressed as χ = {x1, x2, …, xn} (xi ∈ R^l).
3. The method according to claim 1, characterized in that the knowledge entropy refers to
H(Z) = Σ_{i=1..n} p(Xi)(1 − p(Xi)),
where Z = {X1, X2, …, Xn} is a partition of the universe U and p(Xi) = |Xi| / |U| denotes the probability of the equivalence class Xi in U.
4. The method according to claim 1, characterized in that the weighting refers to: computing the knowledge entropy of each attribute of the data set, normalizing the obtained knowledge entropies, taking the proportion of each attribute's knowledge entropy as its importance, and then multiplying each attribute of the data set by its importance to complete the weighting of the data set.
5. The method according to claim 1 or 4, characterized in that the weighting comprises:
1) data preprocessing: discretizing the continuous input data by selecting appropriate cut points to divide the value domain of each continuous attribute into several discrete subintervals, and then representing the attribute values in each subinterval by distinct integer values;
2) computing the knowledge entropy of each attribute: on the basis of the discretized attribute values, determining the partition of the sample data induced by the current attribute, the partition induced by attribute a_j being U/a_j = {X1, X2, …, Xm}, and computing its knowledge entropy H(a_j);
3) determining the importance of each attribute: normalizing the knowledge entropies of all attributes and taking the proportion of each attribute's knowledge entropy as its importance, the importance of any attribute a_j being w_j = H(a_j) / Σ_k H(a_k);
4) weighting the attributes of the input data: assigning the attribute importances to the corresponding attributes of the data set as weights, yielding the weighted data set χ'.
6. The method according to claim 1, characterized in that the spectral clustering comprises:
1) constructing the similarity matrix and the Laplacian matrix, performing an eigendecomposition of the Laplacian matrix, and computing its eigenvalues and eigenvectors;
2) mapping each data point to a low-dimensional representative point according to the eigenvectors of the Laplacian matrix, and clustering these representative points.
7. The method according to claim 1 or 6, characterized in that the eigendecomposition comprises:
1) on the basis of χ', establishing the similarity matrix W ∈ R^{n×n} of the data points using a Gaussian kernel function, computing the degree d_i = Σ_j w_ij of each data point x_i from the matrix W, and forming the diagonal degree matrix D ∈ R^{n×n} from the degrees of the n data points;
2) constructing the Laplacian matrix L_sym = D^{−1/2}(D − W)D^{−1/2} from the similarity matrix W and the degree matrix D;
3) computing the eigenvectors u1, …, uk corresponding to the first k largest eigenvalues of the matrix L_sym, and arranging these eigenvectors as columns to form the matrix U = [u1, …, uk] ∈ R^{n×k};
4) normalizing each row of the matrix U, transforming each row vector into a unit vector, to obtain the matrix Y.
8. The method according to claim 1 or 6, characterized in that the clustering comprises:
1) regarding each row of the matrix Y as a point in the space R^k and dividing these points into k classes using k-means or another algorithm;
2) if the i-th row of the matrix Y is assigned to the j-th class, assigning the original data point x_i to the j-th class.
9. A system for implementing the method of any one of the preceding claims, characterized by comprising an attribute weighting module, an eigendecomposition module, and a clustering module, wherein the attribute weighting module computes the knowledge entropy and importance of each attribute of the input data set and then assigns the importances to the corresponding attribute features as weights; the eigendecomposition module constructs the similarity matrix and the Laplacian matrix from the weighted data set and performs an eigendecomposition of the Laplacian matrix; and the clustering module maps each data point into the feature space formed by the eigenvectors of the Laplacian matrix, clusters the representative points in that feature space, and outputs the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610514901.1A | 2016-06-30 | 2016-06-30 | A feature-weighted spectral clustering method and system based on knowledge entropy
Publications (1)
Publication Number | Publication Date
---|---
CN107563399A | 2018-01-09
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110347930A * | 2019-07-18 | 2019-10-18 | 杭州连银科技有限公司 | Automatic processing method for high-dimensional data based on statistical analysis
CN110706092A * | 2019-09-23 | 2020-01-17 | 深圳中兴飞贷金融科技有限公司 | Risk user identification method and device, storage medium and electronic equipment
CN110706092B * | 2019-09-23 | 2021-05-18 | 前海飞算科技(深圳)有限公司 | Risk user identification method and device, storage medium and electronic equipment
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication

Application publication date: 2018-01-09