CN107704872A - Method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension - Google Patents
Method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension
- Publication number
- CN107704872A CN107704872A CN201710844898.4A CN201710844898A CN107704872A CN 107704872 A CN107704872 A CN 107704872A CN 201710844898 A CN201710844898 A CN 201710844898A CN 107704872 A CN107704872 A CN 107704872A
- Authority
- CN
- China
- Prior art keywords
- dimension
- data
- relatively
- discrete
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for selecting the initial center points of K-means clustering based on splitting along the relatively most discrete dimension. The idea of the method is as follows: given a D-dimensional data set, s1. apply dimensionality reduction to the data set; s2. evaluate the dispersion of each dimension of the reduced data set; s3. select the relatively most discrete dimension and split all data into two classes at the mean point of that dimension; s4. choose the class containing the most data points after splitting, select its relatively most discrete dimension as in s2 and s3, and continue splitting at the mean point of that dimension, repeating these steps until the required number of classes is reached; s5. compute the mean of the data in each resulting class; s6. apply a dimension-lifting operation to the mean of each class and use the results as the initial center points of K-means clustering. The beneficial effects of the invention are: the reduced-dimension data lowers the amount of computation and speeds up the operation, so that K-means clustering can reach a higher clustering accuracy with fewer iterations.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension.
Background technology
The process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering. Clustering assigns a group of individuals to categories according to their similarity: "birds of a feather flock together". Its purpose is to make the distance between individuals of the same category as small as possible and the distance between individuals of different categories as large as possible. Each class is also called a cluster; the similarity of objects within a cluster is high, while the similarity of objects between clusters is low. On this basis, clustering algorithms can be divided into partition-based, density-based, hierarchical, grid-based, and other families.
K-means is a classic partition-based clustering algorithm. Because it is simple and effective, it is widely used in tasks such as data mining, machine learning, and pattern recognition.
The K-means algorithm is a hard clustering algorithm and a representative of prototype-based objective-function clustering methods. It takes the distance from each data point to a prototype as the objective function to be optimized, and the conditions for the function's extremum yield the update rules of the iterative computation. K-means uses the Euclidean distance as its similarity measure and seeks the optimal classification corresponding to an initial cluster-center vector V such that the evaluation index J is minimized. The algorithm uses the sum-of-squared-errors criterion as its clustering criterion function.
The basic principle of K-means is as follows:
Let the data set to be clustered be X = {xi | xi ∈ R^D, i = 1, 2, 3, ..., N};
the number of cluster classes is K, and the K cluster centers are C1, C2, ..., CK.
Randomly select K initial centers from the N data objects.
Compute the distance between each object and each cluster center (mean point), and assign each object to the cluster of the nearest center.
The distance between objects is defined as the Euclidean distance; for any two data points xi and xj it is
d(xi, xj) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xiD - xjD)^2 ).
Recalculate the mean of each cluster.
Repeat the assignment and mean-update steps until the objective function no longer changes.
The sum of the Euclidean distances between each object in the data set and the center of its cluster is called the objective function, usually denoted by the letter J.
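As a concrete illustration of the loop above (not part of the patent text), a minimal NumPy sketch; the function name `kmeans` and its defaults are our own:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means as described above: random initial centers,
    nearest-center assignment, mean update, until J stops changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    prev_J = np.inf
    for _ in range(max_iter):
        # Euclidean distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        J = d[np.arange(len(X)), labels].sum()  # objective function J
        if J == prev_J:          # assignments stable -> converged
            break
        prev_J = J
        # recompute each cluster mean (keep old center if a cluster empties)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return centers, labels
```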
The above is the traditional K-means algorithm. Traditional K-means is easily affected by the initial center points: if the initial centers are chosen badly, the number of iterations may increase, raising the amount of computation, and the clustering may even fall into a local optimum and fail to reach the desired result.
Content of the invention
An embodiment of the present invention proposes a K-means initial-center-point selection method that can reduce the number of iterations of the K-means algorithm.
An embodiment of the present invention proposes a K-means initial-center-point selection method; with this method, the K-means clustering result is more accurate and the algorithm does not get trapped in a local optimum.
In the traditional K-means algorithm, the initial cluster centers are assigned randomly, which increases the number of iterations of the K-means clustering process and can trap it in a local optimum. Therefore, in the present invention, an initial-center-point selection method is designed as follows:
Given any data set U to be clustered (containing N D-dimensional data points) and a cluster number K:
First apply dimensionality reduction to the data set to obtain a data set Y (containing N d-dimensional data points, d <= D). Linearly correlated dimensions may exist in the original data; by transforming the correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted.
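The patent names no specific dimensionality-reduction technique, but the description (transforming linearly correlated dimensions into a linearly independent representation that extracts principal feature components) matches principal component analysis. A minimal sketch under that assumption, using scikit-learn; the helper name `reduce_dimensions` is ours:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dimensions(U, d):
    """Map the N x D data set U to an N x d representation whose
    dimensions are linearly independent (the principal components).
    The fitted PCA object is kept so that cluster centers can later be
    lifted back to D dimensions with pca.inverse_transform."""
    pca = PCA(n_components=d)
    Y = pca.fit_transform(U)
    return Y, pca
```

Keeping the fitted transform around is the point: the same matrix used for reduction is reused for the dimension-lifting step at the end of the method.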
Evaluate the dispersion of each dimension of the data set Y, namely compute the variance of each dimension. Let xij denote the value of the j-th dimension of the i-th data point, and let Xj denote the values of all data points in the j-th dimension.
The dispersion is computed as follows:
Var(Xj) = (1/N) * sum over i = 1..N of (xij - mean(Xj))^2
where Var(·) denotes the variance. After computing the variance of every dimension of Y, the dimension with the largest variance is taken as the most discrete dimension.
Denote the data of the most discrete dimension by S (S is an N × 1 vector).
Compute the mean of the vector S: Ms = mean(S).
The data points whose values in S are greater than Ms are placed into the first box, and the data points whose values in S are less than Ms are placed into the second box.
Count the number of data points in each box and select the box containing the most data points as the data set to be operated on next, denoted box_max.
Following the steps above, box_max is in turn divided into two boxes, and the procedure is repeated until the number of boxes equals the cluster number K.
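The box-splitting procedure described above can be sketched as follows. The helper name `split_into_boxes` is ours, and points exactly equal to the mean are placed in the second box here so that none are lost (the patent text only speaks of "greater than" and "less than"):

```python
import numpy as np

def split_into_boxes(Y, K):
    """Repeatedly split the largest box at the mean of its most
    discrete (largest-variance) dimension until there are K boxes."""
    boxes = [Y]
    while len(boxes) < K:
        # box_max: the box with the most data points
        i = max(range(len(boxes)), key=lambda b: len(boxes[b]))
        box_max = boxes.pop(i)
        j = int(np.argmax(box_max.var(axis=0)))  # most discrete dimension
        S = box_max[:, j]
        Ms = S.mean()
        boxes.append(box_max[S > Ms])    # first box
        boxes.append(box_max[S <= Ms])   # second box
    return boxes
```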
Compute the mean of the data in each box: Mb = [Mb1, Mb2, ..., MbK]^T, where Mb is a K × d matrix.
Using the dimension-reduction transformation described above, lift Mb back to the dimensionality of the original data to obtain Cb = [Cb1, Cb2, ..., CbK]^T, where Cb is a K × D matrix.
This Cb is the set of initial centers for K-means clustering.
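Putting the steps together, a self-contained sketch of the whole initializer, under the assumptions that the dimension reduction is PCA (the patent names no specific technique), that its `inverse_transform` realizes the dimension-lifting operation, and with the function name `initial_centers` being ours:

```python
import numpy as np
from sklearn.decomposition import PCA

def initial_centers(U, K, d):
    """End-to-end sketch of the described initializer: reduce U to d
    dimensions, split along most-discrete dimensions into K boxes,
    average each box, and lift the means back to D dimensions."""
    pca = PCA(n_components=d)
    Y = pca.fit_transform(U)                 # N x d reduced data
    boxes = [Y]
    while len(boxes) < K:
        i = max(range(len(boxes)), key=lambda b: len(boxes[b]))
        box = boxes.pop(i)                   # box_max
        j = int(np.argmax(box.var(axis=0)))  # most discrete dimension
        mask = box[:, j] > box[:, j].mean()
        boxes.append(box[mask])
        boxes.append(box[~mask])
    Mb = np.array([b.mean(axis=0) for b in boxes])  # K x d box means
    Cb = pca.inverse_transform(Mb)                  # lift to K x D
    return Cb
```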
Brief description of the drawings
Fig. 1 is a flow chart of the K-means initial-center selection method based on splitting along the relatively most discrete dimension in the embodiment of the present invention.
Fig. 2 is a visualization of a two-dimensional data set.
Fig. 3 compares the initial center points selected with the initial-center selection method of the present invention against the center points after clustering; the small red circles denote the initial center points chosen by the method of the invention, and the small black boxes denote the per-class centers after clustering.
Fig. 4 is the clustering result obtained from the initial centers selected by the method of the invention.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
The implementation and flow are as follows:
For ease of visualization, only two-dimensional data points are used in this example. Given a data set U (420 two-dimensional data points) and a cluster number of 4:
Because data set U is two-dimensional, no dimensionality reduction is performed; this does not affect the validity of the method when high-dimensional data is reduced.
The relatively most discrete dimension of the two-dimensional data is evaluated, and the data set is split into two along that dimension.
The data set containing more points of the two is chosen, its relatively most discrete dimension is evaluated, and it is split again; at this point there are 3 data sets.
Continuing the above steps, the data set containing the most points of the three is chosen, its relatively most discrete dimension is evaluated, and it is split into two, giving four data sets, consistent with the number of clusters.
The mean of each of the four data sets is computed. Because no dimensionality reduction was applied to U at the beginning, no dimension-lifting operation is needed here; the means of the four data sets are the K-means initial centers. The initial centers chosen by the method of the invention are shown in Fig. 3.
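For this two-dimensional embodiment, the same procedure can be sketched without any dimensionality reduction. The helper name `initial_centers_2d` is ours, and feeding the result to an off-the-shelf K-means (e.g. scikit-learn's `init=` parameter) is one way to use it:

```python
import numpy as np

def initial_centers_2d(U, K):
    """2-D variant of the embodiment: no dimensionality reduction,
    just repeated splitting at the mean of the higher-variance dimension."""
    boxes = [U]
    while len(boxes) < K:
        i = max(range(len(boxes)), key=lambda b: len(boxes[b]))
        box = boxes.pop(i)
        j = int(np.argmax(box.var(axis=0)))      # most discrete dimension
        mask = box[:, j] > box[:, j].mean()
        boxes.append(box[mask])
        boxes.append(box[~mask])
    return np.array([b.mean(axis=0) for b in boxes])
```

The returned K × 2 array of centers can seed scikit-learn directly, e.g. `KMeans(n_clusters=4, init=centers, n_init=1).fit(U)`.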
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
Claims (4)
1. A K-means cluster-center-point initialization method, characterized in that the method comprises: for any given D-dimensional data set containing N data points, transforming the data set into a representation in which the dimensions are mutually linearly independent, usable for extracting the principal feature components of the data set, i.e., applying dimensionality reduction to the data set; then selecting the relatively most discrete dimension of the data set and splitting at its mean point; after the required number of cluster classes has been obtained by splitting, computing the mean of each resulting class and applying a dimension-lifting operation, the resulting data points serving as the initial center points of K-means clustering.
2. The K-means cluster-center-point initialization method according to claim 1, characterized in that the method of splitting along the relatively most discrete dimension comprises: evaluating the dispersion of each dimension; for the relatively most discrete dimension, using the mean of that dimension as the threshold point, splitting the data set into two classes along that dimension and placing the data points of the two classes into two boxes respectively; then choosing, of the two boxes, the box with more data points as the next data set to be operated on, denoted box_max; for box_max, again computing the variance of each dimension, choosing the most discrete dimension as described above, and splitting box_max along that dimension so that it is again divided into two boxes; and repeating the above procedure until the number of boxes equals the cluster number K.
3. The relatively-most-discrete splitting method according to claim 2, characterized in that the method of determining the relatively most discrete dimension comprises: before splitting the data, normalizing each dimension (because the dimensions may not share the same order of magnitude); after normalization, computing the variance of each dimension and selecting the dimension with the largest variance as the relatively most discrete dimension.
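One reading of claim 3 as code. Min-max normalization is our assumption (z-score normalization would make every variance equal to 1 and defeat the comparison); the claim only says the dimensions may not share the same order of magnitude:

```python
import numpy as np

def most_discrete_dimension(Y):
    """Min-max normalize each dimension so their scales are comparable,
    then pick the dimension with the largest variance.
    Min-max (rather than z-score) normalization is an assumption here."""
    lo = Y.min(axis=0)
    span = Y.max(axis=0) - lo
    span[span == 0] = 1.0                 # guard against constant dimensions
    Z = (Y - lo) / span
    return int(np.argmax(Z.var(axis=0)))
```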
4. The K-means cluster-center-point initialization method according to claim 1, characterized in that the dimensionality-reduction and dimension-lifting method comprises: linearly correlated dimensions may exist in the original data set; by transforming the correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted; and the transformation matrix used in the dimensionality reduction can later be used in the dimension-lifting operation to restore the initial cluster-center points obtained after reduction to the original dimensionality D.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710844898.4A CN107704872A (en) | 2017-09-19 | 2017-09-19 | Method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107704872A true CN107704872A (en) | 2018-02-16 |
Family
ID=61172890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710844898.4A Pending CN107704872A (en) | 2017-09-19 | 2017-09-19 | Method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107704872A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108717465A (en) * | 2018-06-04 | 2018-10-30 | 哈尔滨工程大学 | Subgroup based on user behavior analysis finds method |
CN109271555A (en) * | 2018-09-19 | 2019-01-25 | 上海哔哩哔哩科技有限公司 | Information cluster method, system, server and computer readable storage medium |
CN111737469A (en) * | 2020-06-23 | 2020-10-02 | 中山大学 | Data mining method and device, terminal equipment and readable storage medium |
CN113780404A (en) * | 2020-01-14 | 2021-12-10 | 支付宝(杭州)信息技术有限公司 | Resource data processing method and device |
- 2017-09-19: CN patent application CN201710844898.4A filed (published as CN107704872A); status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107704872A (en) | Method for selecting K-means clustering initial centers based on splitting along the relatively most discrete dimension | |
CN109871860A (en) | A kind of daily load curve dimensionality reduction clustering method based on core principle component analysis | |
CN110674407A (en) | Hybrid recommendation method based on graph convolution neural network | |
CN107291895B (en) | Quick hierarchical document query method | |
Pardeshi et al. | Improved k-medoids clustering based on cluster validity index and object density | |
CN108846338A (en) | Polarization characteristic selection and classification method based on object-oriented random forest | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN107316053A (en) | A kind of cloth image Rapid matching search method | |
CN106156374A (en) | A kind of view-based access control model dictionary optimizes and the image search method of query expansion | |
JP4937395B2 (en) | Feature vector generation apparatus, feature vector generation method and program | |
CN104346459A (en) | Text classification feature selecting method based on term frequency and chi-square statistics | |
CN103778206A (en) | Method for providing network service resources | |
Tidake et al. | Multi-label classification: a survey | |
CN111797267A (en) | Medical image retrieval method and system, electronic device and storage medium | |
CN107391594A (en) | A kind of image search method based on the sequence of iteration vision | |
Mandal et al. | Unsupervised non-redundant feature selection: a graph-theoretic approach | |
CN111080351A (en) | Clustering method and system for multi-dimensional data set | |
Mohammed et al. | Weight-based firefly algorithm for document clustering | |
Vijay et al. | Hamming distance based clustering algorithm | |
Jiang et al. | A hybrid clustering algorithm | |
Devi et al. | A proficient method for text clustering using harmony search method | |
Hu et al. | A Novel clustering scheme based on density peaks and spectral analysis | |
Altintakan et al. | An improved BOW approach using fuzzy feature encoding and visual-word weighting | |
CN107423379B (en) | Image search method based on CNN feature words tree | |
Duchscherer | Classifying Building Usages: A Machine Learning Approach on Building Extractions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20180216 |