CN107704872A - A K-means cluster initial center selection method based on relatively most discrete dimension splitting - Google Patents

A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Info

Publication number
CN107704872A
CN107704872A CN201710844898.4A
Authority
CN
China
Prior art keywords
dimension
data
relatively
discrete
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710844898.4A
Other languages
Chinese (zh)
Inventor
吴造林
胡长俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN201710844898.4A priority Critical patent/CN107704872A/en
Publication of CN107704872A publication Critical patent/CN107704872A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for selecting the initial center points of K-means clustering based on splitting along the relatively most discrete dimension. The idea of the method is as follows: given a D-dimensional data set, s1. apply dimensionality reduction to the data set; s2. evaluate the degree of dispersion of each dimension of the reduced data set; s3. select the relatively most discrete dimension for splitting, and divide all data into two classes at that dimension's mean point; s4. choose the class containing the most data points among the classes produced by the split, select the relatively most discrete dimension again according to s2 and s3, and continue splitting at the mean point of that most discrete dimension, repeating the above steps until the required number of classes is obtained; s5. compute the mean of the data in each resulting class; s6. apply a dimension-raising operation to the mean of each class, and use the results as the initial center points for K-means clustering. The beneficial effects of the invention are: the reduced-dimension data lowers the amount of computation and speeds up the operation, so that K-means clustering can reach higher clustering accuracy with fewer iterations.

Description

A K-means cluster initial center selection method based on relatively most discrete dimension splitting
Technical field
The present invention relates to the field of data mining technology, and in particular to a K-means cluster initial center selection method based on splitting along the relatively most discrete dimension.
Background technology
The process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering.
Clustering assigns a group of individuals to categories according to similarity, i.e., "birds of a feather flock together". Its purpose is to make the distance between individuals of the same category as small as possible, and the distance between individuals of different categories as large as possible. Each class is also called a cluster: the similarity of objects within a cluster is high, while the similarity of objects between clusters is low. According to this characteristic, clustering algorithms can be divided into partition-based, density-based, hierarchical, grid-based, and other families.
K-means is a classic partition-based clustering algorithm; because it is simple and effective, it is widely used in tasks such as data mining, machine learning, and pattern recognition.
K-means is a hard clustering algorithm and a typical representative of prototype-based objective-function clustering methods: it takes the distance from each data point to a prototype as the objective function to be optimized, and derives the iterative update rules by seeking an extremum of that function. K-means uses Euclidean distance as its similarity measure and seeks the optimal classification corresponding to some initial cluster center vector V such that the evaluation index J is minimized. The algorithm uses the sum-of-squared-errors criterion function as its clustering criterion function.
The basic principle of K-means is as follows:
Let the data set to be clustered be X = { x_i | x_i ∈ R^D, i = 1, 2, 3, ..., N };
the number of cluster classes is K, and the K cluster centers are C_1, C_2, ..., C_K.
Randomly select K initial centers from the N data objects.
Compute the distance between each object and each cluster center (mean point), and assign each object to the cluster with the smallest distance.
The distance between objects is defined as the Euclidean distance; for any two data points x_i and x_j, the Euclidean distance is d(x_i, x_j) = sqrt( Σ_{k=1}^{D} (x_ik - x_jk)^2 ).
Recompute the mean of each cluster.
Repeat the assignment and mean-update steps [0010]~[0012] until the objective function no longer changes.
The sum of the Euclidean distances from each object in the data set to the center of the cluster it belongs to is called the objective function, usually denoted by the letter J.
The above is the traditional K-means algorithm. Traditional K-means is easily affected by the initial center points: if the initial centers are chosen poorly, the number of iterations may increase, raising the amount of computation, and the clustering may even fall into a locally optimal solution and fail to reach the desired result.
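The traditional algorithm described above can be sketched as follows. This is a minimal illustrative implementation only; the function name `kmeans` and the use of NumPy are choices of this sketch, not part of the patent:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Traditional K-means (random initial centers), as described above."""
    rng = np.random.default_rng(seed)
    # randomly select K initial centers from the N data objects
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the mean of each cluster (keep old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # stop when the centers (and hence the objective) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The empty-cluster fallback is an assumption of this sketch; the patent's description does not address that edge case.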
Summary of the invention
An implementation of the present invention proposes a K-means initial center point selection method that can reduce the number of iterations of the K-means algorithm.
An implementation of the present invention proposes a K-means initial center point selection method with which the K-means clustering results achieve higher accuracy and do not fall into local optima.
In the traditional K-means algorithm, the initial cluster centers are assigned at random, which increases the number of iterations during clustering and can trap the algorithm in a local optimum. Therefore, the present invention designs an initial center point selection method as follows:
Given any data set U to be clustered (containing N D-dimensional data points) and a cluster number K:
First apply dimensionality reduction to the data set to obtain a data set X (containing N d-dimensional data points, d <= D). Linearly correlated dimensions may exist in the original data set; by transforming the linearly correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted.
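The reduction described here, transforming linearly correlated dimensions into a linearly independent representation that extracts the principal feature components, matches principal component analysis, although the patent does not name a specific technique. The following PCA-style sketch is therefore one plausible reading; the class name `SimplePCA` and its interface are assumptions of this sketch. Its inverse transform is the kind of "dimension raising" the method applies to the centers later:

```python
import numpy as np

class SimplePCA:
    """PCA-style sketch: keep the top-d principal axes of the data."""
    def __init__(self, d):
        self.d = d  # target (reduced) dimensionality

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # right singular vectors of the centered data are the principal axes
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:self.d]  # d x D transformation matrix
        return self

    def transform(self, X):
        # D -> d: project onto the linearly independent principal axes
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, Y):
        # d -> D: the "dimension raising" back to the original space
        return Y @ self.components_ + self.mean_
```

When a dimension is an exact linear combination of others, the top-d axes capture the data exactly and the inverse transform reconstructs it without loss.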
Evaluate the degree of dispersion of each dimension of the data set X, i.e., compute the variance of the data in each dimension. Let x_ij denote the value of the j-th dimension of the i-th data point, and let X_j denote the values of all data points in the j-th dimension.
The degree of dispersion is computed as follows:
Var(X_j) = (1/N) * Σ_{i=1}^{N} (x_ij - mean(X_j))^2
where Var(·) denotes the variance. After computing the variance of each dimension of X, take the dimension with the largest variance as the most discrete dimension.
Denote the data of the most discrete dimension by S (S is an N × 1 vector).
Compute the mean of the vector S: Ms = mean(S).
Divide the data points whose value in S is greater than Ms into the first box, and the data points whose value in S is less than Ms into the second box.
Count the number of data points in each box, and select the box containing the most data points as the data set to be operated on next, denoted box_max.
Following [0020]-[0026], continue splitting box_max into two boxes, and repeat the steps above until the number of boxes equals the cluster number K.
Compute the mean of the data in each box, M_b = [M_b1, M_b2, ..., M_bK]^T, where M_b is a K × d matrix.
Using the dimension-reduction transform of [0019], raise M_b back to the dimensionality of the original data to obtain C_b = [C_b1, C_b2, ..., C_bK]^T, where C_b is a K × D matrix.
This C_b then serves as the initial centers for K-means clustering.
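The splitting procedure of [0020]-[0026] can be sketched as follows, operating directly on the (already reduced) N × d data. The function name `discrete_dim_init` is illustrative, and assigning points that fall exactly on the mean to the first box is an assumption of this sketch, since the patent leaves that case unspecified:

```python
import numpy as np

def discrete_dim_init(X, k):
    """Repeatedly split the largest box at the mean of its most discrete
    (highest-variance) dimension; return the K box means as initial centers."""
    boxes = [X]
    while len(boxes) < k:
        # select the box containing the most data points (box_max)
        i = max(range(len(boxes)), key=lambda j: len(boxes[j]))
        box = boxes.pop(i)
        dim = box.var(axis=0).argmax()   # relatively most discrete dimension
        ms = box[:, dim].mean()          # mean point Ms of that dimension
        # split the box into two at the mean point
        boxes.append(box[box[:, dim] <= ms])
        boxes.append(box[box[:, dim] > ms])
    # mean of the data in each box: the K x d matrix M_b
    return np.array([b.mean(axis=0) for b in boxes])
```

If the data was dimension-reduced beforehand, the returned K × d means would still need the inverse (dimension-raising) transform to give C_b in the original D-dimensional space.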
Brief description of the drawings
Fig. 1 is a flow chart of the K-means cluster initial center selection method based on relatively most discrete dimension splitting in an embodiment of the present invention.
Fig. 2 is a visualization of a certain two-dimensional data set.
Fig. 3 compares the initial center points selected by the initial center selection method of the present invention with the cluster centers after clustering: the small red circles represent the initial center points chosen by the method of the invention, and the small black squares represent the centers of each class after clustering.
Fig. 4 is the clustering result obtained from the initial centers selected by the method of the invention.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
The embodiment and its flow are as follows:
For ease of visualization, two-dimensional data points are used as the implementation example: given a data set U (420 two-dimensional data points) and a cluster number of 4.
Because data set U is two-dimensional, no dimensionality reduction is performed; this does not affect the validity of the method for high-dimensional data after reduction.
Evaluate the relatively most discrete dimension of the two-dimensional data, and split the data set into two along that dimension.
Choose the one of the two resulting data sets that contains more data points, evaluate its relatively most discrete dimension, and split it again; at this point, 3 data sets are formed.
Continue in the same way: choose the data set containing the most points among the three, evaluate its relatively most discrete dimension, and split it into two. Four data sets are now formed, consistent with the number of clusters.
Compute the mean of each of these four data sets. Because no dimensionality reduction was applied to data set U at the start, no dimension-raising operation is needed here; the means of the four data sets are the K-means cluster initial centers. The initial centers chosen by the method of the invention are shown in Fig. 3.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of its protection.

Claims (4)

1. A K-means cluster center point initialization method, characterized in that the method comprises:
for any given D-dimensional data set containing N data points, transforming the data set into a representation in which the dimensions are linearly independent, usable for extracting the principal feature components of the data set, i.e., applying dimensionality reduction to the data set; then selecting the relatively most discrete dimension of the data set and splitting it at the mean point; after the required number of cluster classes is obtained by splitting, computing the mean of each resulting class and applying a dimension-raising operation, with the resulting data points serving as the initial center points for K-means clustering.
2. The K-means cluster center point initialization method according to claim 1, characterized in that the method of splitting according to the relatively most discrete dimension comprises:
evaluating the degree of dispersion of the data in each dimension; for the relatively most discrete dimension, taking the mean of that dimension as the threshold point, dividing the data set into two classes according to that dimension, and placing the data points of the two classes into two boxes respectively; then choosing from the two boxes the box with more data points as the next data set to be operated on, denoted box_max; for box_max, computing the variance of each dimension again, choosing the most discrete dimension by the method described above, and splitting box_max along that dimension into two boxes again; repeating the above procedure until the number of boxes equals the cluster number K.
3. The relatively most discrete splitting method according to claim 2, characterized in that the relatively-most-discrete selection comprises:
before splitting the data, normalizing the data in each dimension (because the dimensions may not be of the same order of magnitude); after normalization, computing the variance of each dimension, and selecting the dimension with the largest variance as the relatively most discrete dimension.
4. The K-means cluster center point initialization method according to claim 1, characterized in that the dimension reduction and dimension raising method comprises:
linearly correlated dimensions may exist in the original data set; by transforming the linearly correlated dimensions into a linearly independent representation, the principal feature components of the data are extracted; the transformation matrix used in the dimensionality reduction can later be used in a dimension-raising operation to restore the initial cluster center points obtained after reduction to the original dimensionality D.
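The normalize-then-take-largest-variance selection of claim 3 can be sketched as follows. Min-max normalization is an assumption of this sketch (the claim does not fix a particular normalization), and the function name `most_discrete_dim` is illustrative:

```python
import numpy as np

def most_discrete_dim(X):
    """Pick the relatively most discrete dimension: normalize each dimension
    first (dimensions may differ by orders of magnitude), then return the
    index of the dimension with the largest variance."""
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                   # avoid division by zero for constant dims
    Z = (X - X.min(axis=0)) / span          # per-dimension min-max normalization
    return Z.var(axis=0).argmax()
```

Without normalization, a large-scale but weakly structured dimension could dominate the raw variance comparison; normalizing first makes the comparison scale-free.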
CN201710844898.4A 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting Pending CN107704872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710844898.4A CN107704872A (en) 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710844898.4A CN107704872A (en) 2017-09-19 2017-09-19 A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Publications (1)

Publication Number Publication Date
CN107704872A true CN107704872A (en) 2018-02-16

Family

ID=61172890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710844898.4A Pending CN107704872A (en) A K-means cluster initial center selection method based on relatively most discrete dimension splitting

Country Status (1)

Country Link
CN (1) CN107704872A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717465A (en) * 2018-06-04 2018-10-30 哈尔滨工程大学 Subgroup based on user behavior analysis finds method
CN109271555A (en) * 2018-09-19 2019-01-25 上海哔哩哔哩科技有限公司 Information cluster method, system, server and computer readable storage medium
CN111737469A (en) * 2020-06-23 2020-10-02 中山大学 Data mining method and device, terminal equipment and readable storage medium
CN113780404A (en) * 2020-01-14 2021-12-10 支付宝(杭州)信息技术有限公司 Resource data processing method and device


Similar Documents

Publication Publication Date Title
CN107704872A (en) A K-means cluster initial center selection method based on relatively most discrete dimension splitting
CN109871860A (en) A daily load curve dimensionality-reduction clustering method based on kernel principal component analysis
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN107291895B (en) Quick hierarchical document query method
Pardeshi et al. Improved k-medoids clustering based on cluster validity index and object density
CN108846338A (en) Polarization characteristic selection and classification method based on object-oriented random forest
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN107316053A (en) A kind of cloth image Rapid matching search method
CN106156374A (en) An image retrieval method based on visual dictionary optimization and query expansion
JP4937395B2 (en) Feature vector generation apparatus, feature vector generation method and program
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics
CN103778206A (en) Method for providing network service resources
Tidake et al. Multi-label classification: a survey
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
CN107391594A (en) A kind of image search method based on the sequence of iteration vision
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
CN111080351A (en) Clustering method and system for multi-dimensional data set
Mohammed et al. Weight-based firefly algorithm for document clustering
Vijay et al. Hamming distance based clustering algorithm
Jiang et al. A hybrid clustering algorithm
Devi et al. A proficient method for text clustering using harmony search method
Hu et al. A Novel clustering scheme based on density peaks and spectral analysis
Altintakan et al. An improved BOW approach using fuzzy feature encoding and visual-word weighting
CN107423379B (en) Image search method based on CNN feature words tree
Duchscherer Classifying Building Usages: A Machine Learning Approach on Building Extractions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180216

WD01 Invention patent application deemed withdrawn after publication