CN109063769A

CN109063769A - Clustering method, system and the medium of number of clusters amount are automatically confirmed that based on the coefficient of variation

Info

Publication number: CN109063769A
Application number: CN201810864958.3A
Authority: CN
Inventors: 刘腾腾; 曲守宁; 张坤; 杜韬; 王凯; 郭庆北; 朱连江; 王钦
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2018-12-21
Anticipated expiration: 2038-08-01
Also published as: CN109063769B

Abstract

The invention discloses clustering method, system and media that number of clusters amount is automatically confirmed that based on the coefficient of variation, calculate the density value of each data point in data set, calculate dnesity index according to density value, select the maximum data point of dnesity index as first cluster centre；The shortest distance between each data point and current existing cluster centre is calculated, the probability of cluster centre is then chosen as according to each data point of minimum distance calculation, preselects cluster centre according to wheel disc method；Until selecting setting cluster centre, k-means cluster is carried out to generate the cluster of corresponding number according to the initial cluster center selected；The coefficient of variation between the coefficient of variation and most tuftlet in calculating mean cluster, then, the difference for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, difference is compared with setting value, if difference is less than setting value, the smallest two clusters of the coefficient of variation between two clusters are merged；Until difference is more than or equal to setting value, then cluster result is exported.

Description

Clustering method, system and the medium of number of clusters amount are automatically confirmed that based on the coefficient of variation

Technical field

The present invention relates to clustering method, system and media that number of clusters amount is automatically confirmed that based on the coefficient of variation.

Background technique

With the fast development of information technology, many industries such as business, enterprise, scientific research institution and government department are all accumulated Magnanimity, different form storage data information often implies various useful information in these mass datas, only The query and search mechanism and statistical method for only relying on database are difficult to obtain these information, therefore the also data rapidly developed Digging technology, Clustering Analysis Technology are an important fields of research in data mining, have been widely used in many and have answered In, including pattern-recognition, data analysis, image procossing and market survey.

Clustering Analysis Technology is a kind of unsupervised learning method, wherein the clustering algorithm based on division is simple and can be with For various data types, but need the quantity of setting cluster and, k-means++ algorithm pair sensitive to initial cluster center in advance Traditional k-means algorithm is improved, but still haves the defects that the quantity of cluster is manually arranged.

Summary of the invention

In order to solve the deficiencies in the prior art, the present invention provides the cluster sides that number of clusters amount is automatically confirmed that based on the coefficient of variation Method, system and medium, solve traditional k-means++ clustering algorithm be manually arranged cluster quantity and initial mass center choose not When defect, the k-means++ clustering algorithm based on division is changed using the concept of the coefficient of variation and dnesity index Into also ensuring the accuracy of cluster result it is not necessary that the quantity of cluster is manually arranged；

In order to solve the above-mentioned technical problem, the present invention adopts the following technical scheme:

As the first aspect of the present invention, the clustering method that number of clusters amount is automatically confirmed that based on the coefficient of variation is provided；

The clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation, comprising:

Step (1): calculating the density value of each data point in data set, calculates dnesity index according to density value, selects close The maximum data point of index is spent as first cluster centre；

Step (2): the shortest distance between each data point and current existing cluster centre is calculated, then according to most short distance From the probability that each data point of calculating is chosen as cluster centre, finally, preselecting cluster centre according to wheel disc method；The pre-selection cluster The dnesity index at center is greater than given threshold；

Step (3): repeating step (2), until the cluster centre of setting number is selected, it is then initial according to selecting Cluster centre carries out k-means cluster to generate the cluster of corresponding number；

Step (4): then the coefficient of variation between the coefficient of variation and most tuftlet in calculating mean cluster calculates variation in mean cluster Difference is compared by the difference of the coefficient of variation between coefficient and most tuftlet with setting value, if difference is less than setting value, by two The smallest two clusters of the coefficient of variation merge between a cluster；It repeats step (4), until difference is more than or equal to setting value, then exports Cluster result.

Further, the step of calculating the density value of each data point in data set are as follows:

Assuming that data set (S₁, S₂..., S_d) there is d dimension attribute, and data space S=S₁×S₂×…×S_dIt is d dimension According to space, x ∈ (x₁,x₂,…,x_d) data point of the expression on d dimension data space in data set.

Firstly, the quantity k of setting initial cluster^*(k₁<k^*<k₂) value, wherein k₁And k₂It is the quantity greater than target cluster.

Then, the density value ρ of data point x is calculated_x, and indicated with formula (1) and (2):

Wherein, num is the number of data point, d_xyFor the distance of data point y in data set to data point x, R is density model It encloses, f (X) is to judge whether data point y is less than or equal to the function of density range R at a distance from data point x；

Further, dnesity index is calculated according to density value, selects the maximum data point of dnesity index poly- as first Class center；The step of are as follows:

According to density value ρ_xIt calculates packing density index D I (Density Index), and by the maximum data of dnesity index Point is used as first cluster centre:

Further, the step of calculating the shortest distance between each data point and current existing cluster centre are as follows:

According to the mode for selecting initial cluster center in k-means++ algorithm, for the remainder strong point in data set, according to Secondary calculating data point compares at a distance from the initial cluster center having been selected out and selects shortest distance as the data Shortest distance D (x) between point and current existing cluster centre.

Further, the step of probability of cluster centre being chosen as according to minimum distance calculation each data point are as follows:

Wherein, D (x) indicates the shortest distance between each data point and current existing cluster centre；P (x) indicates each Data point is chosen as the probability of cluster centre；

Further, the step of preselecting cluster centre according to wheel disc method are as follows:

Threshold tau is set, only when the dnesity index for preselecting cluster centre reaches τ, just can be used as formal cluster centre, Otherwise new data point is reselected as cluster centre；Wheel disc method is repeated always until selecting k^*A cluster centre.

Further, the step of calculating the coefficient of variation in mean cluster are as follows:

Firstly, calculating coefficient of variation CV in the cluster of each cluster_i:

Then, the coefficient of variation in mean cluster is calculated

Wherein, μ_iFor the mass center of cluster i, m_iFor the data point number of cluster i, x_jFor j-th of data point in cluster i, k^*Indicate pre- The number for the cluster centre selected.

Because the coefficient of variation is bigger to illustrate that data point is more discrete, so reflecting cluster by the coefficient of variation in calculating cluster The quality of condensation degree.

Further, between calculating most tuftlet the step of the coefficient of variation are as follows:

Firstly, calculating coefficient of variation CV between cluster_ij:

Then, coefficient of variation D between calculating most tuftlet_min:

D_min=min { CV_ij, i=1,2 ..., k^*, j=1,2 ..., k^*} (8)

Wherein, m_ijFor the number of data points of cluster i and cluster j, μ_ijFor the mass center of cluster i and cluster j, x_lFor the l in cluster i and cluster j A data point.

Further, the difference for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, by difference and setting value It is compared, if difference is less than setting value, the smallest two clusters of the coefficient of variation between two clusters is merged；If difference More than or equal to setting value, then the step of exporting cluster result are as follows:

The difference T for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, according to difference judge whether to need into The merging of row cluster:

If T < 0, i.e.,The smallest two clusters of the coefficient of variation between merging cluster；

If T >=0, i.e.,As 0≤T < ε, merge the smallest two clusters of the coefficient of variation between cluster；

As ε≤T, export cluster quantity and each cluster corresponding to data point.

As a second aspect of the invention, the clustering system that number of clusters amount is automatically confirmed that based on the coefficient of variation is provided；

The clustering system of number of clusters amount is automatically confirmed that based on the coefficient of variation, comprising: memory, processor and be stored in storage The computer instruction run on device and on a processor, when the computer instruction is run by processor, completes any of the above-described side Step described in method.

As the third aspect of the present invention, a kind of computer readable storage medium is provided；

A kind of computer readable storage medium, is stored thereon with computer instruction, and the computer instruction is transported by processor When row, step described in any of the above-described method is completed.

Compared with prior art, the beneficial effects of the present invention are:

The k-means++ clustering algorithm based on division is changed using the concept of the coefficient of variation and dnesity index Into also ensuring the accuracy of cluster result it is not necessary that the quantity of cluster is manually arranged.

It selects the maximum data point of dnesity index as first cluster centre, is due to the clustering algorithm pair based on division The selection of initial mass center is more sensitive, can effectively avoid the exceptional value in data set in this way.

The improved clustering algorithm for automatically confirming that number of clusters amount, using the coefficient of variation concept number of clusters amount confirmation and just It is all optimized in the selection of the prothyl heart, can have greatly improved on clustering result quality, data can be effectively applied to Clustering.

Condensation degree in the cluster of cluster is indicated with the coefficient of variation in cluster, and separating degree between the cluster of cluster is indicated with the coefficient of variation between cluster, when When condensation degree and separating degree reach maximum, Clustering Effect is optimal.

Detailed description of the invention

The accompanying drawings constituting a part of this application is used to provide further understanding of the present application, and the application's shows Meaning property embodiment and its explanation are not constituted an undue limitation on the present application for explaining the application.

Fig. 1 is the clustering algorithm flow chart that number of clusters amount is automatically confirmed that based on the coefficient of variation.

Specific embodiment

It is noted that following detailed description is all illustrative, it is intended to provide further instruction to the application.Unless another It indicates, all technical and scientific terms used herein has usual with the application person of an ordinary skill in the technical field The identical meanings of understanding.

It should be noted that term used herein above is merely to describe specific embodiment, and be not intended to restricted root According to the illustrative embodiments of the application.As used herein, unless the context clearly indicates otherwise, otherwise singular Also it is intended to include plural form, additionally, it should be understood that, when in the present specification using term "comprising" and/or " packet Include " when, indicate existing characteristics, step, operation, device, component and/or their combination.

As shown in Figure 1, automatically confirming that the clustering method of number of clusters amount based on the coefficient of variation, comprising:

Step1: the density value ρ of each data point in data set is calculated_x, dnesity index DI, selection are calculated according to density value The maximum data point of dnesity index is as first cluster centre.

Step2: the shortest distance D (x) between each data point and current existing cluster centre is calculated, then according to distance Size calculates the probability P (x) that each data point is chosen as next cluster centre, finally by wheel disc method selection pre-selection cluster The heart just can be used as new cluster centre, otherwise recalculate selection when the dnesity index for preselecting cluster centre reaches threshold tau.

Step3: repeating Step2, until selecting k^*(k₁<k^*<k₂) a cluster centre, and carry out k-means cluster and generate k^* A cluster.

Step4: the coefficient of variation in mean cluster is calculatedCoefficient of variation D between most tuftlet_min, difference T is obtained, if T < 0, i.e.,Lesser two clusters of separating degree are merged；If T >=0, i.e.,As 0≤T < ε, by separating degree compared with Two small clusters merge, and as ε≤T, Clustering Effect is optimal.

Step5: circulation executes Step4, until Clustering Effect is optimal.

First with the conceptual choice initial cluster center of dnesity index, clustering result quality is improved.The choosing of initial cluster center Select is to calculate dnesity index by the density value of each data point of calculating according to density value, select the maximum data of dnesity index Point is used as first cluster centre, and then basis calculates data point at a distance from existing cluster centre and is chosen as in next cluster The probability of the heart confirms other cluster centres with this, while the dnesity index of cluster centre will reach certain threshold value, most laggard Row k-means algorithm forms initial clustering.

The theme of meeting paper is varied, so needing to carry out clustering to meeting paper, will have similar topic Paper be brought together.But we are not aware that specific categorical measure at the beginning, in order to obtain the cluster effect of high quality Fruit, so the clustering algorithm for automatically confirming that number of clusters amount of proposition is applied to this.We were with NIPS meeting in 1987 to 2015 Argumentative writing is experimental data set, mainly according in the meeting paper in data set use English word number to meeting paper into Row clustering.The data set has 11463 dimension attributes and 5811 sample datas, and data space S=S₁×S₂×… ×S₁₁₄₆₃It is 11463 dimension data spaces, x ∈ (x₁,x₂,…,x₅₈₁₁) indicate that each word goes out occurrence in a NIPS meeting paper Several situations.

The quantity for confirming initial meeting paper classification, confirms k at random^*(k₁<k^*<k₂) value, wherein k₁And k₂It is obvious Greater than the value of target meeting category of paper quantity.

Calculate meeting paper data set (S₁, S₂..., S₁₁₄₆₃) in meeting paper x density value ρ_x, i.e., with meeting paper x's Diversity factor is less than or equal to the quantity of the meeting paper in density range,

Wherein, num is the quantity of meeting paper, d_xyFor the difference of meeting paper y and meeting paper x in meeting paper data set Different degree, R are density range, and f (X) is to judge whether meeting paper y and the diversity factor of meeting paper x are less than or equal to density range The function of R.

According to the density value ρ of every meeting paper_xIts dnesity index DI (Density Index) is calculated, and density is referred to The maximum meeting paper of number is as first cluster centre, i.e. DI_max, and indicated with formula (3),

The maximum meeting paper of dnesity index is selected as first cluster centre, to be because the cluster based on division is calculated Method is more sensitive to the selection of initial mass center, selects the biggish meeting paper of density that can effectively avoid as cluster centre different Normal paper data, to improve the quality of cluster.

The smallest diversity factor D (x) for calculating each meeting paper and current existing cluster centre, then according to diversity factor meter The probability that each meeting paper is chosen as next cluster centre is calculated,

Selection for initial cluster center, it should select the biggish meeting paper of mutual diversity factor as cluster centre, Therefore, the probability that each meeting paper is chosen as cluster centre is calculated, it is bigger with the diversity factor of existing cluster centre, then it is selected Probability as cluster centre is bigger, so that the cluster centre relative discrete selected.

According to probability by wheel disc method selection pre-selection cluster centre, since the clustering algorithm based on division is more quick to exceptional value Sense, so setting threshold tau just can be used as formal cluster centre only when the dnesity index for preselecting cluster centre reaches τ, Otherwise new meeting paper is reselected as cluster centre.This process is repeated always until selecting k^*A cluster centre, root According to obtained k^*A initial cluster center carries out traditional k-means algorithm and forms k^*A cluster.

Due to the category of paper quantity k of initial selected^*Significantly greater than target k value, so needing to carry out the merging of cluster for cluster Number be reduced to k, but be not aware that the category of paper quantity of target at the beginning, so introduce the concept of the coefficient of variation, Determine when the merging of stopping cluster.By calculating k^*In the mean cluster of a cluster between the coefficient of variation and most tuftlet the coefficient of variation relationship It determines whether the categorical measure of paper is optimal, i.e., indicates condensation degree in the cluster of cluster with the coefficient of variation in cluster, made a variation between cluster Coefficient indicates separating degree between the cluster of cluster, and when condensation degree and separating degree reach maximum, Clustering Effect is optimal.

The concept of the coefficient of variation is introduced, the coefficient of variation is to indicate a statistic of data distribution situation, for reflecting number According to dispersion degree, it is a characteristic that benefit, which is the average value for needing not refer to data, different comparing two groups of dimensions Or when the different data of mean value, it should use the coefficient of variation rather than standard deviation is as the reference compared, therefore use the coefficient of variation The threshold value for calculating number of clusters amount is suitable for all types of data sets.

It is meant that the ratio between the indicator of variation of one group of data and its average index, the i.e. ratio of standard deviation sigma and average value mu, And indicated with formula (5) and (6),

The coefficient of variation in the cluster of each cluster is calculated according to the coefficient of variation, then seeks the average value of the coefficient of variation in clusterIt is used in combination Formula (7) and (8) expression,

Wherein, μ_iFor the mass center of cluster i, m_iFor the quantity of the meeting paper of cluster i, x_jFor the jth piece meeting paper in cluster i.Cause For the bigger distribution for illustrating meeting paper of the coefficient of variation more disperses, so reflecting each cluster by the coefficient of variation in calculating cluster Condensation degree quality.

According to the coefficient of variation between the cluster between coefficient of variation calculating any two cluster, the minimum of the coefficient of variation between cluster is then sought Value D_min, and indicated with formula (9) and (10),

D_min=min { CV_ij, i=1,2 ..., k^*, j=1,2 ..., k^*} (10)

Wherein, m_ijFor the quantity and μ of the meeting paper of cluster i and cluster j_ijFor the mass center of cluster i and cluster j, x_lFor in cluster i and cluster j L meeting papers.Reflect the quality of two cluster separating degrees by the coefficient of variation between calculating cluster.

The difference T for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, according to difference judge whether to need into The merging of row cluster,

If T < 0, i.e.,Illustrate that there are lesser two clusters of the coefficient of variation between cluster.The coefficient of variation is smaller between cluster, and two The distribution of meeting paper in a cluster is more agglomerated, and separating degree is lower；Since the quantity of the cluster of initial setting up is greater than target cluster Quantity, so the coefficient of variation is smaller in mean cluster and amplitude of variation is smaller, the condensation degree of each cluster is higher, so only needing Carry out the merging of cluster.Combined strategy is to merge the smallest two clusters of separating degree, i.e., the coefficient of variation is D between cluster_minTwo Cluster.

If T >=0, i.e.,As 0≤T < ε, difference is smaller, illustrates there are the coefficient of variation between cluster lesser two Cluster, the coefficient of variation is more close in the coefficient of variation and cluster between cluster, and the distribution of meeting paper is more agglomerated in two clusters, and separating degree is got over It is low, and the condensation degree of each cluster is higher, then needs to carry out the merging of cluster；As ε≤T, there are certain difference, illustrate to become between cluster Different coefficient is bigger, and coefficient of variation difference is bigger in the coefficient of variation and cluster between cluster, the meeting paper distribution in two clusters more from It dissipating, separating degree is bigger, while the condensation degree of each cluster is higher, when the separating degree between all clusters reaches a certain level, this When Clustering Effect it is good, the quantity of optimal meeting paper classification can be obtained.

If carrying out the merging of cluster, need to recalculate the coefficient of variation in mean clusterCoefficient of variation D between most tuftlet_min, Then judged whether to be optimal Clustering Effect according to the difference of the two, otherwise continue the merging of cluster, circulation executes this mistake Journey, until reaching termination condition.

The foregoing is merely preferred embodiment of the present application, are not intended to limit this application, for the skill of this field For art personnel, various changes and changes are possible in this application.Within the spirit and principles of this application, made any to repair Change, equivalent replacement, improvement etc., should be included within the scope of protection of this application.

Claims

1. automatically confirming that the clustering method of number of clusters amount based on the coefficient of variation, characterized in that include:

Step (1): calculating the density value of each data point in data set, calculates dnesity index according to density value, density is selected to refer to The maximum data point of number is as first cluster centre；

Step (2): the shortest distance between each data point and current existing cluster centre is calculated, then according to shortest distance meter The probability that each data point is chosen as cluster centre is calculated, finally, preselecting cluster centre according to wheel disc method；The pre-selection cluster centre Dnesity index be greater than given threshold；

Step (3): repeating step (2), until the cluster centre of setting number is selected, then according to the initial clustering selected Center carries out k-means cluster to generate the cluster of corresponding number；

Step (4): in calculating mean cluster then the coefficient of variation between the coefficient of variation and most tuftlet calculates the coefficient of variation in mean cluster Difference is compared by the difference of the coefficient of variation between most tuftlet with setting value, if difference is less than setting value, by two clusters Between the smallest two clusters of the coefficient of variation merge；It repeats step (4), until difference is more than or equal to setting value, then exports cluster As a result.

2. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

The step of calculating the density value of each data point in data set are as follows:

Assuming that data set (S₁, S₂..., S_d) there is d dimension attribute, and data space S=S₁×S₂×…×S_dIt is that d dimension data is empty Between, x ∈ (x₁,x₂,…,x_d) data point of the expression on d dimension data space in data set；

Firstly, the quantity k of setting initial cluster^*Value, wherein k₁<k^*<k₂, k₁And k₂It is the quantity greater than target cluster；

Wherein, num is the number of data point, d_xyFor the distance of data point y in data set to data point x, R is density range, f It (X) is to judge whether data point y is less than or equal to the function of density range R at a distance from data point x.

3. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

Dnesity index is calculated according to density value, selects the step of maximum data point of dnesity index is as first cluster centre Are as follows:

According to density value ρ_xPacking density index D I is calculated, and using the maximum data point of dnesity index as first cluster centre:

4. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

The step of calculating the shortest distance between each data point and current existing cluster centre are as follows:

The mode of initial cluster center is selected successively to count the remainder strong point in data set according in k-means++ algorithm The data point is calculated at a distance from the initial cluster center having been selected out, compare select shortest distance as the data point with Shortest distance D (x) between current existing cluster centre.

5. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

The step of being chosen as the probability of cluster centre according to each data point of minimum distance calculation are as follows:

The step of preselecting cluster centre according to wheel disc method are as follows:

6. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

The step of calculating the coefficient of variation in mean cluster are as follows:

Then, the coefficient of variation in mean cluster is calculated

Wherein, μ_iFor the mass center of cluster i, m_iFor the data point number of cluster i, x_jFor j-th of data point in cluster i, k^*Expression is selected in advance Cluster centre number.

7. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

Between calculating most tuftlet the step of the coefficient of variation are as follows:

Firstly, calculating coefficient of variation CV between cluster_ij:

Then, coefficient of variation D between calculating most tuftlet_min:

D_min=min { CV_ij, i=1,2 ..., k^*, j=1,2 ..., k^*} (8)

Wherein, m_ijFor the number of data points of cluster i and cluster j, μ_ijFor the mass center of cluster i and cluster j, x_lFor first of number in cluster i and cluster j Strong point.

8. the clustering method of number of clusters amount is automatically confirmed that based on the coefficient of variation as described in claim 1, characterized in that

The difference for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, difference is compared with setting value, if Difference is less than setting value, then merges the smallest two clusters of the coefficient of variation between two clusters；If difference is more than or equal to setting The step of being worth, then exporting cluster result are as follows:

The difference T for calculating the coefficient of variation between the coefficient of variation and most tuftlet in mean cluster, judges whether to need to carry out cluster according to difference Merging:

9. automatically confirming that the clustering system of number of clusters amount based on the coefficient of variation, comprising: memory, processor and be stored in memory Computer instruction that is upper and running on a processor, when the computer instruction is run by processor, completes any of the above-described method The step.

10. a kind of computer readable storage medium, is stored thereon with computer instruction, the computer instruction is run by processor When, complete step described in any of the above-described method.