CN109063769B

CN109063769B - Clustering method, system and medium for automatically determining cluster number based on coefficient of variation

Info

Publication number: CN109063769B
Application number: CN201810864958.3A
Authority: CN
Inventors: 刘腾腾; 曲守宁; 张坤; 杜韬; 王凯; 郭庆北; 朱连江; 王钦
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2021-04-09
Anticipated expiration: 2038-08-01
Also published as: CN109063769A

Abstract

The invention discloses a clustering method, a system and a medium for automatically confirming cluster quantity based on a coefficient of variation, wherein the density value of each data point in a data set is calculated, the density index is calculated according to the density value, and the data point with the maximum density index is selected as a first clustering center; calculating the shortest distance between each data point and the current existing clustering center, then calculating the probability of selecting each data point as the clustering center according to the shortest distance, and preselecting the clustering centers according to a wheel disc method; until a set clustering center is selected, performing k-means clustering according to the selected initial clustering center to generate clusters with corresponding number; calculating the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, then calculating the difference value between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, comparing the difference value with a set value, and merging the two clusters with the minimum inter-cluster variation coefficient if the difference value is smaller than the set value; and outputting the clustering result until the difference value is greater than or equal to the set value.

Description

Clustering method, system and medium for automatically determining cluster number based on coefficient of variation

Technical Field

The invention relates to a clustering method, a system and a medium for automatically confirming cluster quantity based on a variation coefficient.

Background

With the rapid development of information technology, a large amount of data materials stored in different forms are accumulated in many industries such as businesses, enterprises, scientific research institutions and government departments, and various useful information is often hidden in the large amount of data and is difficult to obtain only by means of a query retrieval mechanism and a statistical method of a database, so that a data mining technology is rapidly developed, and a cluster analysis technology is an important research field in data mining and has been widely used in many applications including pattern recognition, data analysis, image processing and market research.

The clustering analysis technique is an unsupervised learning method in which a partition-based clustering algorithm is simple and can be used for various data types, but the number of clusters needs to be set in advance and is sensitive to initial clustering centers, and the k-means + + algorithm is an improvement over the conventional k-means algorithm, but the defect that the number of clusters is set manually still exists.

Disclosure of Invention

In order to solve the defects of the prior art, the invention provides a clustering method, a system and a medium for automatically confirming the cluster number based on a variation coefficient, which solve the defects that the traditional k-means + + clustering algorithm manually sets the cluster number and the initial centroid is not properly selected, improve the k-means + + clustering algorithm based on division by using the concepts of the variation coefficient and the density index, do not need to manually set the cluster number, and ensure the accuracy of a clustering result;

in order to solve the technical problems, the invention adopts the following technical scheme:

as a first aspect of the present invention, there is provided a clustering method of automatically confirming the number of clusters based on a coefficient of variation;

the clustering method for automatically confirming the cluster number based on the coefficient of variation comprises the following steps:

step (1): calculating the density value of each data point in the data set, calculating a density index according to the density value, and selecting the data point with the maximum density index as a first clustering center;

step (2): calculating the shortest distance between each data point and the current existing clustering center, then calculating the probability of selecting each data point as the clustering center according to the shortest distance, and finally preselecting the clustering center according to a wheel disc method; the density index of the preselected clustering center is greater than a set threshold;

and (3): repeating the step (2) until a set number of clustering centers are selected, and then performing k-means clustering according to the selected initial clustering centers to generate clusters with corresponding numbers;

and (4): calculating the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, then calculating the difference value between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, comparing the difference value with a set value, and merging the two clusters with the minimum inter-cluster variation coefficient if the difference value is smaller than the set value; and (5) repeating the step (4) until the difference value is larger than or equal to the set value, and outputting a clustering result.

Further, the step of calculating the density value of each data point in the data set comprises:

hypothesis data set (S)₁，S₂，…，S_d) Has d-dimensional property, and data space S ═ S₁×S₂×…×S_dIs a d-dimensional data space, x ∈ (x)₁,x₂,…,x_d) Representing data points in the data set on a d-dimensional data space.

First, the number k of initial clusters is set^*(k₁<k^*<k₂) A value of (a), wherein k₁And k₂Are each greater than the number of target clusters.

Then, the density value ρ of the data point x is calculated_xAnd expressed by equations (1) and (2):

where num is the number of data points, d_xy(f) is a function that determines whether the distance between the data point y and the data point x is less than or equal to the density range R;

further, calculating a density index according to the density value, and selecting a data point with the maximum density index as a first clustering center; comprises the following steps:

according to density value rho_xCalculate the data density index di (density index) and take the data point with the highest density index as the first cluster center:

further, the step of calculating the shortest distance between each data point and the current existing clustering center is as follows:

according to the mode of selecting the initial clustering center in the k-means + + algorithm, for the rest data points in the data set, sequentially calculating the distance between the data point and the selected initial clustering center, and comparing and selecting the shortest distance as the shortest distance D (x) between the data point and the current existing clustering center.

Further, the step of calculating the probability of each data point being selected as the cluster center according to the shortest distance comprises:

wherein, D (x) represents the shortest distance between each data point and the current existing cluster center; p (x) represents the probability of each data point being selected as a cluster center;

further, the step of preselecting the clustering center according to the roulette method comprises the following steps:

setting a threshold value tau, wherein only when the density index of the preselected clustering center reaches tau, the preselected clustering center can be used as a formal clustering center, otherwise, a new data point is reselected as the clustering center; repeating the roulette method until k is selected^*And (4) clustering centers.

Further, the step of calculating the average intra-cluster coefficient of variation is:

first, the intra-cluster coefficient of variation CV for each cluster is calculated_i：

Then, the mean intra-cluster coefficient of variation is calculated

Wherein, mu_iIs the centroid of cluster i, m_iNumber of data points for cluster i, x_jIs the jth data point, k, in cluster i^*Indicating the number of preselected cluster centers.

Since a larger coefficient of variation indicates more discrete data points, the degree of cluster aggregation is reflected by calculating the intra-cluster coefficient of variation.

Further, the step of calculating the minimum inter-cluster variation coefficient is:

first, the inter-cluster variation coefficient CV is calculated_ij：

Then, the minimum inter-cluster variation coefficient D is calculated_min：

D_min＝min{CV_ij,i＝1,2,…,k^*,j＝1,2,…,k^*} (8)

Wherein m is_ijNumber of data points, μ, for clusters i and j_ijIs the centroid of cluster i and cluster j, x_lThe ith data point in cluster i and cluster j.

Further, calculating a difference value between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, comparing the difference value with a set value, and merging the two clusters with the minimum inter-cluster variation coefficient if the difference value is smaller than the set value; if the difference value is larger than or equal to the set value, the step of outputting the clustering result is as follows:

calculating the difference value T between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, and judging whether cluster combination needs to be carried out according to the difference value:

if T<0, i.e.

Merging two clusters with the minimum inter-cluster variation coefficient;

if T is greater than or equal to 0, that is

When 0 is less than or equal to T<When epsilon, merging two clusters with the minimum inter-cluster variation coefficient;

and when epsilon is less than or equal to T, outputting the number of clusters and the data point corresponding to each cluster.

As a second aspect of the present invention, there is provided a clustering system that automatically confirms the number of clusters based on a coefficient of variation;

a clustering system for automatically determining the number of clusters based on the coefficient of variation, comprising: the computer program product comprises a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.

As a third aspect of the present invention, there is provided a computer-readable storage medium;

a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.

Compared with the prior art, the invention has the beneficial effects that:

the k-means + + clustering algorithm based on division is improved by using the concepts of the variation coefficient and the density index, the number of clusters does not need to be manually set, and the accuracy of a clustering result is also ensured.

The data point with the maximum density index is selected as the first clustering center, and the clustering algorithm based on division is sensitive to the selection of the initial centroid, so that abnormal values in the data set can be effectively avoided.

The improved clustering algorithm for automatically confirming the cluster number optimizes the confirmation of the cluster number and the selection of the initial centroid by utilizing the concept of the coefficient of variation, greatly improves the clustering quality, and can be effectively applied to the clustering analysis of data.

And the intra-cluster variation coefficient is used for representing the intra-cluster cohesion degree of the clusters, the inter-cluster variation coefficient is used for representing the inter-cluster separation degree of the clusters, and when the cohesion degree and the separation degree reach the maximum, the clustering effect is optimal.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

Fig. 1 is a flow chart of a clustering algorithm for automatically determining the number of clusters based on the coefficient of variation.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

As shown in fig. 1, the clustering method for automatically determining the number of clusters based on the coefficient of variation includes:

step 1: calculating each data point in the data setDensity value ρ_xThe density index DI is calculated from the density values and the data point with the highest density index is selected as the first cluster center.

Step 2: calculating the shortest distance D (x) between each data point and the current existing clustering center, then calculating the probability P (x) of each data point being selected as the next clustering center according to the distance, finally selecting the preselected clustering center according to a roulette method, and when the density index of the preselected clustering center reaches a threshold value tau, using the preselected clustering center as a new clustering center, or recalculating and selecting.

Step 3: step2 is repeated until k is selected^*(k₁<k^*<k₂) Individual clustering centers and performing k-means clustering to generate k^*And (4) clustering.

Step 4: calculating the mean intra-cluster coefficient of variation

And minimum inter-cluster coefficient of variation D_minObtaining a difference value T if T<0, i.e.

Merging the two clusters with smaller separation degree; if T is greater than or equal to 0, that is

When 0 is less than or equal to T<And when epsilon is less than or equal to T, the clustering effect is optimal.

Step 5: and (5) circularly executing Step4 until the clustering effect is optimal.

Firstly, an initial clustering center is selected by utilizing the concept of density index, and the clustering quality is improved. The initial clustering center is selected by calculating the density value of each data point, calculating the density index according to the density value, selecting the data point with the maximum density index as the first clustering center, then calculating the probability that the data point is selected as the next clustering center according to the distance between the data point and the existing clustering center, confirming other clustering centers, and finally performing a k-means algorithm to form the initial clustering when the density index of the clustering center reaches a certain threshold value.

The topic of the conference paper is various, so that cluster analysis needs to be performed on the conference paper to gather papers with similar topics. But we do not know the specific number of categories at first, so in order to obtain high-quality clustering effect, the proposed clustering algorithm for automatically confirming the number of clusters is applied to this. The NIPS conference paper from 1987 to 2015 is taken as an experimental data set, and clustering analysis is mainly performed on the conference paper according to the number of times of using English words in the conference paper in the data set. The data set has 11463-dimensional attributes and 5811 sample data, and the data space S ═ S₁×S₂×…×S₁₁₄₆₃Is a data space of 11463 dimensions, x ∈ (x)₁,x₂,…,x₅₈₁₁) Showing the number of occurrences of each word in one of the NIPS conference papers.

Confirming the number of categories of the initial conference paper, and randomly confirming k^*(k₁<k^*<k₂) A value of (a), wherein k₁And k₂Are all significantly larger values than the number of categories of the target meeting paper.

Computing a meeting paper data set (S)₁，S₂，…，S₁₁₄₆₃) Density value ρ of Mediterranean paper x_xThat is, the difference degree between the conference papers x is smaller than or equal to the number of conference papers in the density range,

where num is the number of meeting papers, d_xyThe difference between the conference paper y and the conference paper x in the conference paper data set is R is the density range, and f (X) is a function for judging whether the difference between the conference paper y and the conference paper x is less than or equal to the density range R.

Density value ρ from each meeting paper_xCalculating the density index DI (Density index) and using the meeting paper with the largest density index as the first clustering center, namely DI_maxAnd is expressed by the formula (3),

the meeting discussion with the largest density index is selected as the first clustering center, because the clustering algorithm based on division is sensitive to the selection of the initial centroid, and abnormal discussion data can be effectively avoided by selecting the meeting discussion with the larger density as the clustering center, so that the clustering quality is improved.

Calculating the minimum difference degree D (x) of each conference paper and the current existing clustering center, then calculating the probability of each conference paper being selected as the next clustering center according to the difference degree,

for the selection of the initial clustering center, the conference papers with larger mutual difference should be selected as the clustering center, so that the probability that each conference paper is selected as the clustering center is calculated, and the larger the difference with the existing clustering center is, the larger the probability that the conference paper is selected as the clustering center is, so that the selected clustering center is relatively discrete.

And selecting a preselected clustering center according to the probability by a roulette method, setting a threshold tau as the clustering algorithm based on division is sensitive to abnormal values, and taking the preselected clustering center as a formal clustering center only when the density index of the preselected clustering center reaches the threshold tau, otherwise, reselecting a new conference paper as the clustering center. This process is repeated until k is selected^*A cluster center based on the obtained k^*Performing a conventional k-means algorithm to form k at an initial clustering center^*And (4) clustering.

Number of paper categories k due to initial selection^*Is obviously larger than the target k value, so the number of clusters needs to be mergedReduced to k, but the number of article categories of the target is not known at first, so the concept of coefficient of variation is introduced to determine when to stop the merging of clusters. By calculating k^*And determining whether the category number of the thesis is optimal or not by the relation between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient of each cluster, namely, representing the intra-cluster cohesion degree of the clusters by using the intra-cluster variation coefficient, representing the inter-cluster separation degree of the clusters by using the inter-cluster variation coefficient, and when the cohesion degree and the separation degree are both maximized, obtaining the optimal clustering effect.

The concept of the variation coefficient is introduced, the variation coefficient is a statistic for representing the data distribution condition and is used for reflecting the dispersion degree of the data, the advantage is that the average value of the data does not need to be referred, the data is a dimensionless quantity, when two groups of data with different dimensions or different average values are compared, the variation coefficient rather than the standard deviation is used as a reference for comparison, and therefore, the threshold value for calculating the cluster number by adopting the variation coefficient is suitable for all types of data sets.

It means the ratio of the variation index of a set of data to the average index thereof, i.e., the ratio of the standard deviation sigma to the average value mu, and is expressed by the formulas (5) and (6),

calculating the intra-cluster variation coefficient of each cluster according to the variation coefficient, and then averaging the intra-cluster variation coefficients

And expressed by equations (7) and (8),

wherein, mu_iIs the centroid of cluster i, m_iNumber of meeting papers for cluster i, x_jIs the jth meeting paper in cluster i. Since a larger coefficient of variation indicates a more dispersed distribution of the conference paper, how well the degree of aggregation of each cluster is reflected by calculating the intra-cluster coefficient of variation.

Calculating inter-cluster variation coefficient between any two clusters according to the variation coefficient, and then obtaining minimum value D of the inter-cluster variation coefficient_minAnd expressed by formulas (9) and (10),

D_min＝min{CV_ij,i＝1,2,…,k^*,j＝1,2,…,k^*} (10)

wherein m is_ijThe sum of the number of meeting papers for cluster i and cluster j, μ_ijIs the centroid of cluster i and cluster j, x_lThe first meeting paper in cluster i and cluster j. The separation degree of the two clusters is reflected by calculating the variation coefficient among the clusters.

Calculating the difference value T between the variation coefficient in the average cluster and the variation coefficient between the minimum clusters, judging whether cluster combination is needed or not according to the difference value,

if T<0, i.e.

Indicating the presence of two clusters with a smaller inter-cluster variation coefficient. The smaller the inter-cluster variation coefficient, the more cohesive the distribution of the meeting papers in the two clusters, the lower the degree of separation; because the number of the clusters which are initially arranged is larger than that of the target clusters, the variation coefficients in the average clusters are smaller, the variation amplitude is smaller, the degree of agglomeration of each cluster is higher, and the method has the advantages thatOnly a merging of clusters needs to be performed. The strategy of merging is to merge two clusters with the least separation, i.e. the inter-cluster variation coefficient is D_minTwo clusters of (a).

If T is greater than or equal to 0, that is

When 0 is less than or equal to T<When epsilon is generated, the difference value is smaller, which indicates that two clusters with smaller inter-cluster variation coefficient exist, the closer the inter-cluster variation coefficient and the intra-cluster variation coefficient are, the more the distribution of the conference papers in the two clusters is aggregated, the lower the separation degree is, and the higher the aggregation degree of each cluster is, the cluster merging needs to be performed; when epsilon is less than or equal to T, a certain difference exists, which indicates that the inter-cluster variation coefficients are large, the larger the difference between the inter-cluster variation coefficients and the intra-cluster variation coefficients is, the more discrete the distribution of the conference papers in the two clusters is, the larger the separation degree is, and simultaneously, the higher the cohesion degree of each cluster is, when the separation degrees among all the clusters reach a certain degree, the good clustering effect is achieved, and the best number of the conference paper categories can be obtained.

If the clusters are merged, the average intra-cluster variation coefficient needs to be recalculated

And minimum inter-cluster coefficient of variation D_minAnd then judging whether the optimal clustering effect is achieved or not according to the difference value of the two, otherwise, continuing to merge clusters, and circularly executing the process until the termination condition is achieved.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A clustering method for automatically confirming the number of meeting paper clusters based on variation coefficients is characterized by comprising the following steps:

taking NIPS conference paper from 1987 to 2015 as an experimental data set according to the meetings in the data setClustering analysis is carried out on the conference paper by using the times of English words in the discussion paper, the data set has 11463-dimensional attributes and 5811 sample data, and the data space S is S ═ S₁×S₂×…×S₁₁₄₆₃Is a data space of 11463 dimensions, x ∈ (x)₁，x₂，...，x₅₈₁₁) Representing the occurrence number of each word in one NIPS conference paper;

step (1) confirming the number of the categories of the initial conference paper and randomly confirming k^*(k₁＜k^*＜k₂) A value of (a), wherein k₁And k₂Are each a value greater than the number of categories of the target meeting paper,

where num is the number of meeting papers, d_xyThe difference degree between the conference paper y and the conference paper x in the conference paper data set is determined, R is a density range, f (X) is a function for judging whether the difference degree between the conference paper y and the conference paper x is less than or equal to the density range R, and according to the density value rho of each conference paper_xCalculating the density index DI, and using the meeting article with the highest density index as the first clustering center, i.e. DI_maxThe number of the atoms, expressed as,

step (2), calculating the minimum difference degree D (x) between each conference paper and the current existing clustering center, then calculating the probability of each conference paper being selected as the next clustering center according to the difference degree,

and (3) selecting a preselected clustering center according to the probability by a wheel disc method, setting a threshold tau, taking the preselected clustering center as a formal clustering center only when the density index of the preselected clustering center reaches the tau, otherwise, reselecting a new conference paper as the clustering center, and repeating the wheel disc method until k is selected^*A cluster center based on the obtained k^*An initial clustering center, performing k-means clustering to form k^*Clustering;

step (4), calculating the intra-cluster variation coefficient of each cluster, and then calculating the average value of the intra-cluster variation coefficients

As indicated by the general representation of the,

wherein, mu_iIs the centroid of cluster i, m_iNumber of meeting papers for cluster i, x_jFor the jth meeting paper in cluster i,

calculating the inter-cluster variation coefficient between any two clusters, and then finding the minimum value D of the inter-cluster variation coefficient_minThe number of the atoms, expressed as,

D_min＝min{CV_ij，i＝1，2，…，k^*，j＝1，2，…，k^*}；

wherein m is_ijThe sum of the number of meeting papers for cluster i and cluster j, μ_ijIs the centroid of cluster i and cluster j, x_lCalculating the difference value T between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient for the 1 st meeting paper in the cluster i and the cluster j, judging whether cluster combination needs to be carried out or not according to the difference value,

if T is less than 0, i.e.

Merging two clusters with the minimum inter-cluster variation coefficient;

if T is greater than or equal to 0, that is

When T is more than or equal to 0 and less than epsilon, merging two clusters with the minimum inter-cluster variation coefficient;

when epsilon is less than or equal to T, the optimal clustering effect is achieved, the number of clusters and the data points corresponding to each cluster are output,

2. A clustering system for automatically confirming the number of meeting paper clusters based on coefficient of variation comprises: a memory, a processor, and computer instructions stored on the memory and executed on the processor, which when executed by the processor, perform the steps of claim 1.

3. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of claim 1.