CN109063769B - Clustering method, system and medium for automatically determining cluster number based on coefficient of variation - Google Patents


Info

Publication number
CN109063769B
CN109063769B (application CN201810864958.3A)
Authority
CN
China
Prior art keywords
cluster
clustering
paper
clusters
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810864958.3A
Other languages
Chinese (zh)
Other versions
CN109063769A (en)
Inventor
刘腾腾
曲守宁
张坤
杜韬
王凯
郭庆北
朱连江
王钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan
Priority to CN201810864958.3A priority Critical patent/CN109063769B/en
Publication of CN109063769A publication Critical patent/CN109063769A/en
Application granted granted Critical
Publication of CN109063769B publication Critical patent/CN109063769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clustering method, system and medium that automatically determine the number of clusters based on the coefficient of variation. The density value of each data point in a data set is calculated, a density index is derived from the density values, and the data point with the largest density index is selected as the first cluster center. For each remaining data point, the shortest distance to the currently existing cluster centers is computed, the probability of the point being selected as a cluster center is derived from that distance, and candidate centers are drawn by the roulette-wheel method until the set number of cluster centers has been chosen; k-means clustering is then performed with the selected initial centers to generate the corresponding number of clusters. The average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation are calculated, and their difference is compared with a set value: if the difference is smaller than the set value, the two clusters with the minimum inter-cluster coefficient of variation are merged; once the difference is greater than or equal to the set value, the clustering result is output.

Description

Clustering method, system and medium for automatically determining cluster number based on coefficient of variation
Technical Field
The invention relates to a clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation.
Background
With the rapid development of information technology, businesses, enterprises, scientific research institutions and government departments have accumulated large amounts of data stored in many different forms. Useful information is often hidden in these data and is difficult to obtain by means of database query and retrieval mechanisms and statistical methods alone, which has driven the rapid development of data mining technology. Cluster analysis is an important research field within data mining and has been widely used in many applications, including pattern recognition, data analysis, image processing and market research.
Cluster analysis is an unsupervised learning method. Partition-based clustering algorithms are simple and applicable to various data types, but the number of clusters must be set in advance and the result is sensitive to the initial cluster centers. The k-means++ algorithm improves on the conventional k-means algorithm, but still has the defect that the number of clusters is set manually.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a clustering method, system and medium that automatically determine the number of clusters based on the coefficient of variation. It addresses the shortcomings of the traditional k-means++ clustering algorithm, namely the manually set number of clusters and the poor selection of initial centroids, by improving the partition-based k-means++ algorithm with the concepts of the coefficient of variation and the density index; the number of clusters no longer needs to be set manually, and the accuracy of the clustering result is ensured.
in order to solve the technical problems, the invention adopts the following technical scheme:
as a first aspect of the present invention, there is provided a clustering method of automatically confirming the number of clusters based on a coefficient of variation;
the clustering method for automatically confirming the cluster number based on the coefficient of variation comprises the following steps:
step (1): calculating the density value of each data point in the data set, calculating a density index according to the density value, and selecting the data point with the maximum density index as a first clustering center;
step (2): calculating the shortest distance between each data point and the currently existing cluster centers, then calculating the probability of each data point being selected as a cluster center according to that shortest distance, and finally preselecting a cluster center by the roulette-wheel method; the density index of the preselected cluster center must be greater than a set threshold;
and (3): repeating the step (2) until a set number of clustering centers are selected, and then performing k-means clustering according to the selected initial clustering centers to generate clusters with corresponding numbers;
and (4): calculating the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, then calculating the difference value between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, comparing the difference value with a set value, and merging the two clusters with the minimum inter-cluster variation coefficient if the difference value is smaller than the set value; and (5) repeating the step (4) until the difference value is larger than or equal to the set value, and outputting a clustering result.
Further, the step of calculating the density value of each data point in the data set comprises:
Assume the data set has d-dimensional attributes (S_1, S_2, …, S_d), so that the data space S = S_1 × S_2 × … × S_d is a d-dimensional data space and x = (x_1, x_2, …, x_d) represents a data point of the data set in this space.
First, the value of the number of initial clusters k* is set, with k_1 < k* < k_2, where k_1 and k_2 are each greater than the number of target clusters.
Then, the density value ρ_x of the data point x is calculated, expressed by equations (1) and (2):
ρ_x = Σ_{y=1..num} f(d_xy)   (1)

f(d_xy) = 1 if d_xy ≤ R, and 0 otherwise   (2)
where num is the number of data points, d_xy is the distance between data points x and y, R is the density range, and f is the function that determines whether the distance between the data point y and the data point x is less than or equal to the density range R;
further, calculating a density index according to the density value, and selecting a data point with the maximum density index as a first clustering center; comprises the following steps:
according to the density value ρ_x, the density index DI is calculated and the data point with the highest density index is taken as the first cluster center:
DI_x = ρ_x / Σ_{y=1..num} ρ_y   (3)
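The density computation above can be sketched as follows (an illustrative NumPy sketch, not code from the patent; normalising ρ_x by the total density in `first_center` is an assumption, since any monotone normalisation selects the same maximal point, and both function names are hypothetical):

```python
import numpy as np

def density_values(X, R):
    """rho_x: number of points y whose distance to x is <= the density range R
    (each point counts itself, since d(x, x) = 0 <= R)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    return (d <= R).sum(axis=1)

def first_center(X, R):
    """Index of the point with the largest density index, used as the first
    cluster center; the index is taken as rho_x over the total density."""
    rho = density_values(X, R)
    di = rho / rho.sum()
    return int(np.argmax(di))
```

With a tight group of points plus one outlier, the outlier's density is minimal, so it is never chosen as the first center.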
further, the step of calculating the shortest distance between each data point and the current existing clustering center is as follows:
according to the mode of selecting the initial clustering center in the k-means + + algorithm, for the rest data points in the data set, sequentially calculating the distance between the data point and the selected initial clustering center, and comparing and selecting the shortest distance as the shortest distance D (x) between the data point and the current existing clustering center.
Further, the step of calculating the probability of each data point being selected as the cluster center according to the shortest distance comprises:
P(x) = D(x)² / Σ_{x∈X} D(x)²   (4)

where D(x) represents the shortest distance between each data point and the currently existing cluster centers, and P(x) represents the probability of the data point being selected as the next cluster center;
further, the step of preselecting the clustering center according to the roulette method comprises the following steps:
setting a threshold value τ, wherein only when the density index of a preselected cluster center reaches τ can it be used as a formal cluster center; otherwise a new data point is reselected as the cluster center; the roulette-wheel selection is repeated until k* cluster centers are selected.
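The seeding procedure — shortest distance D(x), squared-distance probability P(x), roulette-wheel draw, density screen — can be sketched as follows (illustrative; `select_centers` is a hypothetical name, and screening the raw density values ρ against τ in place of the density index is an assumption):

```python
import numpy as np

def select_centers(X, k_star, rho, tau, rng):
    """k-means++-style seeding with a density screen: points are drawn with
    probability D(x)^2 / sum D(x)^2 (roulette wheel) and kept only if their
    density value passes the threshold tau."""
    centers = [int(np.argmax(rho))]  # first center: maximum density index
    while len(centers) < k_star:
        # D(x): shortest distance from each point to any chosen center
        d = np.min(np.linalg.norm(X[:, None, :] - X[np.array(centers)][None, :, :],
                                  axis=-1), axis=1)
        p = d**2 / (d**2).sum()          # P(x), the roulette-wheel probabilities
        cand = int(rng.choice(len(X), p=p))
        if rho[cand] >= tau and cand not in centers:
            centers.append(cand)
    return centers
```

Existing centers have D(x) = 0 and hence zero probability, so the draw naturally spreads the centers apart.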
Further, the step of calculating the average intra-cluster coefficient of variation is:
first, the intra-cluster coefficient of variation CV for each cluster is calculatedi
Figure BDA0001750686510000031
Then, the average intra-cluster coefficient of variation CV_avg is calculated:

CV_avg = (1/k*) Σ_{i=1..k*} CV_i   (6)
wherein μ_i is the centroid of cluster i, m_i is the number of data points in cluster i, x_j is the jth data point in cluster i, and k* is the number of preselected cluster centers.
Since a larger coefficient of variation indicates more discrete data points, the degree of cluster aggregation is reflected by calculating the intra-cluster coefficient of variation.
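The intra-cluster computation described above can be sketched as follows (an illustrative NumPy sketch, not the patent's code; reducing the vector centroid μ_i to the scalar ‖μ_i‖ is an assumption, since the source only writes σ/μ, and both function names are hypothetical):

```python
import numpy as np

def intra_cv(cluster, centroid):
    """CV_i: RMS deviation of the cluster's points from its centroid,
    divided by the centroid's norm (sigma / |mu|)."""
    sigma = np.sqrt(np.mean(np.sum((cluster - centroid)**2, axis=1)))
    return sigma / np.linalg.norm(centroid)

def mean_intra_cv(clusters):
    """Average intra-cluster coefficient of variation over all clusters."""
    return float(np.mean([intra_cv(c, c.mean(axis=0)) for c in clusters]))
```

A perfectly tight cluster has CV_i = 0; a more dispersed cluster raises the average, as the text notes.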
Further, the step of calculating the minimum inter-cluster variation coefficient is:
first, the inter-cluster variation coefficient CV is calculatedij
Figure BDA0001750686510000034
Then, the minimum inter-cluster coefficient of variation D_min is calculated:

D_min = min{CV_ij, i = 1,2,…,k*, j = 1,2,…,k*, i ≠ j}   (8)
wherein m_ij is the total number of data points in clusters i and j, μ_ij is the centroid of the union of cluster i and cluster j, and x_l is the lth data point in clusters i and j.
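The inter-cluster computation — the coefficient of variation of the pooled points of two clusters, minimised over all pairs — can be sketched as follows (illustrative, under the same ‖μ‖ scalarisation assumption as above; both function names are hypothetical):

```python
import numpy as np

def inter_cv(ci, cj):
    """CV_ij: coefficient of variation of the union of clusters i and j,
    with mu_ij the centroid of the pooled points."""
    pooled = np.vstack([ci, cj])
    mu = pooled.mean(axis=0)
    sigma = np.sqrt(np.mean(np.sum((pooled - mu)**2, axis=1)))
    return sigma / np.linalg.norm(mu)

def min_inter_cv(clusters):
    """D_min and the pair (i, j) attaining it, over all i < j."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            v = inter_cv(clusters[i], clusters[j])
            if best is None or v < best[0]:
                best = (v, (i, j))
    return best
```

Two nearby clusters pool into a still-tight set and yield a small CV_ij, so they are the natural candidates for merging.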
Further, calculating a difference value between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, comparing the difference value with a set value, and merging the two clusters with the minimum inter-cluster variation coefficient if the difference value is smaller than the set value; if the difference value is larger than or equal to the set value, the step of outputting the clustering result is as follows:
calculating the difference value T between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, and judging whether cluster combination needs to be carried out according to the difference value:
T = D_min − CV_avg   (9)

where CV_avg is the average intra-cluster coefficient of variation and D_min is the minimum inter-cluster coefficient of variation;
if T < 0, i.e. D_min < CV_avg, the two clusters with the minimum inter-cluster coefficient of variation are merged;
if T ≥ 0, i.e. D_min ≥ CV_avg, then: when 0 ≤ T < ε, the two clusters with the minimum inter-cluster coefficient of variation are merged; and when ε ≤ T, the number of clusters and the data points corresponding to each cluster are output.
As a second aspect of the present invention, there is provided a clustering system that automatically confirms the number of clusters based on a coefficient of variation;
a clustering system for automatically determining the number of clusters based on the coefficient of variation, comprising: the computer program product comprises a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
As a third aspect of the present invention, there is provided a computer-readable storage medium;
a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.
Compared with the prior art, the invention has the beneficial effects that:
the k-means + + clustering algorithm based on division is improved by using the concepts of the variation coefficient and the density index, the number of clusters does not need to be manually set, and the accuracy of a clustering result is also ensured.
The data point with the maximum density index is selected as the first clustering center, and the clustering algorithm based on division is sensitive to the selection of the initial centroid, so that abnormal values in the data set can be effectively avoided.
The improved clustering algorithm for automatically confirming the cluster number optimizes the confirmation of the cluster number and the selection of the initial centroid by utilizing the concept of the coefficient of variation, greatly improves the clustering quality, and can be effectively applied to the clustering analysis of data.
And the intra-cluster variation coefficient is used for representing the intra-cluster cohesion degree of the clusters, the inter-cluster variation coefficient is used for representing the inter-cluster separation degree of the clusters, and when the cohesion degree and the separation degree reach the maximum, the clustering effect is optimal.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
Fig. 1 is a flow chart of a clustering algorithm for automatically determining the number of clusters based on the coefficient of variation.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the clustering method for automatically determining the number of clusters based on the coefficient of variation includes:
step 1: calculating each data point in the data setDensity value ρxThe density index DI is calculated from the density values and the data point with the highest density index is selected as the first cluster center.
Step 2: calculate the shortest distance D(x) between each data point and the currently existing cluster centers, then calculate the probability P(x) of each data point being selected as the next cluster center according to that distance, and finally draw a preselected cluster center by the roulette-wheel method; when the density index of the preselected cluster center reaches the threshold τ it is used as a new cluster center, otherwise the selection is repeated.
Step 3: repeat Step 2 until k* (k_1 < k* < k_2) cluster centers have been selected, then perform k-means clustering to generate k* clusters.
Step 4: calculate the average intra-cluster coefficient of variation CV_avg and the minimum inter-cluster coefficient of variation D_min, and obtain their difference T = D_min − CV_avg. If T < 0, i.e. D_min < CV_avg, merge the two clusters with the smaller degree of separation; if T ≥ 0, i.e. D_min ≥ CV_avg, then when 0 ≤ T < ε merge the two clusters with the minimum inter-cluster coefficient of variation, and when ε ≤ T the clustering effect is optimal.
Step 5: execute Step 4 in a loop until the clustering effect is optimal.
Firstly, the initial cluster centers are selected using the concept of the density index, which improves the clustering quality. The density value of each data point is calculated, the density index is derived from the density values, and the data point with the maximum density index is selected as the first cluster center. Then, according to the distance between each data point and the already-existing cluster centers, the probability of the point being selected as the next cluster center is calculated and the remaining cluster centers are confirmed; a cluster center is accepted only when its density index reaches a set threshold. Finally, the k-means algorithm is performed to form the initial clustering.
The topics of conference papers vary widely, so cluster analysis is needed to gather papers with similar topics. Since the specific number of categories is not known at first, the proposed clustering algorithm for automatically determining the number of clusters is applied to obtain a high-quality clustering. The NIPS conference papers from 1987 to 2015 are taken as the experimental data set, and cluster analysis is performed on the conference papers according to the number of times each English word is used in them. The data set has 11463-dimensional attributes and 5811 sample data; the data space S = S_1 × S_2 × … × S_11463 is an 11463-dimensional data space, and x = (x_1, x_2, …, x_11463) gives the number of occurrences of each word in one NIPS conference paper.
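The bag-of-words representation used for the papers can be sketched as follows (illustrative; `word_count_vector` and the toy vocabulary are hypothetical, and in the actual data set the fixed vocabulary has 11463 words):

```python
from collections import Counter

def word_count_vector(text, vocabulary):
    """Represent one paper as its word counts over a fixed, ordered vocabulary,
    yielding one coordinate per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]
```

Each paper then becomes one point x in the data space, and distances between these vectors give the degree of difference between papers.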
The number of initial conference-paper categories is confirmed by randomly choosing a value k* with k_1 < k* < k_2, where k_1 and k_2 are both values significantly larger than the number of categories of the target conference papers.
The density value ρ_x of conference paper x in the conference-paper data set (S_1, S_2, …, S_11463) is computed, i.e. the number of conference papers whose degree of difference from paper x is less than or equal to the density range,
ρ_x = Σ_{y=1..num} f(d_xy)   (1)

f(d_xy) = 1 if d_xy ≤ R, and 0 otherwise   (2)
where num is the number of conference papers, d_xy is the degree of difference between conference papers y and x in the data set, R is the density range, and f is the function that judges whether the difference between conference paper y and conference paper x is less than or equal to the density range R.
According to the density value ρ_x of each conference paper, the density index DI is calculated and the conference paper with the largest density index, DI_max, is taken as the first cluster center, expressed by formula (3):

DI_x = ρ_x / Σ_{y=1..num} ρ_y   (3)
the meeting discussion with the largest density index is selected as the first clustering center, because the clustering algorithm based on division is sensitive to the selection of the initial centroid, and abnormal discussion data can be effectively avoided by selecting the meeting discussion with the larger density as the clustering center, so that the clustering quality is improved.
The minimum degree of difference D(x) between each conference paper and the currently existing cluster centers is calculated, and the probability of each conference paper being selected as the next cluster center is then calculated from this degree of difference:

P(x) = D(x)² / Σ_{x∈X} D(x)²   (4)
for the selection of the initial clustering center, the conference papers with larger mutual difference should be selected as the clustering center, so that the probability that each conference paper is selected as the clustering center is calculated, and the larger the difference with the existing clustering center is, the larger the probability that the conference paper is selected as the clustering center is, so that the selected clustering center is relatively discrete.
A preselected cluster center is drawn according to this probability by the roulette-wheel method. Because partition-based clustering algorithms are sensitive to abnormal values, a threshold τ is set, and a preselected cluster center becomes a formal cluster center only when its density index reaches τ; otherwise a new conference paper is reselected as the cluster center. This process is repeated until k* cluster centers have been selected, and the conventional k-means algorithm is then performed on the obtained k* initial cluster centers to form k* clusters.
Because the initially chosen number of paper categories k* is clearly larger than the target value k, clusters need to be merged to reduce their number to k; but since the target number of paper categories is not known at first, the concept of the coefficient of variation is introduced to decide when to stop merging. Whether the number of paper categories is optimal is determined from the relation between the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation of the k* clusters: the intra-cluster coefficient of variation represents the degree of cohesion within a cluster, the inter-cluster coefficient of variation represents the degree of separation between clusters, and the clustering effect is optimal when both the cohesion and the separation are maximized.
The concept of the coefficient of variation is introduced because the coefficient of variation is a statistic that characterises the distribution of data and reflects its degree of dispersion. Its advantage is that it is a dimensionless quantity that does not need to refer to the mean of the data: when two groups of data with different dimensions or different means are compared, the coefficient of variation rather than the standard deviation should be used as the reference for comparison. A threshold for the number of clusters computed from the coefficient of variation is therefore suitable for all types of data sets.
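This scale-invariance can be checked numerically (a generic illustrative sketch, not code from the patent; NumPy's population standard deviation is used):

```python
import numpy as np

def coefficient_of_variation(data):
    """CV = sigma / mu: standard deviation over the mean."""
    data = np.asarray(data, dtype=float)
    return float(data.std() / data.mean())

# Measuring the same quantity in different units (e.g. metres vs. centimetres)
# scales sigma and mu identically, so the CV is unchanged; this is why it can
# compare dispersion across clusters of any scale.
```

For example, [1, 2, 3] and [100, 200, 300] have very different standard deviations but the same coefficient of variation.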
It is the ratio of the variation measure of a set of data to its average measure, i.e. the ratio of the standard deviation σ to the mean μ, expressed by formulas (5) and (6):

σ = sqrt( (1/m) Σ_{j=1..m} (x_j − μ)² )   (5)

CV = σ / μ   (6)
calculating the intra-cluster variation coefficient of each cluster according to the variation coefficient, and then averaging the intra-cluster variation coefficients
CV_avg, expressed by formulas (7) and (8):

CV_i = sqrt( (1/m_i) Σ_{j=1..m_i} ‖x_j − μ_i‖² ) / ‖μ_i‖   (7)

CV_avg = (1/k*) Σ_{i=1..k*} CV_i   (8)
wherein μ_i is the centroid of cluster i, m_i is the number of conference papers in cluster i, and x_j is the jth conference paper in cluster i. Since a larger coefficient of variation indicates a more dispersed distribution of the conference papers, the degree of cohesion of each cluster is reflected by calculating the intra-cluster coefficient of variation.
The inter-cluster coefficient of variation between any two clusters is calculated according to the coefficient of variation, and the minimum inter-cluster coefficient of variation D_min is then obtained, expressed by formulas (9) and (10):

CV_ij = sqrt( (1/m_ij) Σ_{l=1..m_ij} ‖x_l − μ_ij‖² ) / ‖μ_ij‖   (9)

D_min = min{CV_ij, i = 1,2,…,k*, j = 1,2,…,k*, i ≠ j}   (10)
wherein m_ij is the sum of the numbers of conference papers in clusters i and j, μ_ij is the centroid of the union of cluster i and cluster j, and x_l is the lth conference paper in clusters i and j. The degree of separation of two clusters is reflected by calculating the inter-cluster coefficient of variation.
The difference T between the average intra-cluster coefficient of variation CV_avg and the minimum inter-cluster coefficient of variation D_min is calculated, and whether clusters need to be merged is judged from this difference:

T = D_min − CV_avg   (11)
If T < 0, i.e. D_min < CV_avg, there exist two clusters with a smaller inter-cluster coefficient of variation. The smaller the inter-cluster coefficient of variation, the more cohesive the distribution of the conference papers across those two clusters and the lower their degree of separation. Because the initially set number of clusters is larger than the target number, the average intra-cluster coefficient of variation is small, the variation amplitude is small and the cohesion of each cluster is high, so only a merging of clusters needs to be performed. The merging strategy is to merge the two clusters with the least separation, i.e. the two clusters whose inter-cluster coefficient of variation equals D_min.
If T is greater than or equal to 0, that is
Figure BDA0001750686510000078
When 0 is less than or equal to T<When epsilon is generated, the difference value is smaller, which indicates that two clusters with smaller inter-cluster variation coefficient exist, the closer the inter-cluster variation coefficient and the intra-cluster variation coefficient are, the more the distribution of the conference papers in the two clusters is aggregated, the lower the separation degree is, and the higher the aggregation degree of each cluster is, the cluster merging needs to be performed; when epsilon is less than or equal to T, a certain difference exists, which indicates that the inter-cluster variation coefficients are large, the larger the difference between the inter-cluster variation coefficients and the intra-cluster variation coefficients is, the more discrete the distribution of the conference papers in the two clusters is, the larger the separation degree is, and simultaneously, the higher the cohesion degree of each cluster is, when the separation degrees among all the clusters reach a certain degree, the good clustering effect is achieved, and the best number of the conference paper categories can be obtained.
If clusters were merged, the average intra-cluster coefficient of variation CV_avg and the minimum inter-cluster coefficient of variation D_min need to be recalculated, and whether the optimal clustering effect has been reached is judged again from their difference; otherwise cluster merging continues, and this process is executed in a loop until the termination condition is reached.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (3)

1. A clustering method for automatically confirming the number of meeting paper clusters based on variation coefficients is characterized by comprising the following steps:
taking the NIPS conference papers from 1987 to 2015 as the experimental data set and carrying out cluster analysis on the conference papers according to the number of times English words are used in them; the data set has 11463-dimensional attributes and 5811 sample data, the data space S = S_1 × S_2 × … × S_11463 is an 11463-dimensional data space, and x = (x_1, x_2, …, x_11463) represents the number of occurrences of each word in one NIPS conference paper;
step (1): confirming the number of initial conference-paper categories by randomly choosing a value k* with k_1 < k* < k_2, where k_1 and k_2 are each greater than the number of categories of the target conference papers,
computing the density value ρ_x of conference paper x in the conference-paper data set (S_1, S_2, …, S_11463), i.e. the number of conference papers whose degree of difference from paper x is less than or equal to the density range,
ρ_x = Σ_{y=1..num} f(d_xy)

f(d_xy) = 1 if d_xy ≤ R, and 0 otherwise
where num is the number of conference papers, d_xy is the degree of difference between conference papers y and x in the data set, R is the density range, and f is the function that judges whether the difference between conference paper y and conference paper x is less than or equal to the density range R; according to the density value ρ_x of each conference paper, calculating the density index DI and taking the conference paper with the highest density index, DI_max, as the first cluster center, expressed as

DI_x = ρ_x / Σ_{y=1..num} ρ_y
step (2): calculating the minimum degree of difference D(x) between each conference paper and the currently existing cluster centers, and then calculating the probability of each conference paper being selected as the next cluster center according to this degree of difference:

P(x) = D(x)² / Σ_{x∈X} D(x)²
step (3): selecting a preselected cluster center according to the probability by the roulette-wheel method, setting a threshold τ, taking the preselected cluster center as a formal cluster center only when its density index reaches τ, and otherwise reselecting a new conference paper as the cluster center; repeating the roulette-wheel selection until k* cluster centers are selected, and performing k-means clustering on the obtained k* initial cluster centers to form k* clusters;
step (4): calculating the intra-cluster coefficient of variation of each cluster and then the average intra-cluster coefficient of variation CV_avg, expressed as

CV_i = sqrt( (1/m_i) Σ_{j=1..m_i} ‖x_j − μ_i‖² ) / ‖μ_i‖

CV_avg = (1/k*) Σ_{i=1..k*} CV_i
wherein, muiIs the centroid of cluster i, miNumber of meeting papers for cluster i, xjFor the jth meeting paper in cluster i,
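The intra-cluster variation coefficient of step (4) could be computed as below. Since the original formula images are not reproduced here, this assumes one plausible reading of CV as standard deviation over mean: root-mean-square deviation from the centroid, normalised by the centroid norm (the data values and function name are illustrative):

```python
import numpy as np

def intra_cluster_cv(X, labels):
    """Per-cluster coefficient of variation: RMS distance of the cluster's
    points from its centroid mu_i, divided by the norm of mu_i."""
    cvs = []
    for c in np.unique(labels):
        pts = X[labels == c]
        mu = pts.mean(axis=0)
        rms = np.sqrt(((pts - mu) ** 2).sum(axis=1).mean())
        cvs.append(rms / np.linalg.norm(mu))
    return np.array(cvs)

X = np.array([[1.0, 0.0], [1.2, 0.0],   # tight cluster near (1.1, 0)
              [5.0, 0.0], [7.0, 0.0]])  # looser cluster near (6, 0)
labels = np.array([0, 0, 1, 1])
cvs = intra_cluster_cv(X, labels)
avg_cv = cvs.mean()   # the average intra-cluster variation coefficient
```

The tighter cluster yields the smaller coefficient, matching the intuition that a low intra-cluster CV indicates a compact cluster.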
calculating the inter-cluster variation coefficient between any two clusters, and then finding the minimum value $D_{min}$ of the inter-cluster variation coefficients, expressed as

$$CV_{ij} = \frac{1}{\|\mu_{ij}\|} \sqrt{\frac{1}{m_{ij}} \sum_{l=1}^{m_{ij}} \left( x_l - \mu_{ij} \right)^2}$$

$$D_{min} = \min\{CV_{ij},\ i = 1, 2, \ldots, k^*,\ j = 1, 2, \ldots, k^*,\ i \ne j\};$$

where $m_{ij}$ is the sum of the numbers of conference papers in cluster $i$ and cluster $j$, $\mu_{ij}$ is the centroid of cluster $i$ and cluster $j$ taken together, and $x_l$ is the $l$-th conference paper in cluster $i$ and cluster $j$; calculating the difference value $T$ between the average intra-cluster variation coefficient and the minimum inter-cluster variation coefficient, and judging, according to this difference value, whether cluster merging needs to be carried out,

$$T = \overline{CV} - D_{min};$$
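The inter-cluster variation coefficient and $D_{min}$ might be computed as follows, under the same assumed CV reading as above (RMS deviation from the merged centroid over the merged centroid's norm; data and names are illustrative):

```python
from itertools import combinations
import numpy as np

def inter_cluster_cv(X, labels):
    """CV of the union of every cluster pair; returns D_min and the pair
    attaining it, i.e. the two most mergeable clusters."""
    d_min, pair = np.inf, None
    for i, j in combinations(np.unique(labels), 2):
        pts = X[(labels == i) | (labels == j)]     # papers of clusters i and j
        mu = pts.mean(axis=0)                      # merged centroid mu_ij
        cv = np.sqrt(((pts - mu) ** 2).sum(axis=1).mean()) / np.linalg.norm(mu)
        if cv < d_min:
            d_min, pair = cv, (i, j)
    return d_min, pair

X = np.array([[1.0, 0.0], [1.2, 0.0],    # clusters 0 and 1 lie close together
              [1.4, 0.0], [9.0, 0.0]])   # cluster 2 lies far away
labels = np.array([0, 0, 1, 2])
d_min, pair = inter_cluster_cv(X, labels)
```

As expected, the two nearby clusters attain the smallest inter-cluster CV, so they are the candidates for merging.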
if $T < 0$, i.e.

$$\overline{CV} < D_{min},$$

merging the two clusters with the minimum inter-cluster variation coefficient;
if $T \ge 0$, i.e.

$$\overline{CV} \ge D_{min},$$

then: when $0 \le T < \varepsilon$, merging the two clusters with the minimum inter-cluster variation coefficient; when $\varepsilon \le T$, the optimal clustering effect has been achieved, and the number of clusters and the data points corresponding to each cluster are output;
if the clusters are merged, recalculating the average intra-cluster variation coefficient $\overline{CV}$ and the minimum inter-cluster variation coefficient $D_{min}$, and judging again, from their difference, whether the optimal clustering effect has been achieved; otherwise continuing to merge clusters, and executing this process in a loop until the termination condition is reached.
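The merge-and-recheck loop described above can be sketched end to end. This is a self-contained illustration under the assumed CV reading used throughout (std/mean via centroid norms); note that under this reading $T$ is typically negative, so the sign and magnitude of `eps` govern how aggressively clusters merge:

```python
from itertools import combinations
import numpy as np

def cv_of(pts):
    # coefficient of variation of a point set: RMS deviation from the
    # centroid, divided by the centroid norm (assumed std / mean reading)
    mu = pts.mean(axis=0)
    return np.sqrt(((pts - mu) ** 2).sum(axis=1).mean()) / np.linalg.norm(mu)

def merge_clusters(X, labels, eps):
    """Repeatedly merge the pair with the smallest inter-cluster CV while
    T = mean intra-cluster CV - D_min stays below eps."""
    labels = labels.copy()
    while len(np.unique(labels)) > 1:
        ids = np.unique(labels)
        avg_cv = np.mean([cv_of(X[labels == c]) for c in ids])
        d_min, pair = np.inf, None
        for i, j in combinations(ids, 2):
            cv = cv_of(X[(labels == i) | (labels == j)])
            if cv < d_min:
                d_min, pair = cv, (i, j)
        if avg_cv - d_min >= eps:             # T >= eps: optimal clustering reached
            break
        labels[labels == pair[1]] = pair[0]   # T < eps: merge the closest pair
    return labels

X = np.array([[1.0, 0.0], [1.2, 0.0], [1.4, 0.0],
              [9.0, 0.0], [9.2, 0.0]])
labels = np.array([0, 0, 1, 2, 2])
merged = merge_clusters(X, labels, eps=0.2)
```

A permissive `eps` lets every merge proceed until a single cluster remains, while a strict `eps` stops the loop before any merge; the patented method's effectiveness rests on choosing $\varepsilon$ (and the exact CV formulas) so that the loop halts at the natural cluster count.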
2. A clustering system for automatically determining the number of conference paper clusters based on the coefficient of variation, comprising: a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of claim 1.
3. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the method of claim 1.
CN201810864958.3A 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation Active CN109063769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810864958.3A CN109063769B (en) 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation


Publications (2)

Publication Number Publication Date
CN109063769A CN109063769A (en) 2018-12-21
CN109063769B true CN109063769B (en) 2021-04-09

Family

ID=64832407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810864958.3A Active CN109063769B (en) 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation

Country Status (1)

Country Link
CN (1) CN109063769B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027585B (en) * 2019-10-25 2023-04-07 南京大学 K-means algorithm hardware realization method and system based on k-means + + centroid initialization
CN111368876A (en) * 2020-02-11 2020-07-03 广东工业大学 Double-threshold sequential clustering method
CN111476270B (en) * 2020-03-04 2024-04-30 中国平安人寿保险股份有限公司 Course information determining method, device, equipment and storage medium based on K-means algorithm
CN111833171B (en) * 2020-03-06 2021-06-25 北京芯盾时代科技有限公司 Abnormal operation detection and model training method, device and readable storage medium
CN111507428B (en) * 2020-05-29 2024-01-05 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN112070387B (en) * 2020-09-04 2023-09-26 北京交通大学 Method for evaluating multipath component clustering performance of complex propagation environment
CN112053063B (en) * 2020-09-08 2023-12-19 山东大学 Load partitioning method and system for planning and designing energy system
CN113378682B (en) * 2021-06-03 2023-04-07 山东省科学院自动化研究所 Millimeter wave radar fall detection method and system based on improved clustering algorithm
CN113301600A (en) * 2021-07-27 2021-08-24 南京中网卫星通信股份有限公司 Abnormal data detection method and device for performance of satellite and wireless communication converged network
CN116109933B (en) * 2023-04-13 2023-06-23 山东省土地发展集团有限公司 Dynamic identification method for ecological restoration of abandoned mine

Citations (5)

Publication number Priority date Publication date Assignee Title
CN105139282A (en) * 2015-08-20 2015-12-09 国家电网公司 Power grid index data processing method, device and calculation device
CN105488589A (en) * 2015-11-27 2016-04-13 江苏省电力公司电力科学研究院 Genetic simulated annealing algorithm based power grid line loss management evaluation method
CN106570729A (en) * 2016-11-14 2017-04-19 南昌航空大学 Air conditioner reliability influence factor-based regional clustering method
CN107133652A (en) * 2017-05-17 2017-09-05 国网山东省电力公司烟台供电公司 Electricity customers Valuation Method and system based on K means clustering algorithms
CN107229751A (en) * 2017-06-28 2017-10-03 济南大学 A kind of concurrent incremental formula association rule mining method towards stream data

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8473215B2 (en) * 2003-04-25 2013-06-25 Leland Stanford Junior University Method for clustering data items through distance-merging and density-merging techniques


Non-Patent Citations (2)

Title
Detecting cluster numbers based on density changes using density-index enhanced scale-invariant density-based clustering initialization algorithm; Onapa Limwattanapibool et al.; 2017 9th International Conference on Information Technology and Electrical Engineering; 2018-01-11; pp. 1-5 *
Application research of the clustering K-means algorithm; Shi Yunping; Theory and Method; 2009-08-31; pp. 28-31 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant