CN109063769A - Clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation - Google Patents

Clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation Download PDF

Info

Publication number
CN109063769A
CN109063769A CN201810864958.3A CN201810864958A
Authority
CN
China
Prior art keywords
cluster
coefficient
variation
data point
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810864958.3A
Other languages
Chinese (zh)
Other versions
CN109063769B (en)
Inventor
刘腾腾
曲守宁
张坤
杜韬
王凯
郭庆北
朱连江
王钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN201810864958.3A priority Critical patent/CN109063769B/en
Publication of CN109063769A publication Critical patent/CN109063769A/en
Application granted granted Critical
Publication of CN109063769B publication Critical patent/CN109063769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation. The density value of each data point in the data set is computed, a density index is computed from the density values, and the data point with the largest density index is selected as the first cluster centre. The shortest distance between each data point and the existing cluster centres is then computed, the probability of each data point being chosen as a cluster centre is computed from that shortest distance, and a cluster centre is pre-selected by the roulette-wheel method. This is repeated until the set number of cluster centres has been selected, after which k-means clustering is performed on the selected initial cluster centres to generate the corresponding number of clusters. The average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation are computed, their difference is compared with a set value, and if the difference is less than the set value, the two clusters with the smallest inter-cluster coefficient of variation are merged. This repeats until the difference is greater than or equal to the set value, at which point the clustering result is output.

Description

Clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation
Technical field
The present invention relates to a clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation.
Background art
With the rapid development of information technology, many sectors such as business, enterprise, scientific research institutions and government departments have accumulated massive amounts of data stored in diverse forms. These mass data often contain useful information that cannot be obtained through database query, retrieval and statistical methods alone, which has driven the rapid development of data mining. Cluster analysis is an important research field within data mining and has been widely applied in many areas, including pattern recognition, data analysis, image processing and market research.
Cluster analysis is an unsupervised learning method. Partition-based clustering algorithms are simple and applicable to various data types, but they require the number of clusters to be set in advance and are sensitive to the initial cluster centres. The k-means++ algorithm improves on the traditional k-means algorithm, but still has the defect that the number of clusters must be set manually.
Summary of the invention
To remedy the deficiencies of the prior art, the present invention provides a clustering method, system and medium that automatically determine the number of clusters based on the coefficient of variation, solving the defects of the traditional k-means++ clustering algorithm, namely that the number of clusters must be set manually and that the initial centroids may be chosen poorly. The partition-based k-means++ clustering algorithm is improved using the concepts of the coefficient of variation and the density index, so that the number of clusters need not be set manually while the accuracy of the clustering result is preserved.
To solve the above technical problem, the present invention adopts the following technical scheme:
As a first aspect of the present invention, a clustering method for automatically determining the number of clusters based on the coefficient of variation is provided.
The clustering method for automatically determining the number of clusters based on the coefficient of variation comprises:
Step (1): compute the density value of each data point in the data set, compute a density index from the density values, and select the data point with the largest density index as the first cluster centre;
Step (2): compute the shortest distance between each data point and the existing cluster centres, then compute from this shortest distance the probability of each data point being chosen as a cluster centre, and finally pre-select a cluster centre by the roulette-wheel method; the density index of the pre-selected cluster centre must be greater than a set threshold;
Step (3): repeat step (2) until the set number of cluster centres has been selected, then perform k-means clustering on the selected initial cluster centres to generate the corresponding number of clusters;
Step (4): compute the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation, then compute the difference between them and compare it with a set value; if the difference is less than the set value, merge the two clusters with the smallest inter-cluster coefficient of variation. Repeat step (4) until the difference is greater than or equal to the set value, then output the clustering result.
Further, the step of computing the density value of each data point in the data set is as follows:
Assume the data set (S1, S2, …, Sd) has d dimensional attributes, the data space S = S1 × S2 × … × Sd is a d-dimensional data space, and x ∈ (x1, x2, …, xd) denotes a data point of the data set in the d-dimensional data space.
First, set the value of the initial number of clusters k* (k1 < k* < k2), where k1 and k2 are both greater than the number of target clusters.
Then, compute the density value ρx of data point x, expressed by formulas (1) and (2):
where num is the number of data points, dxy is the distance from data point y in the data set to data point x, R is the density range, and f(X) is a function that judges whether the distance between data point y and data point x is less than or equal to the density range R;
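As an informal illustration of the density computation above, the sketch below counts, for each point, how many points lie within the density range R. Euclidean distance is assumed for the metric dxy, and the function name is illustrative; the patent's formulas (1) and (2) are not reproduced in this text.

```python
import numpy as np

def density_values(X, R):
    """Count, for each point, the points whose distance d_xy is <= R.

    X is an (n, d) array; R is the density range. Euclidean distance is
    an assumption, since the patent does not fix the metric.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise distances d_xy between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # f(X) = 1 when d_xy <= R, else 0; the density is the row sum
    # (each point counts itself).
    return (dist <= R).sum(axis=1)
```

For instance, with two nearby points and one distant point and R = 1.5, the densities come out as 2, 2 and 1.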
Further, the step of computing the density index from the density values and selecting the data point with the largest density index as the first cluster centre is as follows:
Compute the density index DI (Density Index) from the density value ρx, and take the data point with the largest density index as the first cluster centre:
Further, the step of computing the shortest distance between each data point and the existing cluster centres is as follows:
Following the way initial cluster centres are selected in the k-means++ algorithm, for each remaining data point in the data set, compute in turn its distance to each initial cluster centre already selected, and take the smallest of these distances as the shortest distance D(x) between that data point and the existing cluster centres.
Further, the step of computing from the shortest distance the probability of each data point being chosen as a cluster centre is as follows:
where D(x) denotes the shortest distance between each data point and the existing cluster centres, and P(x) denotes the probability of each data point being chosen as a cluster centre;
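The shortest distance D(x) and the probability P(x) can be sketched as follows. The D(x)² weighting is the standard k-means++ choice and is an assumption here, since formula (4) is not reproduced in this text.

```python
import numpy as np

def selection_probabilities(X, centres):
    """D(x): shortest distance from each point to any existing centre.
    P(x): probability of being chosen as the next centre, taken here as
    proportional to D(x)^2 (the usual k-means++ weighting; the patent's
    formula (4) is not reproduced, so the exponent is an assumption).
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(centres, dtype=float)
    # Distance from every point to every existing centre; keep the minimum.
    d = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    w = d ** 2
    return d, w / w.sum()
```

With a single centre at the origin and points at distances 0, 2 and 4, the probabilities come out as 0, 0.2 and 0.8.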
Further, the step of pre-selecting a cluster centre by the roulette-wheel method is as follows:
Set a threshold τ; only when the density index of a pre-selected cluster centre reaches τ can it serve as a formal cluster centre, otherwise a new data point is re-selected as the cluster centre. The roulette-wheel method is repeated until k* cluster centres have been selected.
Further, the step of computing the average intra-cluster coefficient of variation is as follows:
First, compute the intra-cluster coefficient of variation CVi of each cluster:
Then, compute the average intra-cluster coefficient of variation:
where μi is the centroid of cluster i, mi is the number of data points in cluster i, xj is the j-th data point in cluster i, and k* denotes the number of pre-selected cluster centres.
Because a larger coefficient of variation indicates more dispersed data points, computing the intra-cluster coefficient of variation reflects the quality of a cluster's cohesion.
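Formulas (5) to (7) are not reproduced in this text, so the sketch below uses one plausible multivariate reading of the intra-cluster coefficient of variation: the standard deviation of the points' distances about the centroid, divided by the norm of the centroid. Treat the exact formula as an assumption.

```python
import numpy as np

def intra_cluster_cvs(clusters):
    """Per-cluster coefficient of variation CV_i and their average.

    clusters is a list of (m_i, d) point arrays. Assumed definition:
    std of the points' distances about the centroid mu_i over ||mu_i||
    (the patent's formulas (5)-(7) are not reproduced here).
    """
    cvs = []
    for pts in clusters:
        pts = np.asarray(pts, dtype=float)
        mu = pts.mean(axis=0)                                  # centroid mu_i
        sigma = np.sqrt(((pts - mu) ** 2).sum(axis=1).mean())  # spread about mu_i
        cvs.append(sigma / np.linalg.norm(mu))
    return np.array(cvs), float(np.mean(cvs))
```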
Further, the step of computing the minimum inter-cluster coefficient of variation is as follows:
First, compute the inter-cluster coefficient of variation CVij:
Then, compute the minimum inter-cluster coefficient of variation Dmin:
Dmin = min{ CVij }, i = 1, 2, …, k*, j = 1, 2, …, k*   (8)
where mij is the number of data points in clusters i and j, μij is the centroid of clusters i and j, and xl is the l-th data point in clusters i and j.
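The minimum inter-cluster coefficient of variation Dmin of formula (8) can be sketched by pooling every pair of clusters and applying the same assumed coefficient-of-variation definition to the pooled points; the formula for CVij is not reproduced in this text, so that definition is an assumption.

```python
import numpy as np

def min_inter_cluster_cv(clusters):
    """D_min = min over pairs (i, j) of the inter-cluster CV (formula (8)).

    Each pair is pooled (m_ij points with shared centroid mu_ij) and an
    assumed CV is applied to the pool: std of distances about the pooled
    centroid over its norm. Returns (D_min, (i, j)).
    """
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    best_cv, best_pair = np.inf, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            pool = np.vstack([clusters[i], clusters[j]])  # clusters i and j pooled
            mu = pool.mean(axis=0)
            sigma = np.sqrt(((pool - mu) ** 2).sum(axis=1).mean())
            cv = sigma / np.linalg.norm(mu)
            if cv < best_cv:
                best_cv, best_pair = cv, (i, j)
    return best_cv, best_pair
```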
Further, the step of computing the difference between the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation, comparing the difference with the set value, merging the two clusters with the smallest inter-cluster coefficient of variation if the difference is less than the set value, and outputting the clustering result if the difference is greater than or equal to the set value, is as follows:
Compute the difference T between the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation, and judge from this difference whether clusters need to be merged:
If T < 0, merge the two clusters with the smallest inter-cluster coefficient of variation;
If T ≥ 0: when 0 ≤ T < ε, merge the two clusters with the smallest inter-cluster coefficient of variation;
When ε ≤ T, output the number of clusters and the data points corresponding to each cluster.
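The merging loop of step (4) can be sketched as follows. The sign convention T = Dmin minus the average intra-cluster coefficient of variation is inferred from the three cases above (T < 0 and 0 ≤ T < ε both trigger a merge, ε ≤ T stops), and the coefficient-of-variation definition is the same assumed one as before.

```python
import numpy as np

def _cv(points):
    # Assumed CV of a point set: std of distances about the centroid,
    # divided by the centroid norm (the patent's formulas are not shown).
    mu = points.mean(axis=0)
    sigma = np.sqrt(((points - mu) ** 2).sum(axis=1).mean())
    return sigma / np.linalg.norm(mu)

def merge_until_separated(clusters, epsilon):
    """Step (4) loop: while T = D_min - mean intra-cluster CV < epsilon,
    merge the pair of clusters achieving D_min; stop when epsilon <= T."""
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    while len(clusters) > 1:
        mean_cv = float(np.mean([_cv(c) for c in clusters]))
        # Minimum inter-cluster CV over all pairs of pooled clusters.
        d_min, i, j = min(
            (_cv(np.vstack([clusters[a], clusters[b]])), a, b)
            for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        if d_min - mean_cv >= epsilon:
            break  # sufficiently separated: output the clustering
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters
```

With two nearby singleton clusters and one distant singleton, the two nearby ones merge and the loop then stops with two clusters.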
As a second aspect of the present invention, a clustering system for automatically determining the number of clusters based on the coefficient of variation is provided.
The clustering system for automatically determining the number of clusters based on the coefficient of variation comprises: a memory, a processor, and computer instructions stored on the memory and running on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
As a third aspect of the present invention, a computer-readable storage medium is provided.
A computer-readable storage medium, on which computer instructions are stored; when the computer instructions are run by a processor, the steps of any of the above methods are completed.
Compared with the prior art, the beneficial effects of the present invention are:
The partition-based k-means++ clustering algorithm is improved using the concepts of the coefficient of variation and the density index, so that the number of clusters need not be set manually while the accuracy of the clustering result is preserved.
The data point with the largest density index is selected as the first cluster centre because partition-based clustering algorithms are sensitive to the choice of the initial centroids; selecting it in this way effectively avoids outliers in the data set.
The improved clustering algorithm that automatically determines the number of clusters uses the concept of the coefficient of variation to optimize both the determination of the number of clusters and the selection of the initial centroids, which greatly improves the quality of the clustering results; it can be effectively applied to data cluster analysis.
The intra-cluster coefficient of variation expresses a cluster's cohesion, and the inter-cluster coefficient of variation expresses the separation between clusters; when cohesion and separation are maximal, the clustering effect is optimal.
Detailed description of the invention
The accompanying drawings, which constitute a part of this application, are provided to give a further understanding of the application; the illustrative embodiments of the application and their description serve to explain the application and do not constitute an undue limitation on it.
Fig. 1 is a flow chart of the clustering algorithm that automatically determines the number of clusters based on the coefficient of variation.
Specific embodiment
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which this application belongs.
It should also be noted that the terms used here are merely for describing specific embodiments and are not intended to limit the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
As shown in Fig. 1, the clustering method for automatically determining the number of clusters based on the coefficient of variation comprises:
Step 1: compute the density value ρx of each data point in the data set, compute the density index DI from the density values, and select the data point with the largest density index as the first cluster centre.
Step 2: compute the shortest distance D(x) between each data point and the existing cluster centres, then compute from this distance the probability P(x) of each data point being chosen as the next cluster centre, and finally pre-select a cluster centre by the roulette-wheel method; only when the density index of the pre-selected cluster centre reaches the threshold τ can it serve as a new cluster centre, otherwise the selection is recomputed.
Step 3: repeat Step 2 until k* (k1 < k* < k2) cluster centres have been selected, then perform k-means clustering to generate k* clusters.
Step 4: compute the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation Dmin, and obtain their difference T. If T < 0, merge the two clusters with the smallest separation; if T ≥ 0, then when 0 ≤ T < ε merge the two clusters with the smallest separation, and when ε ≤ T the clustering effect is optimal.
Step 5: execute Step 4 in a loop until the clustering effect is optimal.
Initial cluster centres are first selected using the concept of the density index, which improves clustering quality. The selection proceeds by computing the density value of each data point, computing the density index from the density values, and selecting the data point with the largest density index as the first cluster centre; then, from each data point's distance to the existing cluster centres, the probability of it being chosen as the next cluster centre is computed, and the remaining cluster centres are determined accordingly, with the density index of each cluster centre required to reach a certain threshold. Finally, the k-means algorithm is run to form the initial clustering.
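Steps 1 to 3 of the initialisation can be sketched end to end as follows. The density index is assumed here to be the density normalised by its maximum, since formula (3) is not reproduced in this text; R, τ and all names are illustrative.

```python
import numpy as np

def init_centres(X, k_star, R, tau, rng):
    """Sketch of Steps 1-3: density-index first centre, then roulette-wheel
    k-means++-style selection gated by the threshold tau.

    The density index is assumed to be the density normalised by its
    maximum (formula (3) is not reproduced); R, tau and names are
    illustrative. Returns the indices of the selected centres.
    """
    X = np.asarray(X, dtype=float)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    density = (dist <= R).sum(axis=1)          # formulas (1)/(2)
    di = density / density.max()               # assumed density index
    centres = [int(np.argmax(di))]             # first centre: largest index
    while len(centres) < k_star:
        d = dist[:, centres].min(axis=1)       # D(x) to the nearest centre
        w = d ** 2                             # assumed k-means++ weighting
        cand = int(np.searchsorted(np.cumsum(w / w.sum()), rng.random()))
        if di[cand] >= tau and cand not in centres:
            centres.append(cand)               # accepted as a formal centre
    return centres
```

The selected indices would then seed an ordinary k-means run to form the k* initial clusters.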
The topics of conference papers are varied, so cluster analysis is performed on conference papers to bring papers with similar topics together. But the specific number of categories is unknown at the outset, so to obtain a high-quality clustering the proposed clustering algorithm that automatically determines the number of clusters is applied here. We take the NIPS conference papers from 1987 to 2015 as the experimental data set, and cluster the papers mainly according to the number of occurrences of English words in each paper. The data set has 11463 dimensional attributes and 5811 sample data; the data space S = S1 × S2 × … × S11463 is an 11463-dimensional data space, and x ∈ (x1, x2, …, x5811) denotes the occurrence counts of each word in a NIPS conference paper.
To determine the number of initial conference-paper categories, set the value of k* (k1 < k* < k2) at random, where k1 and k2 are both clearly greater than the number of target paper categories.
Compute the density value ρx of conference paper x in the conference-paper data set (S1, S2, …, S11463), i.e. the number of conference papers whose dissimilarity to paper x is less than or equal to the density range,
where num is the number of conference papers, dxy is the dissimilarity between conference papers y and x in the data set, R is the density range, and f(X) is a function that judges whether the dissimilarity between conference papers y and x is less than or equal to the density range R.
Compute the density index DI (Density Index) of each conference paper from its density value ρx, and take the paper with the largest density index as the first cluster centre, i.e. DImax, expressed by formula (3),
The conference paper with the largest density index is selected as the first cluster centre because partition-based clustering algorithms are sensitive to the choice of the initial centroids; selecting a paper with a large density as a cluster centre effectively avoids abnormal paper data and thus improves the quality of the clustering.
Compute the smallest dissimilarity D(x) between each conference paper and the existing cluster centres, then compute from this dissimilarity the probability of each paper being chosen as the next cluster centre,
For the selection of the initial cluster centres, papers with large mutual dissimilarity should be chosen as cluster centres. Therefore, the probability of each paper being chosen as a cluster centre is computed: the larger its dissimilarity to the existing cluster centres, the larger its probability of being chosen, so that the selected cluster centres are relatively dispersed.
Pre-select a cluster centre by the roulette-wheel method according to this probability. Since partition-based clustering algorithms are sensitive to outliers, a threshold τ is set: only when the density index of a pre-selected cluster centre reaches τ can it serve as a formal cluster centre, otherwise a new paper is re-selected as the cluster centre. This process is repeated until k* cluster centres have been selected, and the traditional k-means algorithm is then run on the resulting k* initial cluster centres to form k* clusters.
Since the initially selected number of paper categories k* is clearly greater than the target value k, clusters need to be merged to reduce their number to k. But the target number of paper categories is unknown at the outset, so the concept of the coefficient of variation is introduced to determine when to stop merging. Whether the number of paper categories is optimal is determined by computing the relationship between the average intra-cluster coefficient of variation of the k* clusters and the minimum inter-cluster coefficient of variation: the intra-cluster coefficient of variation expresses a cluster's cohesion, the inter-cluster coefficient of variation expresses the separation between clusters, and when cohesion and separation reach their maximum the clustering effect is optimal.
The concept of the coefficient of variation is introduced as follows. The coefficient of variation is a statistic describing the distribution of data, used to reflect its degree of dispersion; its advantage is that it is dimensionless, so the mean of the data need not serve as a reference. When comparing two groups of data with different dimensions or different means, the coefficient of variation rather than the standard deviation should be used as the basis of comparison; therefore, using the coefficient of variation to compute the threshold for the number of clusters suits all types of data sets.
It means the ratio of the variation indicator of a group of data to its mean indicator, i.e. the ratio of the standard deviation σ to the mean μ, expressed by formulas (5) and (6),
Compute the intra-cluster coefficient of variation of each cluster from the coefficient of variation, then take the average of the intra-cluster coefficients of variation, expressed by formulas (7) and (8),
where μi is the centroid of cluster i, mi is the number of conference papers in cluster i, and xj is the j-th conference paper in cluster i. Because a larger coefficient of variation indicates a more dispersed distribution of papers, computing the intra-cluster coefficient of variation reflects the quality of each cluster's cohesion.
Compute the inter-cluster coefficient of variation between any two clusters, then take the minimum value Dmin of the inter-cluster coefficients of variation, expressed by formulas (9) and (10),
Dmin = min{ CVij }, i = 1, 2, …, k*, j = 1, 2, …, k*   (10)
where mij is the number of conference papers in clusters i and j, μij is the centroid of clusters i and j, and xl is the l-th conference paper in clusters i and j. Computing the inter-cluster coefficient of variation reflects the quality of the separation between two clusters.
Compute the difference T between the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation, and judge from this difference whether clusters need to be merged,
If T < 0, there exist two clusters with a small inter-cluster coefficient of variation. The smaller the inter-cluster coefficient of variation, the more tightly the papers in those two clusters agglomerate and the lower their separation; since the initially set number of clusters is greater than the number of target clusters, the average intra-cluster coefficient of variation is small and varies little, and the cohesion of each cluster is high, so the clusters only need to be merged. The merging strategy is to merge the two clusters with the smallest separation, i.e. the two clusters whose inter-cluster coefficient of variation is Dmin.
If T ≥ 0: when 0 ≤ T < ε the difference is small, indicating that there are two clusters with a small inter-cluster coefficient of variation; the closer the inter-cluster coefficient of variation is to the intra-cluster coefficient of variation, the more tightly the papers in those two clusters agglomerate and the lower their separation, while the cohesion of each cluster is high, so clusters still need to be merged. When ε ≤ T there is a clear difference, indicating that the inter-cluster coefficients of variation are larger; the greater the gap between the inter-cluster and intra-cluster coefficients of variation, the more dispersed the papers of any two clusters and the greater their separation, while the cohesion of each cluster remains high. When the separation between all clusters reaches a certain level, the clustering effect is good and the optimal number of paper categories is obtained.
If clusters are merged, the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation Dmin must be recomputed, and whether the optimal clustering effect has been reached is judged from their difference; otherwise the merging of clusters continues. This process is executed in a loop until the termination condition is reached.
The above is merely a preferred embodiment of the application and is not intended to limit it; for those skilled in the art, various modifications and changes are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the application shall be included within its scope of protection.

Claims (10)

1. A clustering method for automatically determining the number of clusters based on the coefficient of variation, characterized by comprising:
Step (1): computing the density value of each data point in the data set, computing a density index from the density values, and selecting the data point with the largest density index as the first cluster centre;
Step (2): computing the shortest distance between each data point and the existing cluster centres, then computing from this shortest distance the probability of each data point being chosen as a cluster centre, and finally pre-selecting a cluster centre by the roulette-wheel method, the density index of the pre-selected cluster centre being greater than a set threshold;
Step (3): repeating step (2) until the set number of cluster centres has been selected, then performing k-means clustering on the selected initial cluster centres to generate the corresponding number of clusters;
Step (4): computing the average intra-cluster coefficient of variation and the minimum inter-cluster coefficient of variation, then computing the difference between them and comparing it with a set value, and if the difference is less than the set value, merging the two clusters with the smallest inter-cluster coefficient of variation; repeating step (4) until the difference is greater than or equal to the set value, then outputting the clustering result.
2. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of computing the density value of each data point in the data set is as follows:
assume the data set (S1, S2, …, Sd) has d dimensional attributes, the data space S = S1 × S2 × … × Sd is a d-dimensional data space, and x ∈ (x1, x2, …, xd) denotes a data point of the data set in the d-dimensional data space;
first, set the value of the initial number of clusters k*, where k1 < k* < k2, and k1 and k2 are both greater than the number of target clusters;
then, compute the density value ρx of data point x, expressed by formulas (1) and (2):
where num is the number of data points, dxy is the distance from data point y in the data set to data point x, R is the density range, and f(X) is a function that judges whether the distance between data point y and data point x is less than or equal to the density range R.
3. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of computing the density index from the density values and selecting the data point with the largest density index as the first cluster centre is as follows:
compute the density index DI from the density value ρx, and take the data point with the largest density index as the first cluster centre:
4. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of computing the shortest distance between each data point and the existing cluster centres is as follows:
following the way initial cluster centres are selected in the k-means++ algorithm, for each remaining data point in the data set, compute in turn its distance to each initial cluster centre already selected, and take the smallest of these distances as the shortest distance D(x) between that data point and the existing cluster centres.
5. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of computing from the shortest distance the probability of each data point being chosen as a cluster centre is as follows:
where D(x) denotes the shortest distance between each data point and the existing cluster centres, and P(x) denotes the probability of each data point being chosen as a cluster centre;
the step of pre-selecting a cluster centre by the roulette-wheel method is as follows:
set a threshold τ; only when the density index of a pre-selected cluster centre reaches τ can it serve as a formal cluster centre, otherwise a new data point is re-selected as the cluster centre; the roulette-wheel method is repeated until k* cluster centres have been selected.
6. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of computing the average intra-cluster coefficient of variation is as follows:
first, compute the intra-cluster coefficient of variation CVi of each cluster:
then, compute the average intra-cluster coefficient of variation:
where μi is the centroid of cluster i, mi is the number of data points in cluster i, xj is the j-th data point in cluster i, and k* denotes the number of pre-selected cluster centres.
7. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of calculating the minimum between-cluster coefficient of variation is:
first, the between-cluster coefficient of variation CV_ij is calculated;
then, the minimum between-cluster coefficient of variation D_min is calculated:
D_min = min{CV_ij}, i = 1, 2, …, k*, j = 1, 2, …, k*   (8)
wherein m_ij is the number of data points in clusters i and j, μ_ij is the centroid of clusters i and j, and x_l is the l-th data point in clusters i and j.
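A sketch of D_min over distinct cluster pairs, computing CV_ij on the union of clusters i and j (their m_ij points and joint centroid μ_ij). Since the formula images are not reproduced, CV is again assumed to be the standard deviation of point-to-centroid distances divided by their mean; names are illustrative:

```python
import itertools

import numpy as np

def min_between_cluster_cv(clusters):
    """D_min = min over distinct pairs (i, j) of CV_ij, the coefficient
    of variation of the merged pair of clusters."""
    def cv(points):
        mu = points.mean(axis=0)                  # joint centroid mu_ij
        d = np.linalg.norm(points - mu, axis=1)   # distances of m_ij points
        return d.std() / d.mean()                 # assumed CV definition
    arrays = [np.asarray(c, dtype=float) for c in clusters]
    return min(
        cv(np.vstack((arrays[i], arrays[j])))
        for i, j in itertools.combinations(range(len(arrays)), 2)
    )
```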
8. The clustering method for automatically determining the number of clusters based on the coefficient of variation as claimed in claim 1, characterized in that
the step of calculating the difference between the mean within-cluster coefficient of variation and the minimum between-cluster coefficient of variation, comparing the difference with a set value, merging the two clusters with the smallest between-cluster coefficient of variation if the difference is less than the set value, and outputting the clustering result if the difference is greater than or equal to the set value, is:
the difference T between the mean within-cluster coefficient of variation and the minimum between-cluster coefficient of variation is calculated, and whether clusters need to be merged is judged from this difference:
if T < 0, the two clusters with the smallest between-cluster coefficient of variation are merged;
if T ≥ 0: when 0 ≤ T < ε, the two clusters with the smallest between-cluster coefficient of variation are merged; when ε ≤ T, the number of clusters and the data points corresponding to each cluster are output.
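The stopping rule above can be paraphrased in code. Here T is taken as the mean within-cluster CV minus D_min, an inference from the claim wording since the inline formulas are images; both the T < 0 branch and the 0 ≤ T < ε branch lead to a merge, so merging continues exactly while T < ε:

```python
def merge_decision(mean_within_cv, d_min, eps):
    """Return 'merge' while T = mean_within_cv - d_min is below the
    set value eps, and 'output' once eps <= T (an assumed reading of
    claim 8's branches)."""
    t = mean_within_cv - d_min
    return "merge" if t < eps else "output"
```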
9. A clustering system for automatically determining the number of clusters based on the coefficient of variation, comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when run by the processor, complete the steps of any one of the above methods.
10. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when run by a processor, complete the steps of any one of the above methods.
CN201810864958.3A 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation Active CN109063769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810864958.3A CN109063769B (en) 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation


Publications (2)

Publication Number Publication Date
CN109063769A true CN109063769A (en) 2018-12-21
CN109063769B CN109063769B (en) 2021-04-09

Family

ID=64832407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810864958.3A Active CN109063769B (en) 2018-08-01 2018-08-01 Clustering method, system and medium for automatically determining cluster number based on coefficient of variation

Country Status (1)

Country Link
CN (1) CN109063769B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139282A (en) * 2015-08-20 2015-12-09 国家电网公司 Power grid index data processing method, device and calculation device
CN105488589A (en) * 2015-11-27 2016-04-13 江苏省电力公司电力科学研究院 Genetic simulated annealing algorithm based power grid line loss management evaluation method
US20170091282A1 (en) * 2003-04-25 2017-03-30 The Board Of Trustees Of The Leland Stanford Junior University A method for identifying clusters of fluorescence-activated cell sorting data points
CN106570729A (en) * 2016-11-14 2017-04-19 南昌航空大学 Air conditioner reliability influence factor-based regional clustering method
CN107133652A (en) * 2017-05-17 2017-09-05 国网山东省电力公司烟台供电公司 Electricity customers Valuation Method and system based on K means clustering algorithms
CN107229751A (en) * 2017-06-28 2017-10-03 济南大学 A concurrent incremental association rule mining method for streaming data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ONAPA LIMWATTANAPIBOOL et al.: "Detecting cluster numbers based on density changes using density-index enhanced scale-invariant density-based clustering initialization algorithm", 2017 9th International Conference on Information Technology and Electrical Engineering *
SHI Yunping: "Research on the application of the K-means clustering algorithm", Theory and Methods *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027585A (en) * 2019-10-25 2020-04-17 南京大学 K-means algorithm hardware realization method and system based on k-means++ centroid initialization
CN111368876A (en) * 2020-02-11 2020-07-03 广东工业大学 Double-threshold sequential clustering method
CN111476270A (en) * 2020-03-04 2020-07-31 中国平安人寿保险股份有限公司 Course information determining method, device, equipment and storage medium based on K-means algorithm
CN111476270B (en) * 2020-03-04 2024-04-30 中国平安人寿保险股份有限公司 Course information determining method, device, equipment and storage medium based on K-means algorithm
CN111833171A (en) * 2020-03-06 2020-10-27 北京芯盾时代科技有限公司 Abnormal operation detection and model training method, device and readable storage medium
CN111507428A (en) * 2020-05-29 2020-08-07 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111507428B (en) * 2020-05-29 2024-01-05 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN112070387B (en) * 2020-09-04 2023-09-26 北京交通大学 Method for evaluating multipath component clustering performance of complex propagation environment
CN112070387A (en) * 2020-09-04 2020-12-11 北京交通大学 Multipath component clustering performance evaluation method in complex propagation environment
CN112053063B (en) * 2020-09-08 2023-12-19 山东大学 Load partitioning method and system for planning and designing energy system
CN112053063A (en) * 2020-09-08 2020-12-08 山东大学 Load partitioning method and system for energy system planning design
CN113378682A (en) * 2021-06-03 2021-09-10 山东省科学院自动化研究所 Millimeter wave radar fall detection method and system based on improved clustering algorithm
WO2023004899A1 (en) * 2021-07-27 2023-02-02 南京中网卫星通信股份有限公司 Method and apparatus for detecting abnormal data of satellite and wireless communication convergence network performance
CN113301600A (en) * 2021-07-27 2021-08-24 南京中网卫星通信股份有限公司 Abnormal data detection method and device for performance of satellite and wireless communication converged network
CN116109933A (en) * 2023-04-13 2023-05-12 山东省土地发展集团有限公司 Dynamic identification method for ecological restoration of abandoned mine
CN116109933B (en) * 2023-04-13 2023-06-23 山东省土地发展集团有限公司 Dynamic identification method for ecological restoration of abandoned mine

Also Published As

Publication number Publication date
CN109063769B (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN109063769A (en) Clustering method, system and medium for automatically determining the number of clusters based on the coefficient of variation
He et al. A two-stage genetic algorithm for automatic clustering
CN109873501B (en) Automatic identification method for low-voltage distribution network topology
CN109063945A (en) A 360-degree customer portrait construction method for an electricity retail company based on a value accounting system
Chou et al. Identifying prospective customers
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN101853389A (en) Detection device and method for multi-class targets
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN111401785A (en) Power system equipment fault early warning method based on fuzzy association rule
CN107545360A (en) A risk-control intelligent rule derivation method and system based on decision trees
CN103324939A (en) Deviation classification and parameter optimization method based on least square support vector machine technology
CN112580742A (en) Graph neural network rapid training method based on label propagation
Hruschka et al. Improving the efficiency of a clustering genetic algorithm
Sun et al. Does Every Data Instance Matter? Enhancing Sequential Recommendation by Eliminating Unreliable Data.
CN110427365A (en) Improve the address merging method and system for closing single accuracy
CN111625578B (en) Feature extraction method suitable for time series data in cultural science and technology fusion field
CN109977131A (en) A floor-plan matching system
CN112836750A (en) System resource allocation method, device and equipment
CN102262682A (en) Rapid attribute reduction method based on rough classification knowledge discovery
CN109543712B (en) Method for identifying entities on temporal data set
CN107423759B (en) Comprehensive evaluation method, device and application of low-dimensional successive projection pursuit clustering model
Kaewwichian Multiclass classification with imbalanced datasets for car ownership demand model–Cost-sensitive learning
CN110009024A (en) A kind of data classification method based on ID3 algorithm
CN109344320A (en) A kind of book recommendation method based on Apriori
CN110084376B (en) Method and device for automatically separating data into boxes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant