WO2021029835A1 - A method and system for clustering performance evaluation and increment - Google Patents
- Publication number
- WO2021029835A1 (application PCT/TR2019/050681)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cluster
- subcluster
- threshold value
- measure
- clustering
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This invention relates to a non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions executable by an associated processor to (i) cluster the data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster; (ii) accept each cluster having a cluster measure satisfying a first threshold value; (iii) recluster each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster; (iv) accept each subcluster having a subcluster measure satisfying a second threshold value; (v) perform one of the following operations on each subcluster that fails to satisfy the second threshold value: (a) leave the subcluster as it is, or (b) move the subcluster to one of the next best clusters, or (c) make the subcluster a new cluster; (vi) repeat steps (iii) to (v) until a termination event occurs.
Description
A METHOD AND SYSTEM FOR CLUSTERING PERFORMANCE EVALUATION AND INCREMENT
The present invention relates generally to the field of data clustering, and more particularly to computer-implemented automated data analysis and clustering.
The present invention also relates to a clustering performance evaluation and increment by subclustering.
Clustering is a commonly used method in many fields of both pure and social sciences for analyzing very large databases of information. Clustering groups the data into classes or categories, which often helps to describe similarities and differences in the data and to understand the relationships within it. Clustering can also assign a weight or significance to each group, identify a subset of the data, and identify the data that are least similar to the rest of the database. Basic principles of data clustering are described in Jain et al., “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323 (September 1999), the entirety of which is incorporated by reference herein.
In most of the business cases, a clustering algorithm is followed by a supervised learning technique such as classification or regression. Also, the final model, including the clustering and the supervised learning, is applied to a business or an analytics problem.
Assessing the quality of a model is one of the most important considerations when deploying any machine learning algorithm. For supervised learning problems, this is comparatively easy: every example already has a label, so the practitioner can test the model’s performance on a reserved evaluation set. One of the main problems of clustering is that it is an unsupervised learning technique, so it cannot be evaluated in the same way without prior knowledge. Evaluation of clustering algorithms is also becoming more important because of the growing interest in consensus learning and automated machine learning, both of which require an automated evaluation technique for clustering algorithms.
In the state of the art, United States Patent Application US 2005/0149466 A1 provides a method, system, and article of manufacture for selecting prospects for a product promotion through clustering. When the number of initially identified prospects mismatches the target number of prospects, the final selection of prospects is determined by performing a culling process or augmenting process to reduce or increase, respectively, the initial set of prospects using a heuristic measure H, until the number of prospects in the initial set of prospects matches the target number of prospects. The method described in this document does not deal with the problem of a better membership of a prospect in a cluster than the initially identified one, but only with the number of prospects.
The performance evaluation of clustering algorithms (known in the literature as cluster validity) can be divided into three groups. First, internal statistical and machine learning methods: Silhouette index, Error Sum of Squares, Separation, Bayesian Information Criterion. Second, external statistical and machine learning methods: Purity, Accuracy, Entropy, Precision/Recall, F-Measure, Jaccard index or Rand index. Third, field tests of business cases, such as A/B tests. This last approach is the final evaluation in most cases, since the ultimate purpose of almost all systems is to increase the success of the business case.
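As a small illustration of the second (external) group, purity can be computed whenever ground-truth labels are available. The following is a minimal sketch only: the function name `purity` and the toy cluster assignments and labels are illustrative assumptions and do not come from the present disclosure.

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of points whose true label is the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(y)
    # For each cluster, count the points carrying its most common true label.
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority_total / len(true_labels)

# Toy data (made up for illustration): six points in two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
score = purity(clusters, labels)  # 5 of the 6 points match their cluster's majority label
```

Unlike the field tests of the third group, a measure of this kind needs labelled data, which is often unavailable for clustering problems.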
The present invention provides a novel, end-to-end evaluation technique as well as a method for increasing the success rate of the clustering algorithm. The main problem with most clustering algorithms is that the clusters do not represent 100% membership for all the data points they cover.
In the state of the art, United States Patent Application US 2014/0372214 A1 discloses a hierarchical clustering algorithm which is performed on the plurality of feature vectors to provide a plurality of clusters with a cluster similarity measure for each cluster representing the quality of the cluster. Compared to hierarchical clustering (HC), the algorithm of the present invention is based on the success of the business case; furthermore, it can be applied to big data problems, whereas HC cannot be applied easily because of performance problems.
There is a need for a method that systematically and cost-effectively evaluates the performance of clustering algorithms and potential predictive modeling techniques for prediction problems.
The inventor has recognized and appreciated that statistical learning techniques can be used to systematically and cost-effectively evaluate the potential predictive modeling solutions for prediction problems.
The aim of the present invention is to provide a method that automates the evaluation of clustering algorithms based on business case success and applies a subclustering approach to boost that success automatically.
This aim has been achieved by the clustering method as defined in Claims 1, 10, 11 and 14. Further advantages of the invention have been attained by the technical features defined in the dependent claims.
The invention provides a non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions executable by an associated processor to:
(i) cluster the data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accept each cluster having a cluster measure satisfying a first threshold value;
(iii) recluster each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster;
(iv) accept each subcluster having a subcluster measure satisfying a second threshold value;
(v) perform one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leave the subcluster as it is, or
(b) move the subcluster to one of the next best clusters, or
(c) make the subcluster a new cluster,
(vi) repeat the steps (iii) to (v) until a termination event occurs.
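Steps (i) to (vi) can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the claimed implementation: the data are one-dimensional, `split_largest_gap` is a hypothetical stand-in for any clustering algorithm, `measure` is a hypothetical quality measure, option (c) is always chosen in step (v), and the termination event is simply "nothing left to refine or `max_rounds` reached".

```python
def measure(values):
    """Hypothetical quality measure: 1.0 for a single-valued cluster, lower as the spread grows."""
    return 1.0 / (1.0 + max(values) - min(values))

def split_largest_gap(values):
    """Hypothetical (re)clustering step: cut the sorted 1-D values at the widest gap."""
    v = sorted(values)
    if len(v) < 2:
        return [v]
    cut = max(range(1, len(v)), key=lambda i: v[i] - v[i - 1])
    return [v[:cut], v[cut:]]

def refine(values, t1, t2, max_rounds=5):
    clusters = split_largest_gap(values)                  # (i) cluster the data points
    final = [c for c in clusters if measure(c) >= t1]     # (ii) accept clusters meeting t1
    pending = [c for c in clusters if measure(c) < t1]
    for _ in range(max_rounds):                           # (vi) repeat until termination
        if not pending:
            break
        next_pending = []
        for c in pending:
            for sub in split_largest_gap(c):              # (iii) recluster a failing cluster
                if measure(sub) >= t2:                    # (iv) accept subclusters meeting t2
                    final.append(sub)
                else:                                     # (v)(c) keep as a new cluster and
                    next_pending.append(sub)              #        re-examine it next round
        pending = next_pending
    return sorted(final + pending)

result = refine([1, 2, 3, 10, 11, 30], t1=0.3, t2=0.3)
```

In this toy run the first split isolates the value 30, and the remaining cluster fails the first threshold and is split once more into two accepted subclusters.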
Figure 1 illustrates one example of a clustering system 100. At 101, data points are clustered with a clustering algorithm to provide clusters and a cluster measure for each cluster. Each of the plurality of clusters generated has an associated cluster similarity measure, which can be generated as part of the clustering process or afterward, that represents the quality of the cluster. In one implementation, the quality of the cluster represents the extent to which data points within the cluster contain similar values across the selected features.
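The text does not fix a particular formula for the cluster measure. One possible measure of how similar the values within a cluster are is based on the mean distance of the points to the cluster centroid. The sketch below is an assumption for illustration only; `centroid` and `cluster_measure` are hypothetical names.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def cluster_measure(points):
    """Quality in (0, 1]: 1.0 when all points coincide, smaller as the cluster spreads."""
    c = centroid(points)
    mean_dist = sum(math.dist(p, c) for p in points) / len(points)
    return 1.0 / (1.0 + mean_dist)

tight = [(0.0, 0.0), (0.0, 0.0)]   # identical points
loose = [(0.0, 0.0), (6.0, 8.0)]   # each point lies 5.0 away from the centroid (3, 4)
```

With this choice a higher value means a more homogeneous cluster, so "satisfying the threshold" would mean being greater than or equal to it.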
At 102, a first threshold value is determined. In one implementation, the cluster measures are displayed to a user, and the user identifies an appropriate threshold value to separate high quality clusters from lower quality clusters and provides the threshold value through an appropriate input device. In another implementation, the threshold value can be calculated from the cluster similarity measures associated with the plurality of clusters.
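The text does not say how the threshold would be calculated from the cluster measures. One simple possibility, shown here purely as an assumption, is to use the median of the measures, so that roughly the better half of the clusters is accepted. The cluster names and measure values are made up for illustration.

```python
from statistics import median

def first_threshold(measures):
    """Hypothetical rule: use the median of the cluster measures as the threshold."""
    return median(measures)

# Made-up measures for three clusters.
measures = {"c0": 0.9, "c1": 0.4, "c2": 0.7}
t1 = first_threshold(measures.values())
accepted = sorted(name for name, m in measures.items() if m >= t1)
```

Other statistics (a mean, a fixed percentile) would fit the description equally well; the choice is a design decision left open by the text.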
At 103, a cluster is selected. It will be appreciated that the clusters can be evaluated in any order. At 104, it is determined, for each of the clusters, if its associated cluster measure meets the first threshold value. If the threshold is met (YES), the cluster is left as it is (114). If the threshold is not met (NO), the cluster is subjected to reclustering at 105 to provide a plurality of subclusters and associated measures. At 106, a second threshold value is determined.
At 107, a subcluster is selected. At 108, it is determined, for each of the subclusters, if its associated subcluster measure meets the second threshold value. If the threshold is met (YES), the subcluster is left as it is (115). If the threshold is not met (NO), there are three options. At 109, the subcluster is left as it is. At 110, the subcluster is moved to one of the next best clusters. At 111, the subcluster is made a new cluster.
The method then advances to 112, where it is determined if all subclusters have been selected. If not (NO), the method returns to 107 to select a new subcluster. If all subclusters have been selected (YES), the system advances to 113, where it is determined if all clusters have been selected. If not (NO), the method returns to 103 to select a new cluster. If all clusters have been selected (YES), the method terminates.
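The three options at 109, 110 and 111 can be sketched as a small dispatch function. This is an illustrative sketch only: the data are one-dimensional, the names `Action` and `apply_action` are hypothetical, and "next best cluster" is assumed here to mean the other cluster whose mean is closest to the subcluster's mean, which the text does not specify. The move option assumes at least one other cluster exists.

```python
from enum import Enum

class Action(Enum):
    LEAVE = 109   # leave the subcluster as it is
    MOVE = 110    # move it to one of the next best clusters
    NEW = 111     # make it a new cluster

def mean(xs):
    return sum(xs) / len(xs)

def apply_action(action, sub, parent_rest, others):
    """Handle one failing subcluster; return (parent cluster, list of other clusters).

    sub: points of the failing subcluster
    parent_rest: remaining points of the cluster it was split from
    others: all other clusters
    """
    if action is Action.LEAVE:
        return parent_rest + sub, others
    if action is Action.MOVE:
        # Assumption: the "next best" cluster is the one with the closest mean.
        target = min(range(len(others)), key=lambda i: abs(mean(others[i]) - mean(sub)))
        moved = [c + sub if i == target else c for i, c in enumerate(others)]
        return parent_rest, moved
    return parent_rest, others + [sub]   # Action.NEW

sub, rest, others = [9.0], [1.0, 2.0], [[10.0, 11.0], [50.0]]
after_move = apply_action(Action.MOVE, sub, rest, others)
```

The enum values reuse the reference numerals of Figure 1 only as a reading aid.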
In an embodiment of the invention, the instructions stored on the non-transitory computer readable medium are further executable to display the cluster measure for each cluster and the subcluster measure for each subcluster to a user, and to accept a value for the threshold values provided by the user through an appropriate input device.
In an embodiment of the invention, the cluster measure is calculated. The cluster measure, which represents the quality of the cluster, can be generated as part of the clustering process or afterward. In one implementation, the quality of the cluster represents the extent to which data points within the cluster contain similar values across the selected features.
In an embodiment of the invention, the subcluster measure is calculated. The subcluster measure, which represents the quality of the subcluster, can be generated as part of the subclustering process or afterward. In one implementation, the quality of the subcluster represents the extent to which data points within the subcluster contain similar values across the selected features.
In an embodiment of the invention, the instructions are further executable to repeat steps (iii) to (v) using a first threshold value different from the previous one.
In an embodiment of the invention, the instructions are further executable to repeat steps (iii) to (v) using a second threshold value different from the previous one.
In an embodiment of the invention, the data to be clustered comprises a customer database containing prospects for a product promotion, an employee database containing prospects for assigning jobs, or a consumer database containing prospects for fraud detection.
In an exemplary embodiment, customers are segmented and ads are displayed depending on the customer segments. During the marketing process, customers (current or potential) are segmented based on their features (such as income level, age, credit rating etc.) and on the actions they have taken (such as number of visits, number of purchases, volume of total transactions etc.). Based on the clusters created, the marketing automation decides which ads to display or which products to promote to each customer segment. The proposed method automates the performance evaluation of the clustering and, at the same time, increases the success of the clustering together with the assignment of ads to the customer segments.
In another exemplary embodiment, employees are segmented and jobs (or promotions) are assigned based on the employee segments. The objective function might be the total performance (evaluation) after the assignments based on the clusters created.
Another well-known use of clustering algorithms is fraud detection (or outlier detection in general). In this approach, segments are created and a data point is detected as an outlier if it does not fall into any of the segments. In other words, the outliers are the data points that remain outside all the segments after the segmentation.
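The idea that outliers are the points left outside every segment can be sketched as follows. This is a minimal sketch under assumptions that are not in the text: each segment is represented by a centre and a common radius, the data are one-dimensional, and the amounts, centres and radius are made-up toy values.

```python
def find_outliers(points, centres, radius):
    """Return the points that lie farther than `radius` from every segment centre."""
    return [p for p in points if all(abs(p - c) > radius for c in centres)]

# Toy transaction amounts and two hypothetical segment centres.
amounts = [10.0, 11.0, 12.0, 50.0, 51.0, 300.0]
centres = [11.0, 50.5]
outliers = find_outliers(amounts, centres, radius=5.0)
```

Here only the amount 300.0 falls outside both segments, so it is the one point flagged for review.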
In an embodiment of the invention the threshold values are selected based at least in part on feedback from an environment, wherein the environment specifies the success results from a domain specific data set.
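One way to read "feedback from an environment" is that several candidate threshold values are tried and the one with the best observed success result is kept. The sketch below assumes exactly that; `select_threshold` is a hypothetical name, and the success rates are made-up stand-ins for results that would come from a domain specific data set.

```python
def select_threshold(candidates, run_and_score):
    """Return the candidate threshold with the highest environment score.

    run_and_score: callable mapping a threshold to a success result, for example
    the response rate observed after clustering with that threshold.
    """
    return max(candidates, key=run_and_score)

# Made-up feedback: threshold value -> observed success rate.
observed_success = {0.2: 0.31, 0.4: 0.45, 0.6: 0.38}
best = select_threshold(observed_success, observed_success.get)
```

The same selection could be applied to the first and the second threshold value independently.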
In an embodiment of the invention, the termination event is one or more of the satisfaction of an objective function and the achievement of a desired success rate in a domain specific data set.
The invention also provides a data clustering system for clustering data points, comprising: a non-transitory computer readable medium storing machine executable instructions comprising: clustering algorithms to provide clusters and a cluster measure for each cluster representing the quality of the cluster; a cluster analysis component accepting each cluster having a cluster measure satisfying a first threshold value; reclustering algorithms to provide subclusters from each cluster having a cluster measure that does not satisfy the first threshold value and a subcluster measure for each subcluster representing the quality of the subcluster; wherein the cluster analysis component performs one of the following operations on each subcluster that fails to satisfy a second threshold value:
(a) leave the subcluster as it is, or
(b) move the subcluster to one of the next best clusters, or
(c) make the subcluster a new cluster; and
a processor to execute the machine readable instructions stored on the non-transitory computer readable medium.
The invention also provides a system for training a clustering model using machine learning, based on feedback from an environment, the system comprising: at least one processor; and a storage medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions comprising:
(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;
(iii) reclustering each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster;
(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on the feedback from an environment;
(v) performing either one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leaving the subcluster as it is, or
(b) moving the subcluster to one of the next best clusters, or
(c) making the subcluster a new cluster,
(vi) repeating the steps (iii) to (v) until a termination event occurs.
In an embodiment of the invention, the environment specifies the success results from a domain specific data set.
In an embodiment of the invention, the storage medium stores instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions further comprising: (vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.
The invention also provides a computer-implemented method for training a clustering model using machine learning, based on feedback from an environment, the method comprising:
(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;
(iii) reclustering each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster;
(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on feedback from an environment;
(v) performing either one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leaving the subcluster as it is, or
(b) moving the subcluster to one of the next best clusters, or
(c) making the subcluster a new cluster,
(vi) repeating the steps (iii) to (v) until a termination event occurs.
In an embodiment, the environment specifies the success results from a domain specific data set.
In an embodiment, the method further comprises (vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.
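Step (vii) can be sketched as running each candidate clustering algorithm and keeping the one with the best success result. This is an illustrative sketch only: `halves` and `largest_gap` are hypothetical stand-ins for real clustering algorithms on one-dimensional data, and `toy_success` is a made-up substitute for the success result that the environment would report.

```python
def halves(values):
    """Hypothetical algorithm A: split the sorted values into two halves."""
    v = sorted(values)
    mid = len(v) // 2
    return [v[:mid], v[mid:]]

def largest_gap(values):
    """Hypothetical algorithm B: split the sorted values at the widest gap."""
    v = sorted(values)
    cut = max(range(1, len(v)), key=lambda i: v[i] - v[i - 1])
    return [v[:cut], v[cut:]]

def toy_success(clusters):
    """Stand-in for environment feedback: tighter clusters score higher."""
    return -sum(max(c) - min(c) for c in clusters)

def best_algorithm(algorithms, data, success):
    """Run every candidate and return (name, score) of the highest-scoring one."""
    scores = {name: success(algo(data)) for name, algo in algorithms.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

candidates = {"halves": halves, "largest_gap": largest_gap}
winner = best_algorithm(candidates, [1, 2, 3, 10, 11, 30], toy_success)
```

In practice the score would come from the domain specific data set, for instance the outcome of a field test, rather than from a geometric proxy such as the one used here.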
Brief description of the drawings:
Figure 1 illustrates one example of a clustering system.
Claims
1. A non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions executable by an associated processor to:
(i) cluster the data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accept each cluster having a cluster measure satisfying a first threshold value;
(iii) recluster each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster;
(iv) accept each subcluster having a subcluster measure satisfying a second threshold value;
(v) perform either one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leave the subcluster as it is, or
(b) move the subcluster to one of the next best clusters, or
(c) make the subcluster a new cluster,
(vi) repeat the steps (iii) to (v) until a termination event occurs.
2. The non-transitory computer readable medium of claim 1 , the instructions being further executable to: display the cluster measure for each cluster and subcluster measure for each subcluster to a user; and accept a provided value for the threshold values from the user through an appropriate input device.
5. The non-transitory computer readable medium of any of claims 1 to 4, the instructions being further executable to repeat the steps (iii) to (v) by using a different first threshold value than the previous one.
6. The non-transitory computer readable medium of any of claims 1 to 5, the instructions being further executable to repeat the steps (iii) to (v) by using a different second threshold value than the previous one.
7. The non-transitory computer readable medium of any of claims 1 to 6, wherein the data to be clustered comprises a customer database containing prospects for a product promotion, or an employee database containing prospects for assigning jobs, or a consumer database containing prospects for fraud detection.
8. The non-transitory computer readable medium of any of claims 1 to 7, wherein the threshold values are selected based at least in part on feedback from an environment, wherein the environment specifies the success results from a domain-specific data set.
9. The non-transitory computer readable medium of any of claims 1 to 8, wherein the termination event is one or more of the satisfaction of an objective function and the achievement of a desired success rate in a domain-specific data set.
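The interplay of changing threshold values between repetitions, feedback from an environment, and a termination event defined by a desired success rate may be sketched as follows. This is only an illustration: the function name `tune_thresholds`, the multiplicative `shrink` factor, the round cap, and the toy stand-ins for the clustering run and the environment feedback are all hypothetical and are not taken from the specification.

```python
def tune_thresholds(run_clustering, feedback, t1, t2, target, shrink=0.5, max_rounds=10):
    """Repeat a clustering run with new thresholds until the feedback reaches `target`.

    run_clustering(t1, t2) -> any clustering result (assumed interface)
    feedback(result)       -> success rate in [0, 1] reported by the environment (assumed)
    """
    history = []
    for _ in range(max_rounds):
        result = run_clustering(t1, t2)
        rate = feedback(result)
        history.append((t1, t2, rate))
        if rate >= target:                       # termination event: desired success rate
            break
        t1, t2 = t1 * shrink, t2 * shrink        # use different thresholds than before
    return t1, t2, history


# Toy stand-ins so the sketch runs on its own; a real system would plug in
# an actual clustering run and a domain-specific success measurement.
def toy_clustering(t1, t2):
    return {"t1": t1, "t2": t2}


def toy_feedback(result):
    return 1.0 - result["t1"]                    # pretend tighter thresholds succeed more often


best_t1, best_t2, history = tune_thresholds(
    toy_clustering, toy_feedback, t1=1.0, t2=0.8, target=0.5
)
```

Keeping the feedback function as a parameter reflects the idea that the success results come from outside the clustering step itself, for example from the outcome of a promotion run on the accepted clusters.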
10. A data clustering system for clustering data points, comprising: a non-transitory computer readable medium storing machine executable instructions comprising: clustering algorithms to provide clusters and a cluster measure for each cluster representing the quality of the cluster; a cluster analysis component accepting each cluster having a cluster measure satisfying a first threshold value; reclustering algorithms to provide subclusters from each cluster having a cluster measure that does not satisfy the first threshold value and a subcluster measure for each subcluster representing the quality of the subcluster; wherein the cluster analysis component performs either one of the following operations on each subcluster that fails to satisfy a second threshold value:
(a) leave the subcluster as it is, or
(b) move the subcluster to one of the next best clusters, or
(c) make the subcluster a new cluster; and
a processor to execute the machine readable instructions stored on the non-transitory computer readable medium.
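The three operations available for a failing subcluster may be illustrated with the short Python sketch below. It assumes that "next best cluster" means the existing cluster whose centroid is nearest to the subcluster's centroid; the claims do not define the term, so this reading, the function name `handle_failed`, and the list-based data layout are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def handle_failed(sub, clusters, unresolved, option):
    """Apply operation 'a', 'b' or 'c' to a failing subcluster (lists are modified in place).

    sub        -- the failing subcluster (list of points)
    clusters   -- existing top-level clusters (list of lists of points)
    unresolved -- subclusters that are simply left as they are
    """
    if option == "a":
        unresolved.append(sub)                   # (a) leave the subcluster as it is
    elif option == "b":
        c = centroid(sub)
        # Assumed reading of "next best": nearest existing cluster by centroid distance.
        target = min(clusters, key=lambda cl: math.dist(centroid(cl), c))
        target.extend(sub)                       # (b) move it into that cluster
    elif option == "c":
        clusters.append(sub)                     # (c) make it a new cluster
    else:
        raise ValueError("option must be 'a', 'b' or 'c'")
```

For example, with `clusters = [[(0, 0), (1, 1)], [(10, 10), (11, 11)]]`, calling `handle_failed([(9, 9)], clusters, [], "b")` moves the point into the second cluster, while option `"c"` would instead add a third cluster.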
11. A system for training a clustering model using machine learning, based on feedback from an environment, the system comprising:
at least one processor; and a storage medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions comprising:
(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;
(iii) reclustering each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of the subcluster;
(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on the feedback from the environment;
(v) performing either one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leaving the subcluster as it is, or
(b) moving the subcluster to one of the next best clusters, or
(c) making the subcluster a new cluster,
(vi) repeating the steps (iii) to (v) until a termination event occurs.
12. The system of claim 11, wherein the environment specifies the success results from a domain-specific data set.
13. The system of claim 12, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform:
(vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.
14. A computer-implemented method for training a clustering model using machine learning, based on feedback from an environment, the method comprising:
(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;
(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;
(iii) reclustering each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of the subcluster;
(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on feedback from an environment;
(v) performing either one of the following operations on each subcluster that fails to satisfy the second threshold value:
(a) leaving the subcluster as it is, or
(b) moving the subcluster to one of the next best clusters, or
(c) making the subcluster a new cluster,
(vi) repeating the steps (iii) to (v) until a termination event occurs.
15. The method of claim 14, wherein the environment specifies the success results from a domain specific data set.
16. The method of claim 15, further comprising:
(vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.
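Step (vii) may be illustrated by running several candidate algorithms on the same domain-specific data set and keeping the one with the highest success score. The sketch below is an assumption-laden example: cluster purity against known labels stands in for the "success result", and the two toy algorithms (`one_cluster`, `split_at_mean`) and the function `select_algorithm` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def purity(clusters):
    """Assumed success score: share of records carrying their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(label for _, label in c).most_common(1)[0][1] for c in clusters)
    return majority / total


def one_cluster(records):
    """Toy algorithm: put every record in a single cluster."""
    return [list(records)]


def split_at_mean(records):
    """Toy algorithm: split records by whether their value exceeds the mean value."""
    mean = sum(v for v, _ in records) / len(records)
    low = [r for r in records if r[0] <= mean]
    high = [r for r in records if r[0] > mean]
    return [c for c in (low, high) if c]


def select_algorithm(algorithms, records, success):
    """Score every candidate on the same records and return the best name and all scores."""
    scores = {name: success(algo(records)) for name, algo in algorithms.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Records are (value, domain label) pairs; the labels play the role of environment feedback.
records = [(1, "a"), (2, "a"), (8, "b"), (9, "b")]
best, scores = select_algorithm(
    {"one_cluster": one_cluster, "split_at_mean": split_at_mean}, records, purity
)
```

Any other success measure supplied by the environment, such as a conversion rate per accepted cluster, could be passed in place of `purity` without changing `select_algorithm`.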
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/TR2019/050681 WO2021029835A1 (en) | 2019-08-09 | 2019-08-09 | A method and system for clustering performance evaluation and increment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/TR2019/050681 WO2021029835A1 (en) | 2019-08-09 | 2019-08-09 | A method and system for clustering performance evaluation and increment |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021029835A1 (en) | 2021-02-18 |
Family
ID=74570675
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/TR2019/050681 WO2021029835A1 (en) | 2019-08-09 | 2019-08-09 | A method and system for clustering performance evaluation and increment |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021029835A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023016087A1 (en) * | 2021-08-09 | 2023-02-16 | 腾讯科技(深圳)有限公司 | Method and apparatus for image clustering, computer device, and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140037214A1 (en) * | 2012-07-31 | 2014-02-06 | Vinay Deolalikar | Adaptive hierarchical clustering algorithm |
US20180285772A1 (en) * | 2017-03-31 | 2018-10-04 | At&T Intellectual Property I, L.P. | Dynamic updating of machine learning models |
- 2019-08-09: WO, PCT/TR2019/050681 (WO2021029835A1), active Application Filing
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20210103858A1 (en) | Method and system for model auto-selection using an ensemble of machine learning models | |
WO2012045496A2 (en) | Probabilistic data mining model comparison engine | |
US20200151748A1 (en) | Feature-based item similarity and forecasting system | |
CN112560474B (en) | Method, device, equipment and storage medium for generating portrait of express delivery industry | |
CN111209469A (en) | Personalized recommendation method and device, computer equipment and storage medium | |
CN113822390B (en) | User portrait construction method and device, electronic equipment and storage medium | |
CN111340071A (en) | System and method for personalized product recommendation using hierarchical bayes | |
CN114997916A (en) | Prediction method, system, electronic device and storage medium of potential user | |
Chen et al. | An extended study of the K-means algorithm for data clustering and its applications | |
Ranggadara et al. | Applying customer loyalty classification with RFM and Naïve Bayes for better decision making | |
Martins et al. | Sales forecasting using machine learning algorithms | |
Mitra et al. | Sales forecasting of a food and beverage company using deep clustering frameworks | |
Fahrudin et al. | Comparison of k-medoids and k-means algorithms in segmenting customers based on RFM criteria | |
Martins et al. | Retail sales forecasting information systems: comparison between traditional methods and machine learning algorithms | |
Elrefai et al. | Using artificial intelligence in enhancing banking services | |
Bhargavi et al. | Comparative study of consumer purchasing and decision pattern analysis using pincer search based data mining method | |
WO2021029835A1 (en) | A method and system for clustering performance evaluation and increment | |
UpendraReddy et al. | Prediction of likely customers for car industries using K-Means clustering compared with Logistic Regression | |
WO2020056286A1 (en) | System and method for predicting average inventory with new items | |
Boyapati et al. | Predicting sales using Machine Learning Techniques | |
He et al. | A Novel Subspace-Based GMM Clustering Ensemble Algorithm for High-Dimensional Data | |
Puspita et al. | Hardware sales forecasting using clustering and machine learning approach | |
Hasudungan et al. | The Impact of k-means on Association Rules Mining Algorithms Performance | |
Wang et al. | Discovering consumer's behavior changes based on purchase sequences | |
Stagge | A time series forecasting approach for queue wait-time prediction |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19941708; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.05.2022) |
122 | Ep: pct application non-entry in european phase | Ref document number: 19941708; Country of ref document: EP; Kind code of ref document: A1 |