WO2021029835A1 - A method and system for clustering performance evaluation and increment

Info

Publication number: WO2021029835A1
Application number: PCT/TR2019/050681
Authority: WO
Inventors: Şadi Evren ŞEKER
Original assignee: Bilkav Eğitim Danişmanlik A.Ş.
Priority date: 2019-08-09
Filing date: 2019-08-09
Publication date: 2021-02-18

Abstract

This invention relates to a non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions executable by an associated processor to (i) cluster the data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster; (ii) accept each cluster having a cluster measure satisfying a first threshold value; (iii) recluster each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of each subcluster; (iv) accept each subcluster having a subcluster measure satisfying a second threshold value; (v) perform one of the following operations on each subcluster that fails to satisfy the second threshold value: (a) leave the subcluster as it is, or (b) move the subcluster to one of the next best clusters, or (c) make the subcluster a new cluster; and (vi) repeat the steps (iii) to (v) until a termination event occurs.

Description

A METHOD AND SYSTEM FOR CLUSTERING PERFORMANCE EVALUATION AND INCREMENT

The present invention relates generally to the field of data clustering, and more particularly to computer-implemented automated data analysis and clustering.

The present invention also relates to a clustering performance evaluation and increment by subclustering.

Clustering is a commonly used method in many fields of both the pure and social sciences for analyzing very large databases of information. Clustering groups the data into classes or categories, which often helps to describe similarities and differences in the data in a way that aids understanding and reveals relationships. Clustering can also assign a weight or significance to each group, identify a subset of the data, and identify the data that are least similar to the rest of the database. Basic principles of data clustering are described in Jain et al., “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323 (September 1999), the entirety of which is incorporated by reference herein.

In most business cases, a clustering algorithm is followed by a supervised learning technique such as classification or regression, and the final model, comprising both the clustering and the supervised learning, is applied to a business or analytics problem.

Assessing the quality of a model is one of the most important considerations when deploying any machine learning algorithm. For supervised learning problems, this is straightforward: labels already exist for every example, so the practitioner can test the model’s performance on a reserved evaluation set. One of the main problems of clustering is that it is an unsupervised learning technique, so its results cannot be evaluated without prior knowledge. Evaluation of clustering algorithms is also important because of the growing use of consensus learning and automated machine learning, both of which require an automated evaluation technique for clustering algorithms.

In the state of the art, United States Patent Application US 2005/0149466 A1 provides a method, system, and article of manufacture for selecting prospects for a product promotion through clustering. When the number of initially identified prospects does not match the target number of prospects, the final selection of prospects is determined by performing a culling process or an augmenting process to reduce or increase, respectively, the initial set of prospects using a heuristic measure H, until the number of prospects in the initial set matches the target number of prospects. The method described in that document does not address the problem of finding a better cluster membership for a prospect than the initially identified one; it addresses only the number of prospects.

The performance evaluation of clustering algorithms (known in the literature as cluster validity) can be divided into three groups. First, internal statistical and machine learning methods: Silhouette index, error sum of squares, separation, Bayesian information criterion. Second, external statistical and machine learning methods: purity, accuracy, entropy, precision/recall, F-measure, Jaccard index or Rand index. Third, field tests of business cases, such as A/B tests. This last approach is the final evaluation in most cases, since the ultimate purpose of almost all systems is to increase the success of the business case.
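To make the distinction between the first two groups concrete, the following is a minimal Python sketch of one internal measure (error sum of squares) and one external measure (purity). The function names and the tuple-based data layout are assumptions made for illustration; they are not taken from the application.

```python
from collections import Counter

def purity(labels_true, labels_pred):
    """External measure: share of points carrying the majority true label of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, []).append(true_label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)

def error_sum_of_squares(points, labels_pred):
    """Internal measure: summed squared distance of every point to its cluster centroid."""
    groups = {}
    for point, cluster_id in zip(points, labels_pred):
        groups.setdefault(cluster_id, []).append(point)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        )
    return total
```

The external measure needs known labels, whereas the internal measure needs only the data and the cluster assignment, which is why only the latter is available in a purely unsupervised setting.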

The present invention provides a novel, end-to-end evaluation technique as well as a method for increasing the success rate of the clustering algorithm. The main problem with most clustering algorithms is that the clusters do not represent 100% membership for all the data points they cover.

In the state of the art, United States Patent Application US 2014/0372214 A1 discloses a hierarchical clustering algorithm which is performed on a plurality of feature vectors to provide a plurality of clusters, with a cluster similarity measure for each cluster representing the quality of the cluster. Compared to hierarchical clustering (HC), the algorithm of the present invention is based on the success of the business case; furthermore, it can be applied to big data problems, to which HC cannot easily be applied because of performance problems.

There is a need for a method that systematically and cost-effectively evaluates the performance of clustering algorithms and potential predictive modeling techniques for prediction problems.

The inventor has recognized and appreciated that statistical learning techniques can be used to systematically and cost-effectively evaluate the potential predictive modeling solutions for prediction problems.

The aim of the present invention is to automate the evaluation of clustering algorithms based on business case success and to apply a subclustering approach that boosts that success automatically.

This aim has been achieved by the clustering method as defined in Claims 1, 10, 11 and 14. Further advantages of the invention have been attained by the technical features defined in the dependent claims.

In a non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions are executable by an associated processor to:

(i) cluster the data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;

(ii) accept each cluster having a cluster measure satisfying a first threshold value;

(iii) recluster each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of the subcluster;

(iv) accept each subcluster having a subcluster measure satisfying a second threshold value;

(v) perform either one of the following operations on each subcluster that fails to satisfy the second threshold value:

(a) leave the subcluster as it is, or

(b) move the subcluster to one of the next best clusters, or

(c) make the subcluster a new cluster,

(vi) repeat the steps (iii) to (v) until a termination event occurs.

Figure 1 illustrates one example of a clustering system 100. At 101, data points are clustered with a clustering algorithm to provide clusters and a cluster measure for each cluster. Each of the plurality of clusters generated has an associated cluster similarity measure, which can be generated as part of the clustering process or afterward, that represents the quality of the cluster. In one implementation, the quality of the cluster represents the extent to which data points within the cluster contain similar values across the selected features.

At 102, a first threshold value is determined. In one implementation, the cluster measures are displayed to a user and the user identifies an appropriate threshold value to separate high quality clusters from lower quality clusters and provides the threshold value through an appropriate input device. In another implementation, the threshold value can be calculated from the cluster similarity measures associated with the plurality of clusters.
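The text does not specify how the threshold is calculated from the cluster measures. As one possible reading, the sketch below derives the threshold from the mean of the measures, optionally lowered by a multiple of their standard deviation; the function names, the `spread` parameter and the sample values are all hypothetical.

```python
from statistics import mean, pstdev

def threshold_from_measures(measures, spread=0.0):
    """Assumed rule: mean of the measures minus `spread` population standard deviations."""
    return mean(measures) - spread * pstdev(measures)

def split_by_threshold(measures, threshold):
    """Partition cluster ids into accepted ones and ones that need reclustering."""
    accepted = [cid for cid, m in measures.items() if m >= threshold]
    rejected = [cid for cid, m in measures.items() if m < threshold]
    return accepted, rejected

# Hypothetical cluster measures, higher meaning better quality.
measures = {"c0": 0.9, "c1": 0.8, "c2": 0.4}
threshold = threshold_from_measures(list(measures.values()))
accepted, rejected = split_by_threshold(measures, threshold)
```

With these sample values the threshold is about 0.7, so `c0` and `c1` are accepted and `c2` is sent to reclustering. Any other statistic of the measures (a median or a percentile, for example) could be substituted.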

At 103, a cluster is selected. It will be appreciated that the clusters can be evaluated in any order. At 104, it is determined, for each of the clusters, whether its associated cluster measure meets the threshold value. If the threshold is met (YES), the cluster is left as it is (114). If the threshold is not met (NO), the cluster is subjected to reclustering at 105 to provide a plurality of subclusters and associated measures. At 106, a second threshold value is then determined.

At 107, a subcluster is selected. At 108, it is determined, for each of the subclusters, whether its associated subcluster measure meets the second threshold value. If the threshold is met (YES), the subcluster is left as it is (115). If the threshold is not met (NO), there are three options: at 109, the subcluster is left as it is; at 110, the subcluster is moved to one of the next best clusters; at 111, the subcluster is made a new cluster.

The method then advances to 112, where it is determined whether all subclusters have been selected. If not (NO), the method returns to 107 to select a new subcluster. If all subclusters have been selected (YES), the system advances to 113, where it is determined whether all clusters have been selected. If not (NO), the method returns to 103 to select a new cluster. If all clusters have been selected (YES), the method terminates.

In an embodiment of the invention, the instructions stored on the non-transitory computer readable medium are further executable to display the cluster measure for each cluster and the subcluster measure for each subcluster to a user, and to accept a provided value for the threshold values from the user through an appropriate input device.
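As an illustration of the Figure 1 flow (103 to 113), the following is a minimal Python sketch of one pass over the clusters. Several parts are assumptions and not taken from the application: the quality measure (`quality`, the inverse of one plus the mean squared distance to the centroid), the stand-in reclustering step (`bisect`, a split at the mean of the widest feature), the reading of "next best cluster" as the other cluster with the nearest centroid, and the single `on_fail` option applied to every failing subcluster.

```python
def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quality(points):
    """Assumed measure in (0, 1]: higher means points lie closer to their centroid."""
    c = centroid(points)
    return 1.0 / (1.0 + sum(sq_dist(p, c) for p in points) / len(points))

def bisect(points):
    """Stand-in reclustering algorithm: split at the mean of the widest feature."""
    dim = len(points[0])
    widest = max(
        range(dim),
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    cut = sum(p[widest] for p in points) / len(points)
    low = [p for p in points if p[widest] <= cut]
    high = [p for p in points if p[widest] > cut]
    return [part for part in (low, high) if part]

def refine(clusters, first_threshold, second_threshold, on_fail="leave"):
    """One pass of the flow; on_fail is 'leave' (109), 'move' (110) or 'new' (111)."""
    result = [{"points": list(c), "subclusters": []} for c in clusters]
    new_clusters = []
    for i, cluster in enumerate(clusters):                 # 103 / 113: visit each cluster
        if quality(cluster) >= first_threshold:            # 104: threshold met
            continue                                       # 114: cluster left as it is
        others = [j for j in range(len(clusters)) if j != i]
        for sub in bisect(cluster):                        # 105, 107 / 112
            if quality(sub) >= second_threshold or on_fail == "leave":
                result[i]["subclusters"].append(sub)       # 115 or 109: stays in parent
                continue
            for p in sub:                                  # detach the subcluster
                result[i]["points"].remove(p)
            if on_fail == "move" and others:               # 110: nearest other cluster
                target = min(
                    others,
                    key=lambda j: sq_dist(centroid(clusters[j]), centroid(sub)),
                )
                result[target]["points"].extend(sub)
            else:                                          # 111: promote to a new cluster
                new_clusters.append({"points": list(sub), "subclusters": []})
    return [c for c in result if c["points"]] + new_clusters
```

Calling `refine(clusters, 0.5, 0.5, on_fail="new")` on a list of point lists returns one dictionary per resulting cluster, with any retained subclusters nested under their parent. In practice the quality measure and the reclustering step would be replaced by whichever algorithm and measure the deployment uses.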

In an embodiment of the invention, the cluster measure is calculated. The cluster measure, which represents the quality of the cluster, can be generated as part of the clustering process or afterward. In one implementation, the quality of the cluster represents the extent to which data points within the cluster contain similar values across the selected features.

In an embodiment of the invention, the subcluster measure is calculated. The subcluster measure, which represents the quality of the subcluster, can be generated as part of the subclustering process or afterward. In one implementation, the quality of the subcluster represents the extent to which data points within the subcluster contain similar values across the selected features.

In an embodiment of the invention, the instructions are further executable to repeat the steps (iii) to (v) by using a different first threshold value than the previous one.

In an embodiment of the invention, the instructions are further executable to repeat the steps (iii) to (v) by using a different second threshold value than the previous one.

In an embodiment of the invention, the data to be clustered comprises a customer database containing prospects for a product promotion, or an employee database containing prospects for assigning jobs, or a consumer database containing prospects for fraud detection.

In an exemplary embodiment, customers are segmented and ads are displayed depending on the customer segments. During the marketing process, customers (current or potential) are segmented based on their features (such as income level, age, credit rating, etc.) and on the actions they have taken (such as number of visits, number of purchases, volume of total transactions, etc.). Based on the clusters created, the marketing automation decides which ads to display or which products to promote to each customer segment. The proposed method automates the performance evaluation of the clustering and increases the success of the clustering at the same time, together with assigning ads to the customer segments.
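One way to read "success of the business case" in this example is a per-segment response rate that serves as the cluster measure. The sketch below assumes this interpretation; the function name, the customer identifiers and the 0.75 threshold are hypothetical.

```python
def segment_conversion_rates(assignments, converted):
    """assignments maps customer id -> segment; converted is the set of ids that responded."""
    totals, hits = {}, {}
    for customer, segment in assignments.items():
        totals[segment] = totals.get(segment, 0) + 1
        if customer in converted:
            hits[segment] = hits.get(segment, 0) + 1
    return {segment: hits.get(segment, 0) / totals[segment] for segment in totals}

# Hypothetical campaign outcome for four customers in two segments.
assignments = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
converted = {"c1", "c2", "c3"}
rates = segment_conversion_rates(assignments, converted)

# Segments below an assumed first threshold would be candidates for reclustering.
to_recluster = [segment for segment, rate in rates.items() if rate < 0.75]
```

Here segment `A` converts fully and segment `B` at one half, so only `B` would be reclustered under the assumed threshold.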

In another exemplary embodiment, employees are segmented and jobs (or promotions) are assigned based on the employee segments. The objective function might be the total performance (evaluation) after the assignments made on the basis of the clusters created.

Another well-known use of clustering algorithms is fraud detection (or outlier detection in general). In this approach, the segments are created and a data point is detected as an outlier if it does not fall into any of the segments. In other words, the outliers are the data points that remain outside all the segments after the segmentation.
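The outlier rule described above can be sketched as follows, under the assumption that each segment is represented by a centroid and a shared radius, and that a point "falls into" a segment when it lies within that radius. The representation and the names `find_outliers` and `radius` are illustrative choices, not part of the application.

```python
import math

def find_outliers(points, centroids, radius):
    """Return the points that lie farther than `radius` from every segment centroid."""
    outliers = []
    for point in points:
        nearest = min(math.dist(point, c) for c in centroids)
        if nearest > radius:
            outliers.append(point)
    return outliers

# Hypothetical two-feature transactions and a single segment centred at (0.5, 0.0).
transactions = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
suspicious = find_outliers(transactions, centroids=[(0.5, 0.0)], radius=2.0)
```

In this example only the transaction at (10.0, 10.0) stays outside the segment and is flagged.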

In an embodiment of the invention the threshold values are selected based at least in part on feedback from an environment, wherein the environment specifies the success results from a domain specific data set.

In an embodiment of the invention, the termination event is one or more of the satisfaction of an objective function and the achievement of a desired success rate in a domain specific data set.
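The repetition of steps (iii) to (v) with changing thresholds until a termination event can be sketched as a simple loop. In this hedged example, `refine_once` and `success_rate` are hypothetical callables standing in for the reclustering pass and for the feedback from the environment, and the list of threshold pairs is an assumed schedule.

```python
def run_rounds(refine_once, success_rate, clusters, thresholds, target_success):
    """Repeat the refinement with a new threshold pair per round until the target is met."""
    history = []
    for first_threshold, second_threshold in thresholds:
        clusters = refine_once(clusters, first_threshold, second_threshold)
        rate = success_rate(clusters)          # feedback from the environment
        history.append((first_threshold, second_threshold, rate))
        if rate >= target_success:             # termination event: desired success rate
            break
    return clusters, history
```

If the schedule is exhausted before the target is reached, the loop simply stops after the last pair; a deployment might treat that as a second termination event or extend the schedule.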

In a data clustering system for clustering data points, comprising: a non-transitory computer readable medium storing machine executable instructions comprising: clustering algorithms to provide clusters and a cluster measure for each cluster representing the quality of the cluster; a cluster analysis component accepting each cluster having a cluster measure satisfying a first threshold value; reclustering algorithms to provide subclusters from each cluster having a cluster measure that does not satisfy the first threshold value and a subcluster measure for each subcluster representing the quality of the subcluster; wherein the cluster analysis component performs one of the following operations on each subcluster that fails to satisfy the second threshold value:

(a) leave the subcluster as it is, or

(b) move the subcluster to one of the next best clusters, or

(c) make the subcluster a new cluster;

and a processor to execute the machine readable instructions stored on the non-transitory computer readable medium.

In a system for training a clustering model using machine learning, based on feedback from an environment, the system comprising: at least one processor; and a storage medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions comprising:

(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;

(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;

(iii) reclustering each cluster that fails to satisfy the first threshold value to provide a set of subclusters, and a subcluster measure representing the quality of the subcluster;

(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on the feedback from an environment;

(v) performing either one of the following operations on each subcluster that fails to satisfy the second threshold value:

(a) leaving the subcluster as it is, or

(b) moving the subcluster to one of the next best clusters, or

(c) making the subcluster a new cluster,

(vi) repeating the steps (iii) to (v) until a termination event occurs.

In an embodiment of the invention the environment specifies the success results from a domain specific data set.

In an embodiment of the invention, the storage medium stores instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions further comprising: (vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.
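Step (vii) amounts to running several candidate clustering algorithms on the same domain specific data set and keeping the one with the best success result. The sketch below assumes each algorithm is a callable that returns clusters and that `success_of` is a hypothetical scoring function supplied by the environment.

```python
def select_best_algorithm(algorithms, data, success_of):
    """Score every candidate algorithm on the data and return the best name with all scores."""
    scores = {name: success_of(algorithm(data)) for name, algorithm in algorithms.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Ties are resolved in favour of the first algorithm in the dictionary, which is a property of `max` rather than a requirement of the method.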

In a computer-implemented method for training a clustering model using machine learning, based on feedback from an environment, the method comprising:

(iv) accepting each subcluster having a subcluster measure satisfying a second threshold value; wherein the second threshold value is selected based at least in part on feedback from an environment;

(a) leaving the subcluster as it is, or

(b) moving the subcluster to one of the next best clusters, or

(c) making the subcluster a new cluster,

(vi) repeating the steps (iii) to (v) until a termination event occurs.

In an embodiment the environment specifies the success results from a domain specific data set.

In an embodiment, the method further comprises (vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.

Brief description of the drawings:

Figure 1 illustrates one example of a clustering system.

Claims

1. A non-transitory computer readable medium storing machine executable instructions to perform a method for clustering data comprising data points, the instructions executable by an associated processor to:

(a) leave the subcluster as it is, or

(b) move the subcluster to one of the next best clusters, or

(c) make the subcluster a new cluster,

(vi) repeat the steps (iii) to (v) until a termination event occurs.

2. The non-transitory computer readable medium of claim 1, the instructions being further executable to: display the cluster measure for each cluster and subcluster measure for each subcluster to a user; and accept a provided value for the threshold values from the user through an appropriate input device.

5. The non-transitory computer readable medium of any of claims 1 to 4, the instructions being further executable to repeat the steps (iii) to (v) by using a different first threshold value than the previous one.

6. The non-transitory computer readable medium of any of claims 1 to 5, the instructions being further executable to repeat the steps (iii) to (v) by using a different second threshold value than the previous one.

7. The non-transitory computer readable medium of any of claims 1 to 6, wherein the data to be clustered comprises a customer database containing prospects for a product promotion, or an employee database containing prospects for assigning jobs, or a consumer database containing prospects for fraud detection.

8. The non-transitory computer readable medium of any of claims 1 to 7, wherein the threshold values are selected based at least in part on feedback from an environment, wherein the environment specifies the success results from a domain specific data set.

9. The non-transitory computer readable medium of any of claims 1 to 8, wherein the termination event is one or more of the satisfaction of an objective function and the achievement of a desired success rate in a domain specific data set.

10. A data clustering system for clustering data points, comprising: a non-transitory computer readable medium storing machine executable instructions comprising: clustering algorithms to provide clusters and a cluster measure for each cluster representing the quality of the cluster; a cluster analysis component accepting each cluster having a cluster measure satisfying a first threshold value; reclustering algorithms to provide subclusters from each cluster having a cluster measure that does not satisfy the first threshold value and a subcluster measure for each subcluster representing the quality of the subcluster; wherein the cluster analysis component performs one of the following operations on each subcluster that fails to satisfy the second threshold value:

(a) leave the subcluster as it is, or

(b) move the subcluster to one of the next best clusters, or

11. A system for training a clustering model using machine learning, based on feedback from an environment, the system comprising: at least one processor; and a storage medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions comprising:

(a) leaving the subcluster as it is, or

(b) moving the subcluster to one of the next best clusters, or

(c) making the subcluster a new cluster,

(vi) repeating the steps (iii) to (v) until a termination event occurs.

12. The system of claim 11, wherein the environment specifies the success results from a domain specific data set.

13. The system of claim 12, a storage medium storing instructions that, when executed by the at least one processor, cause the at least one processor to perform instructions comprising:

(vii) determining the clustering algorithm by which the most successful result is achieved in a domain specific data set.

14. A computer-implemented method for training a clustering model using machine learning, based on feedback from an environment, the method comprising:

(i) clustering data comprising data points by using a clustering algorithm to provide clusters, and a cluster measure for each cluster representing the quality of the cluster;

(ii) accepting each cluster having a cluster measure satisfying a first threshold value; wherein the first threshold value is selected based at least in part on feedback from an environment;

(a) leaving the subcluster as it is, or

(b) moving the subcluster to one of the next best clusters, or

(c) making the subcluster a new cluster,

(vi) repeating the steps (iii) to (v) until a termination event occurs.

15. The method of claim 14, wherein the environment specifies the success results from a domain specific data set.

16. The method of claim 15, further comprising: