WO2022126810A1

WO2022126810A1 - Text clustering method

Info

Publication number: WO2022126810A1
Application number: PCT/CN2021/071166
Authority: WO
Inventors: 张校源; 马祥祥
Original assignee: 上海爱数信息技术股份有限公司
Priority date: 2020-12-14
Filing date: 2021-01-12
Publication date: 2022-06-23
Also published as: CN112464638B; CN112464638A

Abstract

A text clustering method, comprising: performing word segmentation, stop word removal, and keyword extraction processing on a document set to be clustered (S1); creating a text similarity matrix, an adjacency matrix, a degree matrix, and a Laplacian matrix; calculating eigenvalues and eigenvectors of the Laplacian matrix to obtain an eigenmatrix; using a clustering method to cluster the eigenmatrix to obtain a clustering result (S6); if the number of categories is known, then setting the clustering result as the final clustering result; if the number of categories is unknown, then obtaining multiple clustering results by executing the following operations multiple times and evaluating the multiple clustering results to select a final clustering result: adjusting the clustering parameters, and executing the operations of constructing the adjacency matrix, degree matrix, and Laplacian matrix, calculating eigenvalues and eigenvectors of the Laplacian matrix to obtain an eigenmatrix, and clustering the eigenmatrix to obtain a clustering result; combining the final clustering result and the extracted keywords, and extracting a category keyword on the basis of a TF-IDF algorithm; and outputting the final clustering result and the category keyword (S10).

Description

Text Clustering Method

This application claims the priority of the Chinese Patent Application No. 202011464923.4 filed with the China Patent Office on December 14, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of text analysis, for example, to a text clustering method.

Background

Text clustering is mainly based on the well-known clustering assumption: documents of the same class are more similar, while documents of different classes are less similar. As an unsupervised machine learning method, clustering is flexible and highly automated because it requires neither a training process nor manual labeling of documents in advance. It has become an important means of effectively organizing, summarizing, and navigating text information, and is attracting attention from more and more researchers.

There are several methods for text clustering: 1. the partition method; 2. the density method; 3. the hierarchical method. Commonly used clustering algorithms include kmeans and kmeans++, which belong to the partition method; density-based spatial clustering of applications with noise (DBSCAN), which belongs to the density method; and the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, which belongs to the hierarchical method. The spectral clustering algorithm is a method based on spectral graph theory. Compared with traditional clustering algorithms, it has the advantage of being able to cluster sample spaces of any shape and of converging to the global optimal solution. The spectral clustering algorithm regards each object in the data set as a vertex V of a graph and quantifies the similarity between vertices as the weight of the edge E connecting them, so that an undirected weighted graph G(V, E) based on similarity is obtained and the clustering problem is transformed into a graph partitioning problem. The optimal partition criterion based on graph theory is to maximize the similarity within subgraphs and minimize the similarity between subgraphs. The spectral clustering algorithm has different implementations, but they can be summarized in three main steps: 1) construct the similarity matrix S representing the object set; 2) calculate the degree matrix and the Laplacian matrix, and construct the eigenvector space; 3) use kmeans or another classical clustering algorithm to cluster the eigenvectors in the eigenvector space.

The above clustering methods can perform text clustering only when the number of categories is known, and they cannot give category keywords after clustering, so users cannot directly learn the subject content of a category from keywords. In addition, most of the clustering results calculated by these methods have low precision and recall, that is, the accuracy of the clustering results is low.

SUMMARY OF THE INVENTION

The present application provides a text clustering method, which can cluster the text in the case of known or unknown number of categories, and can output keywords corresponding to each category at the same time.

A text clustering method is provided, including:

performing word segmentation, stop word removal, and keyword extraction in turn on the document set to be clustered;

creating a text similarity matrix according to the extracted keywords;

constructing an adjacency matrix based on the text similarity matrix, and constructing a degree matrix based on the adjacency matrix;

constructing a Laplacian matrix by combining the adjacency matrix and the degree matrix;

calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix (eigenmatrix) corresponding to the document set to be clustered;

clustering the feature matrix using a clustering method to obtain a clustering result;

when the number of cluster categories is known, using the obtained clustering result as the final clustering result; when the number of cluster categories is unknown, obtaining multiple clustering results by performing the following operations multiple times, evaluating the multiple clustering results, and selecting the final clustering result according to the evaluation results: adjusting the clustering parameters, and returning to the operations of constructing the adjacency matrix, the degree matrix, and the Laplacian matrix, calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix, and clustering the feature matrix to obtain a clustering result;

extracting category keywords based on a term frequency-inverse document frequency (TF-IDF) algorithm by combining the final clustering result and the extracted keywords;

outputting the final clustering result and the category keywords.

Description of Drawings

FIG. 1 is a schematic flowchart of a text clustering method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a text clustering process provided by an embodiment of the present application.

Detailed Description

The present application will be described below with reference to the accompanying drawings and specific embodiments.

Example

As shown in Figure 1, a text clustering method based on an improved spectral clustering algorithm includes the following steps:

S1. Perform word segmentation, stop word removal, and keyword extraction in the document set to be clustered in sequence.

S2. Create a text similarity matrix according to the extracted keywords.

S3. Construct an adjacency matrix based on the text similarity matrix, and construct a degree matrix based on the adjacency matrix.

S4. Combine the adjacency matrix and the degree matrix to construct a Laplacian matrix.

S5. Calculate the eigenvalues and eigenvectors of the Laplacian matrix, and obtain the eigenmatrix corresponding to the document set to be clustered.

S6. Use the classical clustering method to cluster the feature matrix to obtain a corresponding clustering result.

S7. If the number of categories of clusters is known, step S9 is performed.

If the number of categories of clusters is unknown, step S8 is executed.

S8. Adjust the clustering parameters in turn to determine the corresponding number of categories, then return to steps S3 to S6 to obtain multiple adjusted clustering results, evaluate the multiple adjusted clustering results, and select the optimal clustering result.

S9. Combine the clustering result obtained in step S6 or step S8 and the keywords extracted in step S1 to extract category keywords based on the TF-IDF algorithm.

S10. Output the clustering result and the corresponding category keywords.

Applying the above method to practice, the working process is shown in Figure 2. From the user inputting the document set to the output text clustering result, it mainly includes the following processes:

1. Perform word segmentation, remove stop words and extract keywords from the input document set

Before text clustering, text keywords need to be extracted for two reasons: first, to reduce the dimension of the vectors used to create the text similarity matrix; second, so that category keywords can be extracted from the text keywords after clustering is completed. When keywords are extracted, part-of-speech filtering mainly retains keywords that are nouns, verbs, gerunds, person names, place names, and organization names, which improves the accuracy of the text similarity.
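The part-of-speech filtering described above can be illustrated with a minimal sketch. It assumes the tokens have already been segmented and tagged; the tag labels (`n`, `v`, `vn`, `nr`, `ns`, `nt`), the stop-word list, and the function name `extract_keywords` are hypothetical choices for illustration and are not taken from the application.

```python
# Hypothetical tag set (an assumption): noun, verb, gerund, person name,
# place name, organization name.
KEPT_TAGS = {"n", "v", "vn", "nr", "ns", "nt"}
# Toy stop-word list, also an assumption.
STOP_WORDS = {"the", "of", "is"}


def extract_keywords(tagged_tokens, kept_tags=KEPT_TAGS, stop_words=STOP_WORDS):
    """Keep tokens that are not stop words and whose tag is in kept_tags."""
    return [word for word, tag in tagged_tokens
            if word not in stop_words and tag in kept_tags]


tagged = [("Shanghai", "ns"), ("is", "v"), ("quickly", "d"),
          ("clustering", "vn"), ("the", "u"), ("documents", "n")]
keywords = extract_keywords(tagged)
print(keywords)
```

Here the adverb is dropped by the tag filter and "is" and "the" are dropped as stop words, so only the place name, the gerund, and the noun remain.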

2. Create a text similarity matrix through the extracted keywords

Common methods for calculating text similarity include cosine similarity, Euclidean distance, and Jaccard distance. This application calculates text similarity by constructing a bag of words: the TF-IDF value of each keyword in each text is calculated and stored in a bag-like structure; the method then checks whether one text shares keywords with another text and uses the TF-IDF values in the bag of words to calculate the similarity between the texts. This method is similar to the cosine distance calculation, but it reduces the amount of calculation and so improves efficiency. The resulting text similarity matrix is an N*N matrix in which each value is the similarity between two texts.
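A minimal sketch of the bag-of-words similarity follows. The application does not give the exact scoring formula, so this sketch assumes a cosine-style score computed only over the keywords two texts share, with a smoothed IDF; the function names `tfidf_bags` and `similarity_matrix` are hypothetical.

```python
import math
from collections import Counter


def tfidf_bags(docs):
    """docs: list of keyword lists. Returns one {keyword: tf-idf} dict per text."""
    n = len(docs)
    # Document frequency: in how many texts each keyword appears.
    df = Counter(word for doc in docs for word in set(doc))
    bags = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Smoothed IDF (an assumption) so that no keyword gets a zero weight.
        bags.append({w: (c / total) * (math.log((1 + n) / (1 + df[w])) + 1.0)
                     for w, c in counts.items()})
    return bags


def similarity_matrix(bags):
    """N*N matrix of cosine-style similarities over shared keywords only."""
    n = len(bags)
    norms = [math.sqrt(sum(v * v for v in bag.values())) for bag in bags]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = bags[i].keys() & bags[j].keys()
            dot = sum(bags[i][w] * bags[j][w] for w in shared)
            denom = norms[i] * norms[j]
            sim[i][j] = sim[j][i] = dot / denom if denom else 0.0
    return sim


docs = [["stock", "market", "fund"],
        ["stock", "fund", "bank"],
        ["football", "match", "goal"]]
S = similarity_matrix(tfidf_bags(docs))
```

Because the dot product only visits shared keywords, texts with no keyword in common get a similarity of exactly 0 without any vector arithmetic, which is where the saving over a dense cosine computation comes from.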

3. Calculate the adjacency matrix (W), degree matrix (D) and Laplace matrix (L)

Adjacency matrix (W): there are three types of methods for constructing the adjacency matrix: the ε-neighbor method, the K-neighbor method, and the full connection method. The ε-neighbor method sets a distance threshold ε and then uses the Euclidean distance to measure the distance between any two points. That is, the Euclidean distance for the text similarity matrix is:

$$s_{ij} = \lVert x_i - x_j \rVert_2$$

where s_ij is the Euclidean distance between element x_i and element x_j in the text similarity matrix. According to the size relationship between s_ij and ε, the adjacency matrix W is defined as follows:

$$w_{ij} = \begin{cases} 0, & s_{ij} > \varepsilon \\ \varepsilon, & s_{ij} \le \varepsilon \end{cases}$$

[Correction 07.04.2022 under Rule 91]

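where w_ij is the element in the i-th row and j-th column of the adjacency matrix W.

In the K-neighbor method, s_ij is retained either when one point is among the K nearest neighbors of the other point or, in the stricter variant, only when the two points are among each other's K nearest neighbors. For the first variant:

$$w_{ij} = w_{ji} = \begin{cases} 0, & x_i \notin KNN(x_j) \text{ and } x_j \notin KNN(x_i) \\ \exp\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right), & x_i \in KNN(x_j) \text{ or } x_j \in KNN(x_i) \end{cases}$$

where KNN(x_i) is the set of K nearest neighbors of element x_i, KNN(x_j) is the set of K nearest neighbors of element x_j, and σ is the variance.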

In the full connection method, unlike the first two methods, the weight between every pair of points is greater than 0, hence the name. Different kernel functions can be selected to define the edge weights; commonly used ones are the polynomial kernel function, the Gaussian kernel function, and the Sigmoid kernel function. With the Gaussian kernel function, also called the Radial Basis Function (RBF), the similarity matrix and the adjacency matrix are the same:

$$w_{ij} = \exp\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

The degree matrix D is constructed from the adjacency matrix. The degree matrix is a diagonal matrix: only the main diagonal has values, and the values in all other positions are 0. Each value on the diagonal is the sum of all the values in the corresponding row of W, namely:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

where d_i is the element on the main diagonal in the i-th row of the degree matrix D, and n is the number of texts.
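The three adjacency constructions and the degree matrix can be sketched as follows. This is an illustration under stated assumptions rather than the application's implementation: the formula images are not reproduced in this text, so the sketch uses the commonly used forms (weight ε inside the threshold, Gaussian weight for K-nearest neighbors and for full connection), sets the diagonal of W to 0 (no self-loops), and uses hypothetical function names.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gaussian(a, b, sigma):
    return math.exp(-euclidean(a, b) ** 2 / (2 * sigma ** 2))


def epsilon_adjacency(X, eps):
    """w_ij = eps when the distance is within eps, otherwise 0 (assumed form)."""
    n = len(X)
    return [[eps if i != j and euclidean(X[i], X[j]) <= eps else 0.0
             for j in range(n)] for i in range(n)]


def knn_adjacency(X, k, sigma):
    """Gaussian weight when either point is among the other's k nearest neighbors."""
    n = len(X)
    neighbors = [set(sorted((j for j in range(n) if j != i),
                            key=lambda j: euclidean(X[i], X[j]))[:k])
                 for i in range(n)]
    return [[gaussian(X[i], X[j], sigma)
             if i != j and (j in neighbors[i] or i in neighbors[j]) else 0.0
             for j in range(n)] for i in range(n)]


def full_adjacency(X, sigma):
    """Gaussian weight between every pair of distinct points."""
    n = len(X)
    return [[gaussian(X[i], X[j], sigma) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def degree_matrix(W):
    """Diagonal matrix whose i-th diagonal entry is the sum of row i of W."""
    n = len(W)
    return [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


X = [[0, 0], [0, 1], [5, 5], [5, 6]]
W_eps = epsilon_adjacency(X, eps=1.5)
W_knn = knn_adjacency(X, k=1, sigma=1.0)
W_full = full_adjacency(X, sigma=1.0)
D = degree_matrix(W_eps)
```

Zeroing the diagonal does not change L = D - W, because a self-weight would be added to d_i and subtracted again on the diagonal.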

The Laplacian matrix is symmetric because both D and W are symmetric, so all of its eigenvalues are real:

L=D-W

where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix.
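As a small worked example of L = D - W, the sketch below builds L from a toy adjacency matrix and checks two properties: L is symmetric, and every row sums to 0 (so the constant vector is an eigenvector with eigenvalue 0). The function name `laplacian` and the toy weights are assumptions for illustration.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
L = laplacian(W)

is_symmetric = all(L[i][j] == L[j][i] for i in range(3) for j in range(3))
row_sums = [sum(row) for row in L]
```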

4. Calculate the eigenvalues, eigenvectors, and feature matrix

The eigenvalues and eigenvectors are calculated from the Laplacian matrix: the eigenvalues are first solved from the characteristic polynomial of the Laplacian matrix, and the eigenvectors are then solved from the eigenvalues. The eigenvalues are then screened according to the number of clusters (m): the number of eigenvalues that satisfy the condition (for example, an eigenvalue less than (1-1/m)*0.95) is used as the number of dimensions for dimensionality reduction, and the feature matrix of the document set to be clustered is obtained through this dimensionality reduction.
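The screening step can be sketched with NumPy. Two assumptions are made here: the eigenvalues are obtained with the symmetric eigensolver `numpy.linalg.eigh` rather than by solving the characteristic polynomial as the text describes, and the example threshold (1-1/m)*0.95 is applied to L = D - W exactly as given. The function name `reduced_feature_matrix` is hypothetical.

```python
import numpy as np


def reduced_feature_matrix(L, m, factor=0.95):
    """Keep the eigenvectors whose eigenvalues are below (1 - 1/m) * factor."""
    L = np.asarray(L, dtype=float)
    # eigh returns eigenvalues in ascending order; columns are eigenvectors.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    threshold = (1.0 - 1.0 / m) * factor
    k = int(np.sum(eigenvalues < threshold))
    # Because the eigenvalues are ascending, the first k columns are the kept ones.
    return eigenvalues, eigenvectors[:, :k], k


# Laplacian of a toy graph with two disconnected pairs of texts.
L = [[1, -1, 0, 0],
     [-1, 1, 0, 0],
     [0, 0, 1, -1],
     [0, 0, -1, 1]]
eigenvalues, features, k = reduced_feature_matrix(L, m=2)
```

In this toy graph the eigenvalues are 0, 0, 2, 2, so with m = 2 the threshold 0.475 keeps k = 2 dimensions, and texts in the same connected pair receive identical rows in the feature matrix.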

5. Cluster the feature matrix with kmeans (in this embodiment)

After the feature matrix is constructed, the classical clustering algorithm kmeans is used to cluster it. Spectral clustering needs only the similarity matrix between texts, so it handles sparse data effectively, which is difficult for kmeans used directly; spectral clustering also applies dimensionality reduction, so it handles high-dimensional data better than kmeans used directly. If the number of categories is passed in, step 6 below can be skipped once kmeans clustering is complete, and step 7 extracts the category keywords directly to complete the clustering task. If the number of categories is not passed in, step 6 is needed to find a better number of clusters before keyword extraction completes the clustering task.
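A minimal kmeans sketch over the rows of a feature matrix is given below. The text names kmeans but does not specify initialization or stopping rules, so the farthest-point initialization, the iteration cap, and the function names are assumptions made to keep the example deterministic.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    # Farthest-point initialization (an assumption; the source only names kmeans).
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


rows = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels, centers = kmeans(rows, k=2)
```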

6. Evaluate the clustering effect

The cluster-number parameter is adjusted and the process returns to step 3 to obtain a new clustering result each time. The histogram of each clustering result is evaluated, and the number of clusters whose histogram shows the best effect is taken as the number of categories for this clustering task.
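The search over candidate cluster numbers can be sketched as a simple loop. The application evaluates a histogram of each clustering result but does not detail the criterion, so this sketch swaps in a mean silhouette score as a stand-in; the precomputed labelings stand in for re-running steps 3 to 5, and all names here are hypothetical.

```python
def mean_silhouette(values, labels):
    """Mean silhouette of 1-D values; a stand-in score, not the source's histogram."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    total = 0.0
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # the silhouette of a singleton is taken as 0
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        b = min(sum(abs(v - u) for u in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(values)


def select_cluster_count(candidates, run_clustering, score):
    """Try each candidate cluster number and keep the best-scoring result."""
    best = None
    for m in candidates:
        labels = run_clustering(m)
        s = score(labels)
        if best is None or s > best[0]:
            best = (s, m, labels)
    return best[1], best[2]


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
# Stand-in for re-running steps 3 to 5 with each cluster number.
candidate_labels = {2: [0, 0, 0, 1, 1, 1],
                    3: [0, 0, 0, 1, 1, 2],
                    4: [0, 0, 1, 2, 2, 3]}
best_m, best_labels = select_cluster_count(
    [2, 3, 4],
    run_clustering=lambda m: candidate_labels[m],
    score=lambda labels: mean_silhouette(values, labels))
```

Any scoring function with "higher is better" semantics can be passed as `score`, so the histogram-based evaluation of the application could be plugged in at the same point.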

7. Extract category keywords

According to the clustering results and the text keywords, the category keywords are extracted by the TF-IDF algorithm, and the content of each category can be roughly judged from its category keywords. These category keywords are extracted from TF-IDF values calculated over the categories of this clustering task only, and are independent of any text data outside this task.
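One way to read this step is to treat each cluster as a single document and compute TF-IDF across the clusters of the current task. The sketch below follows that reading; the unsmoothed IDF (which gives weight 0 to a keyword present in every cluster), the alphabetical tie-breaking, and the function name `category_keywords` are assumptions.

```python
import math
from collections import Counter


def category_keywords(doc_keywords, labels, top_n=2):
    """Return {cluster label: top_n keywords ranked by cluster-level TF-IDF}."""
    # Pool the keywords of all texts assigned to the same cluster.
    pooled = {}
    for words, lab in zip(doc_keywords, labels):
        pooled.setdefault(lab, Counter()).update(words)
    n = len(pooled)
    # In how many clusters each keyword appears.
    df = Counter(w for counts in pooled.values() for w in counts)
    result = {}
    for lab, counts in pooled.items():
        total = sum(counts.values())
        scores = {w: (c / total) * math.log(n / df[w]) for w, c in counts.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[lab] = ranked[:top_n]
    return result


doc_keywords = [["stock", "market", "report"],
                ["stock", "fund", "report"],
                ["match", "goal", "report"],
                ["match", "team", "report"]]
labels = [0, 0, 1, 1]
keywords_by_cluster = category_keywords(doc_keywords, labels)
```

In this toy input "report" appears in both clusters, so its IDF is 0 and it is never chosen as a category keyword.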

8. The process ends; return the texts of each category and the category keywords

In this embodiment, the method of the present application and the kmeans and DBSCAN algorithms are respectively used to perform clustering processing on four types of data, wherein the four types of data are:

Data 1:

Table 1

Data 2:

Table 2

Data 3 (network download data, a total of 14 news categories):

Table 3

Category	Quantity (articles)
Finance (财经)	200
Lottery (彩票)	200
Real estate (房产)	200
Stock (股票)	200
Furniture (家具)	200
Education (教育)	200
Technology (科技)	200
Society (社会)	200
Fashion (时尚)	200
Current affairs (时政)	200
Sports (体育)	200
Horoscope (星座)	200
Entertainment (娱乐)	200

Data 4 (network download data set, used for classification model training data set):

Table 4

Category	Quantity (articles)
Art	800
Economy	800
Politics	800
Space	800
Sports	800
Agriculture	300
Computer	300
Environment	300
History	300

This embodiment tests kmeans, the DBSCAN algorithm, and the method proposed in this application on the above four kinds of data to obtain the precision, recall, and F1 value. The three test indicators are explained first. According to the confusion matrix, for a binary classification problem, combining the predicted results with the actual results gives the following four cases:

Table 5

Since the numbers 1 and 0 are hard to read, T (True) is used for correct, F (False) for wrong, P (Positive) for 1, and N (Negative) for 0. The predicted result (P|N) is considered first, and the predicted result is then compared with the actual result to give the judgment (T|F). Following this logic, the relabeled table is:

Table 6

TP, FP, FN, TN can be understood as:

TP: Prediction is 1, actual is 1, the prediction is correct.

FP: Predicted 1, actual 0, wrong prediction.

FN: Predicted 0, actual 1, wrong prediction.

TN: The prediction is 0, the actual is 0, the prediction is correct.

Accuracy: the percentage of correct predictions in the total sample, expressed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: defined with respect to the prediction results, it is the probability that a sample predicted as positive is actually positive. The expression is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: defined with respect to the original samples, it is the probability that an actually positive sample is predicted as positive. The expression is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 score expression is:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
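The four indicators can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F1; the function names and the example counts are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Illustrative counts only; they are not taken from the tests in this document.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

As a consistency check, applying `f1_score` to the average precision and recall reported for Kmeans on data 1 (88.6 and 86.5) gives about 87.5, which matches the F1 value reported for that row.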

For data 1, the test is carried out with the number of categories passed in, and the test results are shown in Table 7:

Table 7

Algorithm	Average precision (%)	Average recall (%)	Average F1 value (%)
Kmeans	88.6	86.5	87.5
DBSCAN	61.8	43.6	51.1
This method	93.2	90.4	91.8

It can be seen from the data in Table 7 that using different clustering algorithms, when a fixed number of clusters is input, the precision, recall and F1 value of this method are better than the kmeans algorithm and the DBSCAN algorithm.

For data 2, the method proposed in this application is tested both with and without the number of categories specified, and the test results are shown in Table 8:

Table 8

Number of clusters specified	Clustering result	Precision (%)	Recall (%)	F1 value (%)
Specified (4 clusters)	4 categories	96.2	93.7	94.9
Not specified	4 categories	96.2	93.7	94.9

As the data in Table 8 show, the test document set is known to contain 4 categories. Whether or not the 4 categories are specified, this method produces 4 categories, and the clustering effect is good.

For data 3, test the results of this method with the specified number of clusters:

Table 9

Clustering result	Precision (%)	Recall (%)	F1 value (%)
14 categories	93.2	93	[missing]

As the data in Table 9 show, when a large document set with many categories is tested and a fixed number of categories is passed in, the average precision, average recall, and average F1 value of this method all exceed 90%, which is a good result.

For data 4, a total of two tests were carried out. The first test was to use all 9 categories, each category contained 300 texts, and a total of 2700 texts were tested, and the test results of kmeans and DBSCAN were compared as shown in Table 10:

Table 10

Algorithm	Average precision (%)	Average recall (%)	Average F1 value (%)
Kmeans	65.8	63.9	64.8
DBSCAN	52.3	49.5	50.86
This method	68.3	66.9	67.6

The test data show that neither this method nor kmeans performs especially well here: the values differ little between algorithms, and none of them is high. Random inspection of the texts and analysis of the clustering results showed that some texts of different categories in this data set are quite similar and overlap. For example, many texts in the Environment and Agriculture categories overlap and their extracted keywords are similar, so it is easy to assign the wrong category on the basis of these keywords.

Based on the above analysis, this embodiment optimizes the test data for data 4 by directly removing the categories whose texts overlap and using only 5 of the 9 categories (Art, Economy, Politics, Space, Sports), with 800 texts in each category and 4000 texts in total. The results of this second test are shown in Table 11:

Table 11

Algorithm	Average precision (%)	Average recall (%)	Average F1 value (%)
Kmeans	83.3	79.6	81.4
DBSCAN	63.2	65.8	64.5
This method	89.61	89.02	89.31

Comparing Table 11 with Table 10, the overall results improve to some extent. Across both tests on data 4, whether or not the data set is optimized, this method outperforms the kmeans and DBSCAN algorithms.

To sum up, this application improves on the original spectral clustering in three ways. First, clustering can be performed without specifying the number of clusters. Second, the number of dimensions used for dimensionality reduction is not specified directly but depends on the number of small eigenvalues. Third, category keywords can be extracted after clustering is completed. The main process is as follows: after the text similarity matrix is constructed, the cluster-number parameter is adjusted and the adjacency matrix (W), degree matrix (D), and Laplacian matrix (L) are calculated; the eigenvalues and eigenvectors are then calculated; the number k of eigenvalues that satisfy the condition is determined, the eigenvector dimension is reduced to k, and an eigenvector matrix is constructed; a classical clustering algorithm (such as kmeans) clusters the eigenvector matrix; and the clustering effect is evaluated to select the number of clusters with the better effect. In this way the clustering effect still meets requirements when the number of clusters is not input, while the ability of the original spectral clustering to cluster a text set with a specified number of clusters is retained. The method therefore helps users cluster unknown data sets and also lets them cluster text when the number of categories is known. In addition, category keywords are output so that users can judge the subject content of each category from its keywords. In testing, the method also showed some improvement in precision and recall over traditional clustering algorithms.

In practical applications, the method proposed in this application can cluster a document set whether the number of categories is unknown or known. It can be applied for customers who want to classify unlabeled text sets and extract the keywords of each category. It can also be extended to clustering sets of sensitive documents of unknown categories: the keywords of the resulting labeled sensitive documents are then used for document classification, so that known sensitive documents are used to judge whether an unknown document is sensitive and which category it belongs to, and a response is made according to the sensitive category determined.

In text clustering, the method can cluster document sets with a known number of categories as well as document sets with an unknown number of categories; as long as the user has the document set data, the classification of the documents can be completed. The method is effective on sparse data and performs better than traditional clustering algorithms; it applies dimensionality reduction to high-dimensional data, so the complexity of clustering is also lower than that of traditional clustering algorithms; and the precision and recall of the clustering results are higher than those of the traditional algorithms. The method is widely applicable: it can process document sets of unknown categories and of known categories, and it can cluster document sets in specific fields (for example, known sensitive or confidential documents) as well as general document sets. On top of clustering, category keywords can be viewed, so the general content of a category can be understood without reading each file. A text classification model can also be created from the category keywords and applied to text classification.

The present application improves the spectral clustering algorithm by adding a process of adjusting the clustering parameters, so that candidate numbers of categories are supplied automatically; by evaluating the corresponding adjusted clustering results, the optimal clustering result is selected and the corresponding number of categories is determined. This makes it possible to cluster document sets with an unknown number of categories, so that users only need to provide the document set data and the method proposed in this application completes the work of distinguishing the categories of the document set. The present application combines the clustering results and the extracted keywords and uses the TF-IDF algorithm to extract the category keywords corresponding to the clustering results, so that users can view the category keywords of the different categories of text directly and learn the subject content of the text without reading the files. The present application screens the eigenvalues based on the number of categories and uses the number of screened eigenvalues as the number of dimensions for dimensionality reduction, so that the feature matrix corresponding to the document set to be clustered is obtained by dimensionality reduction, which can greatly reduce the cost of the subsequent clustering. In addition, the present application uses keywords extracted from the document set to be clustered to construct the text similarity matrix, which allows sparse data to be clustered effectively.

Claims

1. A text clustering method, comprising:

performing word segmentation, stop word removal, and keyword extraction in turn on a document set to be clustered;

creating a text similarity matrix according to the extracted keywords;

constructing an adjacency matrix based on the text similarity matrix, and constructing a degree matrix based on the adjacency matrix;

constructing a Laplacian matrix by combining the adjacency matrix and the degree matrix;

calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix corresponding to the document set to be clustered;

clustering the feature matrix using a clustering method to obtain a clustering result;

when the number of cluster categories is known, using the obtained clustering result as the final clustering result; when the number of cluster categories is unknown, obtaining multiple clustering results by performing the following operations multiple times, evaluating the multiple clustering results, and selecting the final clustering result according to the evaluation results: adjusting the clustering parameters, and returning to the operations of constructing the adjacency matrix, the degree matrix, and the Laplacian matrix, calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix, and clustering the feature matrix to obtain a clustering result;

extracting category keywords based on a term frequency-inverse document frequency (TF-IDF) algorithm by combining the final clustering result and the extracted keywords;

outputting the final clustering result and the category keywords.

2. The method according to claim 1, wherein the parts of speech of the extracted keywords include nouns, verbs, gerunds, person names, place names, and organization names.

3. The method according to claim 1, wherein creating a text similarity matrix according to the extracted keywords comprises:

calculating the TF-IDF value of each keyword in all the texts in the document set to be clustered, and putting the TF-IDF value of each keyword in all the texts into a bag of words;

calculating the similarity between different texts according to the TF-IDF value of each keyword in all the texts stored in the bag of words, and constructing the text similarity matrix from the similarity between different texts.

4. The method according to claim 3, wherein the text similarity matrix is an N*N matrix, and each element in the text similarity matrix is the similarity between different texts.

5. The method according to claim 4, wherein constructing an adjacency matrix based on the text similarity matrix, and constructing a degree matrix based on the adjacency matrix, comprises:

constructing the adjacency matrix W based on the text similarity matrix by the ε-neighbor method, the K-neighbor method, or the full connection method;

constructing a diagonal matrix according to the elements in the adjacency matrix W, and using the diagonal matrix as the degree matrix D.

[Correction 07.04.2022 under Rule 91]
6. The method according to claim 5, wherein,
in the case of constructing the adjacency matrix W using the ε-neighbor method, the adjacency matrix W is:

$$w_{ij} = \begin{cases} 0, & s_{ij} > \varepsilon \\ \varepsilon, & s_{ij} \le \varepsilon \end{cases}$$

wherein w_ij is the element in the i-th row and j-th column of the adjacency matrix W, s_ij is the Euclidean distance between the element x_i and the element x_j in the text similarity matrix, and ε is the set distance threshold;
in the case of constructing the adjacency matrix W using the K-neighbor method, the adjacency matrix W is:

$$w_{ij} = w_{ji} = \begin{cases} 0, & x_i \notin KNN(x_j) \text{ and } x_j \notin KNN(x_i) \\ \exp\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right), & x_i \in KNN(x_j) \text{ or } x_j \in KNN(x_i) \end{cases}$$

wherein w_ij is the element in the i-th row and j-th column of the adjacency matrix W, the element x_i and the element x_j are elements in the text similarity matrix, KNN(x_i) is the set of K nearest neighbors of the element x_i, KNN(x_j) is the set of K nearest neighbors of the element x_j, and σ is the variance;
in the case of constructing the adjacency matrix W using the full connection method, the adjacency matrix W is:

$$w_{ij} = \exp\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

wherein w_ij is the element in the i-th row and j-th column of the adjacency matrix W, the element x_i and the element x_j are elements in the text similarity matrix, and σ is the variance.

7. The method according to claim 6, wherein the degree matrix D is:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

wherein d_i is the element on the main diagonal in the i-th row of the degree matrix D, and n is the number of all the texts.

8. The method according to claim 7, wherein the Laplacian matrix is:

L=D-W;

where L is the Laplacian matrix.

9. The method according to claim 1, wherein calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix corresponding to the document set to be clustered comprises:

solving the characteristic polynomial of the Laplacian matrix to obtain the eigenvalues of the Laplacian matrix;

solving for the eigenvectors of the Laplacian matrix according to the eigenvalues of the Laplacian matrix;

screening, according to the number of cluster categories, the eigenvalues that meet a preset condition, the number of such eigenvalues being k, and reducing the dimension of the eigenvectors of the Laplacian matrix to k so as to construct a dimension-reduced feature matrix, wherein the preset condition is that the eigenvalue is less than (1-1/m)*0.95, and m is the number of cluster categories.

10. The method according to claim 1, wherein evaluating the multiple clustering results comprises:

evaluating the multiple clustering results by calculating a histogram.