WO2005071603A2 - Clustering method and computer system - Google Patents

Clustering method and computer system

Info

Publication number
WO2005071603A2
WO2005071603A2 (PCT/EP2005/000198)
Authority
WO
Grant status
Application
Patent type
Prior art keywords
cluster
clustering
baumstruktur
zutreffendster
anspruch
Prior art date
Application number
PCT/EP2005/000198
Other languages
German (de)
French (fr)
Other versions
WO2005071603A3 (en)
Inventor
Ralf Pakull
Bernd Woost
Original Assignee
Bayer Materialscience Ag
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06KRECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62Methods or arrangements for recognition using electronic means
    • G06K9/6217Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06K9/6218Clustering techniques

Abstract

The invention relates to a clustering method for assigning data to a cluster from a predetermined set of clusters. The method comprises the following steps: the data are assigned to a first cluster of the cluster set by means of a first clustering method, to a second cluster of the cluster set by means of a second clustering method, and to a third cluster of the cluster set by means of at least a third clustering method; the at least first, second and third clusters are then input into a voting module in order to determine the probably most appropriate one of the first, second and third clusters.

Description

Clustering method and computer system

The invention relates to a clustering method for assigning a file to a cluster from a predetermined set of clusters, as well as to a corresponding digital storage medium and a computer system.

Various clustering methods for assigning a file to a cluster from a predetermined set of clusters are known from the prior art. An important application of such clustering methods is the categorization of documents, that is, the assignment of documents to predefined clusters, which in this case are referred to as categories.

Common to the known clustering methods is the distinction between a training phase and a clustering or categorization phase. The training phase is used to define the schemes on which the subsequent categorization is based. The schemes obtained through training are stored, for example, in the form of parameter values.

Common clustering methods that are used for classification purposes are the centroid vector method and the decision tree method.

During the training phase, the centroid vector method builds, for each category, a vector from the most significant words of the training documents, that is, those words that are also most distinctive with respect to the words of the other categories. During the categorization phase, the vocabulary of the document is then compared with the vectors of the individual categories.
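The centroid vector comparison described above can be sketched as follows; the helper names and the toy training documents are illustrative assumptions and not part of the patent, and the "most significant words" selection is simplified to plain term frequency:

```python
from collections import Counter
import math

def centroid(docs):
    """Build a normalized term-frequency vector from a category's training documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def cosine(vec, doc):
    """Compare the document's vocabulary with a category vector (cosine similarity)."""
    counts = Counter(doc.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return sum(vec.get(w, 0.0) * c / norm for w, c in counts.items())

def categorize(doc, centroids):
    """Assign the document to the category with the most similar centroid vector."""
    return max(centroids, key=lambda cat: cosine(centroids[cat], doc))

centroids = {
    "sports": centroid(["goal match team", "team season win"]),
    "finance": centroid(["stock market price", "price bond market"]),
}
print(categorize("the team scored a goal", centroids))  # sports
```

A production system would additionally weight terms (e.g. by distinctiveness across categories, as the text describes) rather than use raw counts.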

Decision tree methods transform the training documents into binary tree structures on the basis of true/false questions about their topic. Such a binary tree represents a structure of yes/no questions by which each document is assigned as either belonging or not belonging to a category. During the categorization phase, the document to be classified is then compared against this decision tree.
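A minimal sketch of such a yes/no question structure; the question predicates and category names are invented for illustration:

```python
# Each inner node asks a yes/no question about the document; leaves name a category.
# A node is either a category string (leaf) or (question, yes_branch, no_branch).
tree = ("contains 'goal'",
        ("contains 'season'", "sports/league", "sports/other"),
        "non-sports")

def classify(doc, node):
    """Walk the yes/no question tree until a leaf category is reached."""
    if isinstance(node, str):          # leaf: category reached
        return node
    question, yes_branch, no_branch = node
    word = question.split("'")[1]      # extract the probed word from the question
    return classify(doc, yes_branch if word in doc.split() else no_branch)

print(classify("final goal of the season", tree))  # sports/league
```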

Another method for partitioning a set of data into a given number of clusters is the k-means method (see Hartigan, J. A.; Wong, M. A. (1979), "A K-Means Clustering Algorithm", Applied Statistics 28, and I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering", Machine Learning, 42(1):143-175, January 2001).
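The k-means procedure cited above alternates between assigning points to the nearest centroid and recomputing the centroids. A minimal one-dimensional sketch (the sample data and the naive initialization are illustrative assumptions):

```python
def k_means(points, k, iterations=20):
    """Plain Lloyd-style k-means on 1-D data; returns the final centroids, sorted."""
    centroids = points[:k]                       # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step: nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]   # update step: cluster means
    return sorted(centroids)

print(k_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2))  # [1.0, 9.5]
```

For text, the Dhillon/Modha variant works on high-dimensional sparse term vectors with cosine similarity instead of 1-D distances, but the assign/update loop is the same.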

An improved k-means algorithm for document categorization is known from "Iterative clustering of high dimensional text data augmented by local search", Proceedings of the 2002 IEEE International Conference on Data Mining, Dhillon, I. S.; Yuqiang Guan; Kogan, J., pages 131-138. Furthermore, trained neural networks are also used for clustering. Corresponding software for associative search is commercially available from SER Systems AG, SER brainwaregroup (www.ser.de). This program allows associative search on the basis of example passages. The associative search makes use of a neural network previously trained in a classification mode. The learning process used in this case is also called "learning by example".

Against this background, the invention is based on the object of providing an improved clustering method for assigning a file to a cluster from a predetermined set of clusters, as well as a corresponding digital storage medium and a computer system.

The objects underlying the invention are attained in each case with the features of the independent claims. Preferred embodiments of the invention are specified in the dependent claims.

According to the invention, at least three different clustering methods are used for assigning a file to a cluster from a predetermined cluster set. The individual clustering methods have previously been trained in a training phase with the same or different example files that are assigned to clusters of the predetermined cluster set. For the subsequent categorization of a file, the file is assigned to a respective cluster of the cluster set by each of the various clustering methods. It may happen that the various clustering methods lead to different results. According to the invention, the probably most appropriate cluster is selected from the individual results of the various clustering methods by a so-called voting process.

Voting procedures are known per se in the field of fault-tolerant control and regulation systems ("Fehlertolerante Steuerungs- und Regelungssysteme", Hubert Kirrmann and Karl-Erwin Großpietsch, Automatisierungstechnik 50, 2002). In the prior art, such voting methods are implemented by majority circuits, for example for control rooms in power plants, power control, automation equipment, railway signaling and airplane electronics. The invention is based on the finding that such a voting process can also be used for clustering, and in particular for the categorization of documents.

For example, if three different clustering methods are used to categorize the same document, it may happen that one of the clustering methods arrives at a different result than the other two. In this case, a majority decision is made by the voting process; that is, the cluster that has been consistently determined by the majority of the clustering methods is selected as the probably most appropriate cluster. In this example, the clustering method whose result deviates from those of the other two clustering methods is "overruled".
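The majority decision in this example can be sketched as follows; the three string results stand in for the outputs of the individual clustering methods:

```python
from collections import Counter

def vote(cluster_assignments):
    """Return the cluster chosen by most clustering methods (the 'probably
    most appropriate' cluster); ties fall to the first result encountered."""
    return Counter(cluster_assignments).most_common(1)[0][0]

# Two methods agree on "C21"; the deviating third method is overruled.
print(vote(["C21", "C21", "C22"]))  # C21
```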

This procedure is of particular advantage for the reliable clustering of files, since the clustering methods known from the prior art do not, individually, work with sufficient confidence.

According to a preferred embodiment of the invention, the predetermined cluster set is structured as a hierarchical tree. Preferably, this is a binary tree in which two branches emanate from each node of the tree, with the exception of the leaf nodes. Each node of the tree structure represents one of the clusters. In the following, the terms "node" and "cluster" are therefore used interchangeably.

Each of the clustering methods is trained during the training phase separately for each node of the tree with corresponding training documents. The training documents for a particular node of the tree structure are assigned to the clusters of the next lower level of the tree to which the node in question is connected. The clustering thus runs stepwise along a path in the tree structure. Because of this stepwise approach, a particularly high quality and certainty of a correct classification of a file or document can be achieved.

In the following, preferred embodiments of the invention are explained in more detail with reference to the drawings, in which:

Figure 1 is a block diagram of a computer system with multiple clustering modules,

Figure 2 is a flowchart of a clustering method with a voting step,

Figure 3 is a block diagram of a computer system having a controller for stepwise clustering along a path in a tree structure,

Figure 4 is a hierarchical tree structure with training documents for each node,

Figure 5 is a flowchart of a clustering method for stepwise clustering along a path of the tree structure.

Figure 1 shows a computer system 100 having a clustering module 102. The clustering module 102 has a number of n clustering programs k, where 1 ≤ k ≤ n. The clustering programs 1 to n can be based on different clustering methods, such as centroid vector methods, decision trees, the k-means method, neural networks or other methods that can assign a file to a cluster of a predetermined cluster set. The clustering programs 1 to n are associated with a voting module 104, which is designed to form, for example, a majority decision based on the individual clustering results of the clustering programs 1 to n.

The computer system 100 furthermore has a memory 106 for storing sets of parameters P1, P2, ... for the respective clustering programs 1, 2, .... In addition, the computer system 100 has a control program 108, an input 110 and an output 112. In the example considered here, the output 112 is associated with a database 114.

The parameter sets for the different clustering programs are obtained in a training phase. For this purpose, the individual clustering programs are trained with categorized sample documents, as is known from the prior art.

During operation of the computer system 100, a document is input to the clustering module 102 via the input 110. Then the document is assigned by each of the clustering programs 1 to n, using the respective clustering technique, to a cluster of the cluster set. If need be, the individual clustering programs access their respective parameter sets in the memory 106.

The respective clustering results are output from the individual clustering programs 1 to n to the voting module 104. From these individual clustering results, which may differ from one another, the voting module determines the probably most appropriate cluster by a majority decision. This probably most appropriate cluster is output via the output 112. The cluster membership of the document can be stored in the database 114.

Figure 2 shows a corresponding flow chart. In step 200, a document is input. Thereafter, the clustering methods 1, 2 and 3, which may each be based on different clustering algorithms, are performed in parallel and independently in steps 202, 204 and 206. The respective clustering results, that is, the respective assignments of the document to one of the clusters of the predetermined cluster set, are evaluated in a voting process in step 208; that is, the probably most appropriate cluster is determined from the individual clustering results of steps 202, 204 and 206 by a majority decision. This probably most appropriate cluster is output in step 210.

Figure 3 shows a block diagram of a computer system 300. Elements of Figure 3 that correspond to elements of Figure 1 are marked with reference numerals increased by 200. In contrast to the embodiment of Figure 1, the predetermined cluster set is implemented in the embodiment of Figure 3 as a tree structure. Each node of the tree structure is assigned to exactly one cluster. The clustering is then performed stepwise, starting from the tree root, along a path to a leaf of the tree structure. For this purpose, parameter sets are stored in the memory 306 for each of the clustering programs 1 to n and for each of the nodes of the tree structure, except for the leaf nodes.

These parameter sets are obtained by training the various clustering programs with sample documents that are specific to each of the nodes. This will be explained in more detail below with reference to Figure 4.

In order to realize the stepwise clustering along a path in the tree structure, the control program 308 is connected to the voting module 304. The control program 308 thus receives the result of the majority decision from the voting module 304.

In operation of the computer system 300, a document is again input to the clustering module 302 via the input 310. Then a first clustering step is performed in order to find, starting from the root of the tree, the probably most appropriate cluster on the next level of the tree. To this end, the controller retrieves the parameter sets of the different clustering programs 1 to n that are specific to the tree root from the memory 306 and inputs them into the corresponding clustering programs 1 to n, which perform the clustering on this basis.

The voting module 304 evaluates the cluster assignments of the clustering programs 1 to n by means of a majority decision and outputs the result, that is, the probably most appropriate cluster of the next level of the tree structure, to the controller 308. The controller 308 then retrieves from the memory 306 the parameter sets of the different clustering programs 1 to n that are specific to this probably most appropriate cluster, and inputs them into the clustering programs 1 to n.

The clustering programs then perform a further clustering step for the document in order to determine the cluster on the next level of the tree. The individual assignment results of the clustering programs 1 to n are in turn input into the voting module 304, which makes a majority decision and outputs it to the controller 308. On this basis, the controller 308 in turn retrieves from the memory 306 the parameter sets of the clustering programs 1 to n that are specific to the probably most appropriate cluster of the current level, and inputs these parameter sets into the clustering programs 1 to n, so that a further clustering step can be carried out, and so on. This process continues until the controller 308 determines that a leaf of the tree has been reached. The probably most appropriate cluster determined by the voting module 304 for a leaf of the tree structure is output via the output 312. The determined cluster membership of the document can in turn be stored in the database 314.
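The controller loop described above — retrieve node-specific parameter sets, run all clustering programs, vote, descend — can be sketched as follows. The tree layout matches Figure 4, but the stub clustering programs and the empty parameter sets are illustrative assumptions:

```python
from collections import Counter

# Children of each non-leaf node of the tree structure (cf. Figure 4).
children = {"C11": ["C21", "C22"], "C21": ["C31", "C32"], "C22": ["C33", "C34"]}

def classify_stepwise(document, clustering_programs, parameter_sets):
    """Descend the cluster tree from the root, taking a majority vote of all
    clustering programs at each node, until a leaf node is reached."""
    node = "C11"                                    # start at the tree root
    while node in children:                         # loop until a leaf node
        options = children[node]
        # Each program classifies using the parameter set specific to this node.
        votes = [prog(document, parameter_sets[node], options)
                 for prog in clustering_programs]
        node = Counter(votes).most_common(1)[0][0]  # majority decision
    return node

# Stub programs standing in for e.g. centroid, decision-tree and k-means methods;
# the third one deviates and is overruled at every level.
progs = [lambda d, p, o: o[0], lambda d, p, o: o[0], lambda d, p, o: o[1]]
params = {n: None for n in children}
print(classify_stepwise("some document", progs, params))  # C31
```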

In addition, it is also possible to output the individual intermediate results of the stepwise clustering process, that is, the probably most appropriate clusters of the different levels of the tree structure.

Figure 4 shows an example of a tree structure 400. In the example of Figure 4, this is a binary tree structure. The root of the tree structure 400 represents a cluster C11, which is on level 1 of the tree structure 400.

On level 2 of the tree structure 400 there are two nodes, which represent the clusters C21 and C22. The cluster C11 is connected to these two clusters via respective branches of the tree structure 400.

On level 3 below, there are four nodes that represent the clusters C31, C32, C33 and C34. In this case, a branch leads from the cluster C21 of level 2 to either the cluster C31 or the cluster C32 of level 3, and from the cluster C22 of level 2 to either the cluster C33 or the cluster C34 of level 3.

The same applies accordingly to the further lower levels i of the tree structure 400, each with their representing clusters Cij.

Training data 402, 404, 406, 408, ... are associated with each of the nodes of the tree structure 400. For example, the training data 402 are assigned to the cluster C11 on level 1. The training data 402 include sample documents that are associated with either the cluster C21 or the cluster C22 of level 2 below. By means of these training data 402, each of the clustering programs 1 to n (cf. Figure 3) is trained in order to obtain, for each of the clustering programs, a parameter set specific to the cluster C11, which is stored in the memory 306.

Accordingly, the training data 404 are assigned to the cluster C21 and include sample documents that are associated with the clusters to which the cluster C21 points; that is, the training data 404 include sample documents that are associated with either the cluster C31 or the cluster C32. The same applies to the further training data 406 to 408. With the aid of the training data 402, 404, 406, 408, ..., a specific parameter set is generated for each of the clusters of the tree structure 400 and for each of the clustering programs 1 to n, and is stored in the memory 306 (cf. Figure 3).

When a document is entered into the computer system 300 of Figure 3, first the parameter sets of the clustering programs 1 to n that are specific to the cluster C11 are loaded. The clustering of the document therefore results in an assignment either to the cluster C21 or to the cluster C22. If, for example, the cluster C21 has been identified as the probably most appropriate cluster, in the next step the parameter sets that are specific to the cluster C21 are loaded. These parameter sets have been determined on the basis of the training data 404 for each of the clustering programs 1 to n.

In a further clustering step, the document is then assigned either to the cluster C31 or to the cluster C32 of the next level of the tree structure 400. This process continues until a leaf node on the lowest level of the tree structure 400 has been reached along a path of the tree structure 400. In the example of Figure 4, such a path may be as follows: C11-C21-C31-C42-C54, where C54 is a leaf node on the lowest level 5 of the tree structure 400.

Figure 5 shows a corresponding flow diagram. In step 500, a document to be classified is entered. In step 502, the indices i, j and k are set to 1. In step 504, the n parameter sets P1 to Pn of the clustering programs 1 to n that are specific to the node C11 are retrieved.

On the basis of these specific parameter sets, the corresponding clustering methods are then carried out in steps 506, 508, 510, ... using the clustering programs 1 to n. The corresponding clustering results are evaluated in step 512 by a voting method in order to determine, in step 514, the probably most appropriate cluster Cij on the next level below.

From there, the flow control returns to step 504, where the parameter sets of the clustering programs 1 to n that are specific to the cluster Cij determined in step 514 are retrieved, in order to determine the probably most appropriate cluster of the next level below of the tree structure. This process continues until a leaf of the tree structure is reached.

LIST OF REFERENCE NUMBERS

100 computer system

102 Clustering module

104 voting module

106 memory

108 control program

110 input

112 output

114 database

300 computer system

302 clustering module

304 voting module

306 memory

308 control program

310 input

312 output

314 database

400 tree structure

402 training data

404 training data

406 training data

408 training data

Claims

1. A clustering method for assigning a file to a cluster of a predetermined cluster set, comprising the steps of:
assigning the file to a first cluster of the cluster set by means of a first clustering method,
assigning the file to a second one of the clusters of the cluster set by means of a second clustering method,
assigning the file to a third one of the clusters of the cluster set by means of at least a third clustering method,
entering the at least first, second and third clusters into a voting module for determining the probably most appropriate one of the at least first, second and third clusters.
2. The clustering method according to claim 1, wherein at least a subset of the at least first, second and third clustering methods is based on neural networks.
3. The clustering method according to claim 1 or 2, wherein parameter sets are retrieved for carrying out the at least first, second and third clustering methods, respectively, which parameter sets have previously been obtained by training the at least first, second and third clustering methods.
4. The clustering method according to claim 1, 2 or 3, wherein the cluster set comprises a tree structure and, starting from a root of the tree structure, a probably most appropriate cluster is determined for each level of the tree structure until a leaf of the tree structure is reached.
5. The clustering method according to claim 4, wherein, starting from a probably most appropriate cluster of a first one of the levels of the tree structure, a probably most appropriate cluster of a second level below is determined by accessing the parameter sets specific to the probably most appropriate cluster of the first level for each of the at least first, second and third clustering methods.
6. A digital storage medium for assigning a file to a cluster of a predetermined cluster set, having program means for performing the steps of:
assigning the file to a first one of the clusters of the cluster set by means of a first clustering program module,
assigning the file to a second one of the clusters of the cluster set by means of a second clustering program module,
assigning the file to a third one of the clusters of the cluster set by means of at least a third clustering program module,
determining the probably most appropriate one of the at least first, second and third clusters by means of a voting program module.
7. The digital storage medium according to claim 6, wherein at least a subset of the clustering program modules each comprise a neural network.
8. The digital storage medium according to claim 6 or 7, wherein the program means are configured to access parameter sets of the clustering program modules, which have previously been obtained by training.
9. The digital storage medium according to claim 6, 7 or 8, wherein the cluster set has a tree structure and the program means are adapted to determine, starting from a root of the tree structure, a probably most appropriate cluster for each level of the tree structure until a leaf of the tree structure is reached.
10. The digital storage medium according to any one of the preceding claims 6 to 9, wherein the program means are configured so that, starting from a probably most appropriate cluster of a first one of the levels of the tree structure, a probably most appropriate cluster of a second level below is determined by accessing the parameter sets specific to the probably most appropriate cluster of the first level for each of the at least first, second and third clustering program modules.
11. A computer system for assigning a file to a cluster of a predetermined cluster set, comprising: a first clustering program (1), which is designed to carry out a first clustering method, a second clustering program (2), which is designed to carry out a second clustering method, at least a third clustering program (3), which is designed to carry out at least a third clustering method, and a voting module (104; 304) for determining a probably most appropriate cluster from the clusters determined by the at least first, second and third clustering programs.
12. The computer system according to claim 11, wherein at least a subset of the clustering programs each comprise a neural network.
13. The computer system according to claim 11 or 12, with parameter sets (106; 306) for each of the clustering programs.
14. The computer system according to claim 11, 12 or 13, with a control module (108; 308) which is linked to the voting module in order to receive the probably most appropriate cluster from the voting module, said control module being configured to control the clustering programs for a stepwise clustering in a tree structure (400) of the cluster set, wherein, starting from a probably most appropriate cluster of a first level, a probably most appropriate cluster of a second level below is determined by accessing the parameter sets specific to the probably most appropriate cluster of the first level for each of the first, second and third clustering programs.
15. The computer system according to claim 14, wherein the computer system has a specific parameter set for each node of the tree structure that is not a leaf node, and for each of the clustering programs.
16. The computer system according to claim 15, wherein the total number of specific parameter sets is the number of nodes of the tree structure without the leaf nodes, multiplied by the number of clustering programs.
PCT/EP2005/000198 2004-01-23 2005-01-12 Clustering method and computer system WO2005071603A3 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
DE102004003496.6 2004-01-23
DE102004003496 2004-01-23
DE102004032650.9 2004-07-06
DE200410032650 DE102004032650A1 (en) 2004-01-23 2004-07-06 File assignment method for use with cluster assemblies involves use of a voting module to probabilistically determine the most appropriate of a number of clusters

Publications (2)

Publication Number Publication Date
WO2005071603A2 (en) 2005-08-04
WO2005071603A3 (en) 2006-01-12

Family

ID=34809599

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/000198 WO2005071603A3 (en) 2004-01-23 2005-01-12 Clustering method and computer system

Country Status (1)

Country Link
WO (1) WO2005071603A3 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5337371A (en) * 1991-08-09 1994-08-09 Matsushita Electric Industrial Co., Ltd. Pattern classification system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5337371A (en) * 1991-08-09 1994-08-09 Matsushita Electric Industrial Co., Ltd. Pattern classification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ISHII M ET AL: "SIMPLIFICATION OF MAJORITY-VOTING CLASSIFIERS USING BINARY DECISION DIAGRAMS", Systems & Computers in Japan, Scripta Technica Journals, New York, US, Vol. 27, No. 7, 15 June 1996 (1996-06-15), pages 25-40, XP000596040, ISSN: 0882-1666 *
KITTLER J ET AL: "ON COMBINING CLASSIFIERS", IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Inc., New York, US, Vol. 20, No. 3, March 1998 (1998-03), pages 226-239, XP000767916, ISSN: 0162-8828 *

Also Published As

Publication number Publication date Type
WO2005071603A3 (en) 2006-01-12 application

Similar Documents

Publication Publication Date Title
Mojena Hierarchical grouping methods and stopping rules: An evaluation
Frederickson et al. The complexity of selection and ranking in X+ Y and matrices with sorted columns
Levitin The universal generating function in reliability analysis and optimization
US5014327A (en) Parallel associative memory having improved selection and decision mechanisms for recognizing and sorting relevant patterns
US6446068B1 (en) System and method of finding near neighbors in large metric space databases
Sun Rule-base structure identification in an adaptive-network-based fuzzy inference system
Lu et al. Neurorule: A connectionist approach to data mining
Oliveira et al. A methodology for feature selection using multiobjective genetic algorithms for handwritten digit string recognition
US5799311A (en) Method and system for generating a decision-tree classifier independent of system memory size
Dittenbach et al. The growing hierarchical self-organizing map
Zhou et al. Fuzzy classifier design using genetic algorithms
US5960422A (en) System and method for optimized source selection in an information retrieval system
US5367677A (en) System for iterated generation from an array of records of a posting file with row segments based on column entry value ranges
US5471677A (en) Data retrieval using user evaluation of data presented to construct interference rules and calculate range of inputs needed for desired output and to formulate retrieval queries
US7266548B2 (en) Automated taxonomy generation
Alcalá et al. A multi-objective genetic algorithm for tuning and rule selection to obtain accurate and compact linguistic fuzzy rule-based systems
Ho Random decision forests
US5640554A (en) Parallel merge and sort process method and system thereof
US5950170A (en) Method to maximize capacity in IC fabrication
US5930392A (en) Classification technique using random decision forests
Last et al. Information-theoretic algorithm for feature selection
US20050033742A1 (en) Methods for ranking nodes in large directed graphs
González Muñoz et al. Multi-stage genetic fuzzy systems based on the iterative rule learning approach
Escalera et al. Subclass problem-dependent design for error-correcting output codes
Gallant Perceptron-based learning algorithms

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase