WO2005071603A2 - Clustering method and computer system - Google Patents

Clustering method and computer system

Info

Publication number
WO2005071603A2
WO2005071603A2 PCT/EP2005/000198
Authority
WO
WIPO (PCT)
Prior art keywords
clustering
cluster
tree structure
probably
appropriate
Prior art date
Application number
PCT/EP2005/000198
Other languages
German (de)
English (en)
Other versions
WO2005071603A3 (fr)
Inventor
Ralf Pakull
Bernd Woost
Original Assignee
Bayer Materialscience Ag
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from DE102004032650A external-priority patent/DE102004032650A1/de
Application filed by Bayer Materialscience Ag filed Critical Bayer Materialscience Ag
Publication of WO2005071603A2 publication Critical patent/WO2005071603A2/fr
Publication of WO2005071603A3 publication Critical patent/WO2005071603A3/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques

Definitions

  • the invention relates to a clustering method for assigning a file to a cluster from a predetermined cluster set, and to a corresponding digital storage medium and a computer system.
  • clustering methods for assigning a file to a cluster of a predetermined cluster set are known from the prior art.
  • An important application of such clustering methods is the categorization of documents, i.e. the assignment of documents to predefined clusters, which in this case are also referred to as categories.
  • the training phase serves to define the schemes that form the basis for the subsequent categorization. These schemes obtained through training are stored, for example, in the form of parameter values.
  • Known clustering methods include centroid vector methods and decision tree methods.
  • centroid vector methods build up a vector from the most significant words of the training documents per category, which are at the same time most distinctive from the words of the other categories.
  • the vocabulary of the document is then compared with the vectors of the respective categories.
  • Decision tree procedures convert the training documents into binary tree structures based on true-false questions regarding the topic. Such a binary tree represents a structure of yes-no questions, with each document either belonging to or not belonging to the category. Then, in the categorization phase, the document to be classified is compared with this decision tree.
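The centroid vector approach described above can be illustrated with a minimal sketch. This is not the patent's implementation: the function names are hypothetical, and the selection of only the most significant, most distinctive words per category is omitted for brevity. Each category centroid is a normalized term-frequency vector, and a document is compared to it by cosine similarity:

```python
from collections import Counter
import math

def centroid(docs):
    """Build a normalized term-frequency centroid from a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()}

def cosine(vec, tokens):
    """Cosine similarity between a category centroid and a tokenized document."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return 0.0
    return sum(vec.get(t, 0.0) * v / norm for t, v in counts.items())

def categorize(centroids, tokens):
    """Assign the document to the category whose centroid is most similar."""
    return max(centroids, key=lambda c: cosine(centroids[c], tokens))
```

In the categorization phase, the vocabulary of the document is compared with every category's centroid and the best match wins, exactly as the comparison step above describes.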
  • An improved k-means algorithm for categorizing documents is known from "Iterative clustering of high dimensional text data augmented by local search", Data Mining, 2002, Proceedings of the 2002 IEEE International Conference, Dhillon, I. S.; Yuqiang Guan; Kogan, J., pages 131-138. Trained neural networks are also used for clustering.
  • Appropriate software for associative search is commercially available from SER Systems AG, SER brainware (www.ser.de). This program enables associative searches based on exemplary text passages.
  • the associative search uses a neural network previously trained in a classification mode.
  • the learning process used is also referred to as "learning by example".
  • the invention is based on the object of creating an improved clustering method for assigning a file to a cluster from a predetermined cluster set, as well as a corresponding digital storage medium and a computer system.
  • At least three different clustering methods are used to assign a file to a cluster from a predetermined cluster set.
  • the individual clustering methods were previously trained in a training phase with the same or different sample files that are assigned to clusters of the specified cluster set. For subsequent categorization of a file, it is assigned to a cluster of the cluster set using the various clustering methods. It can happen that the different clustering methods come to different results. According to the invention, the probably most appropriate cluster is selected from the individual results of the various clustering methods by a so-called voting method.
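The voting step described above can be sketched as a simple majority decision over the clusters proposed by the individual clustering methods. This is an illustrative reading of the voting method, not code from the patent; the function name is hypothetical:

```python
from collections import Counter

def vote(cluster_assignments):
    """Majority decision over the clusters proposed by the individual
    clustering methods; on a tie, the cluster proposed first wins."""
    return Counter(cluster_assignments).most_common(1)[0][0]
```

For example, if three clustering methods assign a document to C21, C22 and C21 respectively, the majority decision selects C21 as the probably most appropriate cluster.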
  • Voting methods are known per se in the field of fault-tolerant control and regulation systems ("Fehlertolerante Steuerungs- und Regelungssysteme", Hubert Kirrmann and Karl-Erwin Großpietsch, Automatisierungstechnik 50, 2002).
  • Such voting methods are realized in the prior art by majority circuits, for example for control centers in power plants, network control technology, automation technology, railway signaling and aircraft electronics.
  • the invention is based on the knowledge that such voting methods can also be used for clustering and in particular for the categorization of documents.
  • the predetermined clusters of the cluster set are structured as a hierarchical tree. This is preferably a binary tree in which two branches start from each node of the tree, with the exception of the leaf nodes. Each node in the tree structure represents one of the clusters.
  • the terms "node" and "cluster" are used synonymously.
  • each of the clustering methods is trained separately for each of the nodes in the tree structure with corresponding training documents.
  • the training documents for a particular node in the tree structure are assigned to the clusters of the next lower tree level to which the node in question is connected.
  • the clustering here runs step by step along a path within the tree structure. Due to this step-by-step approach, a particularly high quality and security of an appropriate categorization of a file or document can be achieved.
  • FIG. 1 shows a block diagram of a computer system with several clustering modules
  • FIG. 2 shows a flow diagram of a clustering method with a voting step
  • FIG. 3 shows a block diagram of a computer system with a controller for stepwise clustering along a path in a tree structure
  • FIG. 4 shows a hierarchical tree structure with training documents for each node
  • FIG. 5 shows a flow diagram of a clustering method for step-by-step clustering along a path of the tree structure.
  • FIG. 1 shows a computer system 100 with a clustering module 102.
  • the clustering module 102 has a number of n clustering programs k, where 1 ≤ k ≤ n.
  • the clustering programs 1 to n can be based on different clustering methods, such as centroid vector methods, decision tree methods, K-means methods, neural networks or other methods that can assign a file to a cluster from a predetermined cluster set.
  • the clustering programs 1 to n are linked to a voting module 104, which is designed, for example, to form a majority decision based on the individual clustering results of the clustering programs 1 to n.
  • the computer system 100 also has a memory 106 for storing parameter sets P1, P2, ... for the corresponding clustering programs 1, 2, ... Furthermore, the computer system 100 has a control program 108 as well as an input 110 and an output 112. The output 112 is linked to a database 114 in the example considered here.
  • the parameter sets for the various clustering programs are obtained in a training phase.
  • the individual clustering programs are trained with categorized example documents, as is known per se from the prior art.
  • a document is entered into the clustering module 102 via the input 110.
  • the document is then assigned to a cluster of the cluster set by each of the clustering programs 1 to n according to the respective clustering method.
  • the individual clustering programs access their respective parameter sets of the memory 106.
  • the respective results of the clusterings are output by the individual clustering programs 1 to n to the voting module 104. From the individual clustering results, which may differ from one another, the voting module determines the probably most appropriate cluster by a majority decision. This cluster is output via the output 112.
  • the cluster membership of the document can be stored in the database 114.
  • FIG. 2 shows a corresponding flow chart.
  • step 200 a document is entered.
  • the clustering methods 1, 2 and 3 are then carried out in parallel and independently of one another in steps 202, 204 and 206, which can each be based on different clustering algorithms.
  • the respective clustering results, that is to say the respective assignments of the document to one of the clusters of the predetermined cluster set, are evaluated in a voting method in step 208; that is, the probably most appropriate cluster is determined from the individual clustering results of steps 202, 204 and 206 by a majority decision. This probably most appropriate cluster is output in step 210.
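The flow of FIG. 2, namely running the clustering methods in parallel and then evaluating their results by a majority decision, can be sketched as follows. This is an illustrative sketch under assumed names, not the patent's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def cluster_with_voting(document, clusterers):
    """Run the clustering methods in parallel (cf. steps 202 to 206) and
    evaluate their results by a majority decision (cf. step 208)."""
    with ThreadPoolExecutor() as pool:
        # Each clusterer independently assigns the document to a cluster.
        results = list(pool.map(lambda c: c(document), clusterers))
    # Majority decision over the possibly differing individual results.
    return Counter(results).most_common(1)[0][0]
```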
  • FIG. 3 shows a block diagram of a computer system 300. Elements in FIG. 3 which correspond to elements in FIG. 1 are identified by reference numbers increased by 200.
  • the predetermined cluster set in the embodiment in FIG. 3 is designed as a tree structure. Each node in the tree structure is assigned to exactly one cluster. The clustering is carried out step by step starting from the tree root along a path to a leaf of the tree structure. For this purpose, parameter sets for each of the clustering programs 1 to n are stored in the memory 306, specifically one for each of the nodes of the tree structure, with the exception of the leaf nodes.
  • control program 308 is connected to the voting module 304.
  • the control program 308 thus receives the result of the majority decision from the voting module 304.
  • a document is again entered into the clustering module 302 via the input 310.
  • a first clustering step is then carried out in order to find the most likely cluster from the root of the tree structure on the next level of the tree structure. This is done in such a way that the controller retrieves the parameter sets of the various clustering programs 1 to n specific to the tree root and enters them into the corresponding clustering programs 1 to n, which carry out the clustering on this basis.
  • the voting module 304 evaluates the individual cluster assignments of the clustering programs 1 to n by means of a majority decision, and outputs the result, that is to say the probably most appropriate cluster of the next level of the tree structure, to the controller 308.
  • the controller 308 then in turn retrieves the parameter sets of the various clustering programs 1 to n which are specific to this cluster, which is probably the most appropriate, and enters these into the clustering programs 1 to n.
  • the individual assignment results of the clustering programs 1 to n are in turn input into the voting module 304, which makes a majority decision and outputs this to the controller 308.
  • the controller 308 retrieves the parameter sets of the clustering programs 1 to n that are specific for the probably most relevant cluster of the current level, and enters these parameter sets into the clustering programs 1 to n so that a further clustering step can take place, and so on. This process continues until the controller 308 determines that a leaf of the tree structure has been reached.
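The controller loop just described, descending one voted clustering step per tree level until a leaf is reached, can be sketched as follows. All names (the root label, the data structures, the callables) are illustrative assumptions, not disclosed by the patent:

```python
def classify_along_tree(document, tree, parameter_sets, clusterers, vote):
    """Stepwise clustering from the tree root down to a leaf.

    `tree` maps each non-leaf cluster to its child clusters, `parameter_sets`
    maps (program index, cluster) to the parameters trained for that node,
    and `vote` forms the majority decision over the individual results.
    """
    node = "C11"  # root of the tree structure
    while node in tree:  # leaf clusters have no entry in `tree`
        # Each clustering program proposes a child cluster using the
        # parameter set that is specific to the current node.
        proposals = [
            program(document, parameter_sets[(k, node)])
            for k, program in enumerate(clusterers)
        ]
        node = vote(proposals)  # probably most appropriate child cluster
    return node
```

The loop terminates exactly when the voted cluster has no children, which corresponds to the controller determining that a leaf of the tree structure has been reached.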
  • the most likely cluster of a leaf of the tree structure determined by the voting module 304 is output via the output 312.
  • the cluster membership of the document determined in this way can in turn be stored in the database 314.
  • FIG. 4 shows an example of a tree structure 400.
  • the example case considered in FIG. 4 is a binary tree structure.
  • the root of the tree structure 400 represents a cluster C11. This is on level 1 of the tree structure 400.
  • on level 2 of the tree structure 400 there are the clusters C21 and C22; the cluster C11 is connected to these two clusters via corresponding branches of the tree structure 400.
  • on level 3 of the tree structure 400 there are four nodes that represent the clusters C31, C32, C33 and C34. One gets from cluster C21 of level 2 either to cluster C31 or C32 of level 3, and from cluster C22 of level 2 to cluster C33 or cluster C34 of level 3.
  • Training data 402, 404, 406, 408, ... are assigned to each of the nodes of the tree structure 400.
  • the training data 402 are assigned to the cluster C11 of level 1.
  • the training data 402 contain example documents which are assigned to either the cluster C21 or the cluster C22 of the level 2 below.
  • on the basis of the training data 402, the clustering programs 1 to n are each trained in order to obtain, for each of the clustering programs, a parameter set specific to the cluster C11, which is stored in the memory 306 (cf. FIG. 3).
  • the training data 404 are assigned to the cluster C21 and contain example documents which are assigned to the clusters to which the cluster C21 is connected; that is, the training data 404 include example documents that are assigned to either the cluster C31 or the cluster C32.
  • a specific parameter set is generated for each of the clusters of the tree structure 400 and for each of the clustering programs 1 to n, which is stored in the memory 306 (cf. FIG. 3).
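The per-node training described above can be sketched as a loop over all non-leaf nodes and all clustering programs. The names and data structures are illustrative only; the patent does not specify how the parameter sets are represented:

```python
def train_tree(tree, training_data, trainers):
    """Train each clustering program separately for each non-leaf node.

    `training_data[node]` holds the example documents assigned to the node's
    child clusters (e.g. the training data 402 for cluster C11), and
    `trainers[k]` produces the parameter set of clustering program k.
    """
    parameter_sets = {}
    for node, children in tree.items():  # leaf nodes carry no training data
        for k, trainer in enumerate(trainers):
            # One node-specific parameter set per clustering program,
            # to be stored in the memory (cf. memory 306 in FIG. 3).
            parameter_sets[(k, node)] = trainer(training_data[node], children)
    return parameter_sets
```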
  • the parameter sets of the clustering programs 1 to n that are specific for the cluster C11 are loaded first.
  • the clustering of the document therefore results in either an assignment to cluster C21 or to cluster C22. If, for example, cluster C21 has been determined as the most likely cluster, the next step is to load the parameter sets that are specific to cluster C21. These parameter sets were determined on the basis of the training data 404 for the clustering programs 1 to n.
  • the document is then assigned to either the cluster C31 or the cluster C32 of the next level of the tree structure 400.
  • This process continues until a leaf node on the lowest level of the tree structure 400 has been reached along a path of the tree structure 400.
  • a path could be as follows: C11-C21-C31-C42-C54, where C54 is a leaf node on the lowest level 5 of the tree structure 400.
  • FIG. 5 shows a corresponding flow chart.
  • a document to be classified is entered.
  • the indices i, j and k are set to 1.
  • the n parameter sets P1 to Pn of the clustering programs 1 to n, which are specific to the node C11, are loaded.
  • the corresponding clustering methods are then carried out in steps 506, 508, 510,... Using the clustering programs 1 to n.
  • the corresponding clustering results are evaluated in step 512 by a voting method in order to determine the probably most appropriate cluster Cy of the next lower level in step 514.
  • in step 504, the parameter sets of the clustering programs 1 to n that are specific to the cluster Cy determined in step 514 are then loaded in order to determine the probably most appropriate cluster for the next level of the tree structure. This process continues until a leaf of the tree structure is reached.
  • List of reference numerals: 104 voting module, 106 memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a clustering method for assigning a file to a cluster of a predetermined cluster set. The method comprises assigning the file to a first cluster of the cluster set by a first clustering method, assigning the file to a second cluster of the cluster set by a second clustering method, assigning the file to a third cluster of the cluster set by at least a third clustering method, and then feeding said at least first, second and third clusters into a voting module in order to determine the probably most relevant cluster among these first, second and third clusters.
PCT/EP2005/000198 2004-01-23 2005-01-12 Procede de regroupement et systeme informatique WO2005071603A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
DE102004003496 2004-01-23
DE102004003496.6 2004-01-23
DE102004032650A DE102004032650A1 (de) 2004-01-23 2004-07-06 Clusteringverfahren und Computersystem
DE102004032650.9 2004-07-06

Publications (2)

Publication Number Publication Date
WO2005071603A2 true WO2005071603A2 (fr) 2005-08-04
WO2005071603A3 WO2005071603A3 (fr) 2006-01-12

Family

ID=34809599

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/000198 WO2005071603A2 (fr) 2004-01-23 2005-01-12 Procede de regroupement et systeme informatique

Country Status (1)

Country Link
WO (1) WO2005071603A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112822121A (zh) * 2019-11-15 2021-05-18 中兴通讯股份有限公司 流量识别方法、流量确定方法、知识图谱建立方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5337371A (en) * 1991-08-09 1994-08-09 Matsushita Electric Industrial Co., Ltd. Pattern classification system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ISHII M ET AL: "SIMPLIFICATION OF MAJORITY-VOTING CLASSIFIERS USING BINARY DECISION DIAGRAMS" SYSTEMS & COMPUTERS IN JAPAN, SCRIPTA TECHNICA JOURNALS, NEW YORK, US, Bd. 27, Nr. 7, 15. Juni 1996 (1996-06-15), Seiten 25-40, XP000596040 ISSN: 0882-1666 *
KITTLER J ET AL: "ON COMBINING CLASSIFIERS" IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE INC. NEW YORK, US, Bd. 20, Nr. 3, März 1998 (1998-03), Seiten 226-239, XP000767916 ISSN: 0162-8828 *


Also Published As

Publication number Publication date
WO2005071603A3 (fr) 2006-01-12

Similar Documents

Publication Publication Date Title
DE69835792T2 (de) Verfahren und Apparat zum Erzeugen semantisch konsistenter Eingaben für einen Dialog-Manager
DE112009002000B4 (de) Adaptives Verfahren und Vorrichtung zur Umwandlung von Nachrichten zwischen unterschiedlichen Datenformaten
DE3901485C2 (de) Verfahren und Vorrichtung zur Durchführung des Verfahrens zur Wiedergewinnung von Dokumenten
DE60118973T2 (de) Verfahren zum abfragen einer struktur komprimierter daten
DE2712575C2 (de) Assoziatives Speichersystem in hochintegrierter Halbleitertechnik
DE602004011890T2 (de) Verfahren zur Neuverteilung von Objekten an Recheneinheiten
DE10222621A1 (de) Verfahren und Schaltungsanordnung zur Steuer- und Regelung von Photovoltaikanlagen
DE102005032734B4 (de) Indexextraktion von Dokumenten
WO2000063788A2 (fr) Reseau semantique d'ordre n, operant en fonction d'une situation
DE102005032744A1 (de) Indexextraktion von Dokumenten
DE102018212297A1 (de) Verwendung von programmierbaren Switching-Chips als künstliche neuronale Netzwerk Module
DE4210109C2 (de) Sortiervorrichtung zum Sortieren von Daten und Sortierverfahren
DE60217526T2 (de) Tertiäre cam-zelle
WO2005071603A2 (fr) Procede de regroupement et systeme informatique
DE102005032733A1 (de) Indexextraktion von Dokumenten
DE102004032650A1 (de) Clusteringverfahren und Computersystem
DE112021006636T5 (de) Entwerfen analoger Schaltungen
WO2020193481A1 (fr) Procédé et dispositif d'apprentissage et de réalisation d'un réseau neuronal artificiel
WO2020193294A1 (fr) Procédé et dispositif destinés à commander de manière compatible un appareil avec un nouveau code de programme
EP1159655B1 (fr) Systeme d'automatisation comportant des objets d'automatisation composes d'elements modulaires
WO2002061615A2 (fr) Systeme informatique
DE19531635C1 (de) Ordnungsverfahren für Zugehörigkeitsfunktionswerte lingustischer Eingangswerte in einem Fuzzy-Logic-Prozessor und Anordnungen zu deren Durchführung
DE60007333T2 (de) Trainierbares, anpassungfähiges fokussiertes replikatornetzwerk zur datenanalyse
DE102012025351A1 (de) Verarbeitung eines elektronischen Dokuments
EP0262462A2 (fr) Méthode pour l'interprétation de documents d'après des formulaires

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase