EP2277116A1 - Creation of a category tree over the content of a data set - Google Patents

Creation of a category tree over the content of a data set

Info

Publication number
EP2277116A1
EP2277116A1 EP08758423A EP08758423A EP2277116A1 EP 2277116 A1 EP2277116 A1 EP 2277116A1 EP 08758423 A EP08758423 A EP 08758423A EP 08758423 A EP08758423 A EP 08758423A EP 2277116 A1 EP2277116 A1 EP 2277116A1
Authority
EP
European Patent Office
Prior art keywords
word
words
database
word list
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP08758423A
Other languages
German (de)
English (en)
Inventor
Jörg Wurzer
Christian Magnus
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iqser IP GmbH
Original Assignee
Iqser IP AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iqser IP AG
Publication of EP2277116A1
Withdrawn (current legal status)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus

Definitions

  • Creation of a category tree over the content of a data stock comprising information objects, wherein the information objects of the data stock are indexed in an index.
  • the present invention relates to methods for automatically creating a category tree on the content of all the texts of a database.
  • the subject matter of the present invention is furthermore a data processing system with data representing information in at least one dataset accessible via at least one data source, which is designed and / or set up to at least partially execute a method according to the invention.
  • the subject matter of the present invention is furthermore a data processing device for the electronic processing of data, comprising a control and/or arithmetic unit, an input unit and an output unit, which is designed and/or configured to carry out a method according to the invention at least partially, preferably using at least part of a data processing system according to the invention.
  • Methods, data processing systems and / or data processing devices of the type mentioned above are used in the context of search applications or routines, for example by operating systems and / or by so-called search engines, as well as in the context of the organization, provision and / or delivery of information.
  • Methods, systems and devices for the electronic processing of data are known in the prior art in numerous embodiments, in particular from WO 2005/050471 A2, the disclosures of which are hereby explicitly referenced.
  • contents, as data representing information in a data stock, are processed by machine, in particular so that they can be made available to users as a technical aid for solving tasks.
  • data stocks are simple, universally usable, persistent information or data objects, such as files and/or documents in operating systems or databases, which contain structure, content and, where required, management information.
  • the data stocks are usually accessible to a data processing system and/or a data processing device via at least one data source, typically a data carrier that is present in a data processing system or connectable via a communication network, such as a hard disk or similar data recording means.
  • the invention is based on the object of enabling a user of methods, data processing systems and/or data processing devices to obtain, in a simple manner, an overview of the contents of data stocks, in particular unstructured and/or poorly comprehensible data stocks.
  • the present invention proposes a method for the automatic generation of a category tree over the content of a data stock comprising information objects, wherein the information objects of the data stock are indexed in an index, which is characterized by the following method steps:
  • An index or database index in the sense of the present invention is an index structure kept separate from the data structure in a database.
  • the index advantageously accelerates the search and / or sorting for specific fields.
  • An index advantageously consists of a collection of pointers that define an ordering relation on one or more columns in a table. If an indexed column is used as the search criterion in a query, the database management system (DBMS) or a similar system generally searches for the desired data records on the basis of these pointers or references.
  • a list in the sense of the present invention is a dynamic data structure with a finite number of elements. In this case, a storage of a previously not determined number of interrelated values of simple and / or composite data types is made possible.
  • Stop words in the sense of the present invention are words which are ignored in the case of full-text indexing, since they occur very frequently and are generally of no relevance for capturing the content of a document.
  • Commonly used stop words in German-language documents are, for example, definite articles such as "der", "die" and "das". Stop words are distinguished in particular by the fact that they assume grammatical and/or syntactic functions and therefore generally do not allow conclusions to be drawn about the content.
  • filtering out stop words serves to increase the efficiency of search engines: if stop words were considered in a search, the result set would include almost every document in the data stock.
  • a selection in the sense of the present invention is a selection of data objects from a data set, in particular in connection with relational databases or relational database systems.
  • An advantageous embodiment of the invention provides that in method step 3, when determining a significance value for each word in the word list, the significance value is determined from the quotient of the word frequency of the word within the information object and the word frequency of the word within the entire index.
  • a further embodiment of the invention provides that the predeterminable maximum number in method step 5 is limited to 50.
  • An advantageous embodiment of the invention provides that in method step 6, when storing the reduced word list in a table, each word in the table is assigned its significance value and, where the new significance value is higher than the significance value already stored for an existing word, the higher significance value is used.
  • a further advantageous proposal of the invention provides that in method step 8, when storing the co-occurrences in a database, the database comprises a table of co-occurrences (word 1 and word 2) with a frequency value in each table row, wherein the frequency value is increased by 1 if the co-occurrence (word 1 and word 2) already exists in the table.
  • the predefinable maximum number in method step 15 is limited to 20.
  • the created category tree is at least partially reproduced by a display device of a computer system, preferably in graphical form.
  • the present invention further proposes a method for the automatic generation of a category tree on the content of all texts of a data stock, which is characterized by the following method steps: 1. Creating word sets with a preferably predeterminable number of meaningful words for each text of the dataset;
  • a further embodiment of the invention provides that the word list created in method step 3 is at least partially reproduced by a display device of a computer system, preferably in graphical form.
  • a further advantageous embodiment of the invention is characterized in that the word list created in method step 3 is sorted in descending order according to the frequency of the respective words, so that the most important terms stand at the beginning of the word list.
  • a further advantageous embodiment of the invention provides that in step 5, when determining co-occurrences in the stored word list, each word in the word list is compared bit by bit with the words of each word set.
  • a further advantageous embodiment of the invention is characterized in that the word list stored in method step 6 is at least partially reproduced by a display device of a computer system, preferably in graphic form.
  • the category tree is consolidated for display by a display device, preferably with a similarity check.
  • a particularly advantageous proposal of the invention is characterized in that, in the context of the similarity check, words with different word endings but the same word stem are combined into the shortest possible variant (word version).
  • two words of different length are respectively compared by shortening the longer word by two letters, then bringing the shorter word to the length of the other word and then checking the two words for a match.
  • a further advantageous embodiment of the invention is characterized in that, when determining co-occurrences in method step 5 and/or in method step 8, a similarity check is carried out, whereby words with different word endings but the same word stem are combined into the shortest possible variant (word version).
  • the predetermined number in method step 1 is limited to up to 32.
  • the present invention further proposes a method for the automatic generation of a category tree on the content of all texts of a data stock, which is characterized by the following method steps:
  • the word list created in method step 3 is at least partially reproduced by a display device of a computer system, preferably in graphical form.
  • the category tree for display by a display device is consolidated, preferably with a similarity check.
  • words with different word endings but the same root word are combined into the shortest possible variant (word version).
  • a further advantageous embodiment of the invention is characterized in that, in the context of the similarity check, two words of different length are respectively compared by shortening the longer word by two letters, then bringing the shorter word to the length of the other word and then checking the two words for a match.
  • a further advantageous embodiment of the invention is characterized by a graphical user interface for inputting and / or reproducing word lists, links and / or at least one level of at least one category tree.
  • the graphical user interface is furthermore designed and/or set up for inputting, changing and/or reproducing data representing information in at least one data stock.
  • the user interface advantageously provides a graphical user interface that enables action-oriented navigation.
  • the category tree created according to the invention is implemented in the user interface as a tree structure in which the generic terms are displayed first; the user can bring the associated sub-terms into view by pointing to a button that the user interface provides for this purpose and displays with the generic term, and selecting or activating it by clicking.
  • the user can advantageously also move or navigate in further levels of the category tree.
  • a search engine or a search engine system advantageously performs a full-text search via the index with all terms of the selected path in the category tree, for example a generic term, its sub-term and, in turn, that term's sub-term. It is also advantageously possible to select only a generic term for the search.
  • the reproduction takes place at least partially in selectable form; that is, the reproduced categories of the category tree created according to the invention are themselves linked, for example as menu items for action options and/or in the manner of a hyperlink, and can be used accordingly by selecting them, for example by clicking.
  • the data processing system is used, preferably as part of software running on a computer, for the dynamic organization of information and/or processes.
  • the data processing system according to the invention is part of a database application or at least usable together with a database application.
  • the present invention furthermore relates to a data processing device for the electronic processing of data, comprising a control and/or arithmetic unit, an input unit and an output unit, which is designed and/or set up to carry out a method according to the invention at least partially, preferably using at least part of a data processing system according to the invention.
  • a data processing device for the electronic processing of data, with a control and/or computing unit, an input unit and an output unit, is provided, which is characterized by the use of a data processing system according to the invention.
  • the data processing device is designed as a mobile terminal, preferably as a mobile terminal usable or operable in mobile networks. Particularly preferred is an embodiment of the data processing device as a mobile phone.
  • with a category tree according to the invention, the user gets an overview of the contents of a data stock, advantageously also of unstructured data stocks that otherwise cannot be surveyed easily.
  • facts and/or relationships become transparent, for example that the texts of one or more data stocks are about philosophy and that ethics is a discipline within philosophy.
  • according to the invention it can be seen, for example, from a stock of philosophical publications who has published in the field of ethics and may thus be counted among the philosophers.
  • the result of an automatic analysis of the terms in a data stock is a category tree according to the invention. At the top are generally terms that form upper categories. Subcategories are assigned to the respective upper categories, and further subcategories to these in turn.
  • the ramification of the category tree according to the invention can advantageously be continued arbitrarily until all significant terms from a database have experienced one or more assignments.
  • according to the invention, the user can now select categories and subcategories in the tree and receives a corresponding selection of the data.
  • the selection is advantageously based on a search query that touches or affects the terms from the selected path of the category tree.
  • a taxonomy is advantageously created on the basis of co-occurrence, that is, the simultaneous occurrence of words.
  • FIG. 1 shows in a flowchart an embodiment of the creation of a category tree according to the invention over the content of a data stock.
  • FIG. 2 shows in a flowchart a further embodiment of the creation of a category tree according to the invention over the content of a data stock.
  • FIG. 3 shows in a flowchart a further embodiment of the creation of a category tree according to the invention over the content of a data stock.
  • stop words are filtered out using a list and a word list is created. There is a significance value for each word. This results from the quotient of word frequency within the document and the word frequency in the entire index.
  • the word list is sorted by significance and reduced to the top 50.
  • This value 50 can be configured.
  • the top 50 are stored in a table. Each word is assigned its significance value. If the new value is higher than one already stored for that word, the higher value is taken.
  • the co-occurrences are derived and stored in a database.
  • there is a table of co-occurrences (word 1 and word 2) with a frequency value in each table row. If a co-occurrence is already in the table, its frequency value is increased by 1.
  • the co-occurrence table is searched for the words that have the highest significance but do not co-occur with one another. They form the first level of the category tree. For all further levels of the category tree, the words determined for the first level are iterated step by step. For each word, the words that co-occur with it are selected from the co-occurrence table. From these, the words with an above-average frequency are selected. This list is limited to 20 entries and sorted by frequency.
  • word sets with the 32 most meaningful words are created and stored in a database.
  • the word set is stored in a relational database in the form of a word list whose words are each linked to an ID for the word set. From these word sets, a word list is created which can be displayed. It forms the first level of the category tree. This word list can be sorted in descending order of frequency, so that the most important terms are at the beginning. It may happen that words with the same meaning but a different grammatical case or inflection form separate categories. The category tree can therefore be subsequently consolidated for display. Words with different endings but the same stem are combined into the shortest variant. Two words of different lengths are compared by shortening the longer word by two letters. The shorter word is then brought to the length of the other word and the two are checked for a match.
  • the word combinations can again be selected as the starting point.
  • the 32 most meaningful words are extracted and stored in a database. From the word sets, a word list is extracted that corresponds to the first level of the category tree. As described for the first method, the list can be consolidated.
  • the word list is iterated and each word is compared with all words of every word set. If two words match, taking the similarity check into account, a link with a weight of 0.1 is created between that word and all other words of the word set. If this link already exists, its weight is increased by 0.1. If the value exceeds 1, it is reset to 0.9 and all other links are reduced to 90% of their value.
  • the links that are linked to both the first and second terms are selected.
  • the links that are linked to both the first, second, and third terms are selected.
  • the embodiments of the invention illustrated in the figures of the drawing and explained in connection with the description serve only to illustrate the invention and do not limit it.
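
As an illustration only, the significance value (word frequency within the information object divided by word frequency within the whole index) and the similarity check described above might be sketched in Python as follows. The function names, the sample stop-word set and the exact reading of the "shorten by two letters" comparison are assumptions made for this sketch; they are not taken from the application.

```python
from collections import Counter

# Hypothetical, illustrative stop-word set; a real list would be far longer.
STOP_WORDS = {"der", "die", "das", "und"}


def significance(doc_words, index_freq, stop_words=STOP_WORDS, top_n=50):
    """Return up to top_n (word, significance) pairs for one information object.

    significance = frequency in the object / frequency in the whole index.
    Words missing from index_freq (or with frequency 0) are skipped.
    """
    doc_freq = Counter(w for w in doc_words if w not in stop_words)
    scored = {w: n / index_freq[w] for w, n in doc_freq.items() if index_freq.get(w)}
    # Highest significance first; ties are broken alphabetically.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def similar(a, b):
    """One possible reading of the similarity check (an assumption):
    cut the longer word by two letters, trim the shorter word to that
    length, and compare the results."""
    if len(a) < len(b):
        a, b = b, a                  # a is now the longer (or equal-length) word
    if len(a) == len(b):
        return a == b
    stem = a[:-2]
    if not stem:                     # words too short to compare meaningfully
        return False
    return b[:len(stem)] == stem
```

Under this reading, `similar("Philosophie", "Philosophien")` is true because both reduce to "Philosophi", while `similar("category", "cat")` is false. The default of 50 for `top_n` mirrors the configurable limit mentioned in the description.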

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Databases & Information Systems
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Computational Linguistics
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

The present invention relates to a method for automatically creating a category tree over the content of a data set. According to the invention, a taxonomy of the data set is established on the basis of co-occurrences. The subject matter of the present invention is furthermore a data processing system with data representing information in at least one data set accessible via at least one data source, which is configured and/or designed to carry out a method according to the invention at least partially. The subject matter of the present invention is, in addition, a data processing device for the electronic processing of data, comprising a control and/or computing unit, an input unit and an output unit, which is configured and/or designed to carry out a method according to the invention at least partially, preferably using at least part of a data processing system according to the invention.
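
A taxonomy built from co-occurrences could look roughly like the following Python sketch. It follows the first embodiment in the description: a table of word pairs with a frequency value, a first level made of high-significance words that do not co-occur with one another, and sub-levels made of co-occurring words with above-average frequency, limited to 20. All function names, the greedy selection order and the strict "above average" comparison are assumptions of this sketch, not details confirmed by the application.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(word_lists):
    """Count how often each unordered word pair shares a reduced word list."""
    table = Counter()
    for words in word_lists:
        for pair in combinations(sorted(set(words)), 2):
            table[pair] += 1         # an existing pair has its frequency raised by 1
    return table


def first_level(significance, table):
    """Greedily pick high-significance words that do not co-occur with each other."""
    chosen = []
    for word in sorted(significance, key=lambda w: (-significance[w], w)):
        if all(tuple(sorted((word, c))) not in table for c in chosen):
            chosen.append(word)
    return chosen


def children(word, table, limit=20):
    """Words co-occurring with `word` at above-average frequency, most frequent first."""
    partners = {}
    for (a, b), n in table.items():
        if word == a:
            partners[b] = n
        elif word == b:
            partners[a] = n
    if not partners:
        return []
    mean = sum(partners.values()) / len(partners)
    ranked = sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, n in ranked if n > mean][:limit]
```

Calling `children` again on each returned word yields the next level of the tree. With a strict comparison, a word whose partners all have the same frequency gets no sub-terms; the description does not say whether that is the intended behavior.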
EP08758423A 2008-05-08 2008-05-08 Creation of a category tree over the content of a data set Withdrawn EP2277116A1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/003723 WO2009135511A1 (fr) 2008-05-08 2008-05-08 Creation of a category tree over the content of a data set

Publications (1)

Publication Number Publication Date
EP2277116A1 (fr) 2011-01-26

Family

ID=40010800

Family Applications (1)

Application Number Title Priority Date Filing Date
EP08758423A Withdrawn EP2277116A1 (fr) 2008-05-08 2008-05-08 Creation of a category tree over the content of a data set

Country Status (3)

Country Link
US (1) US8745069B2 (fr)
EP (1) EP2277116A1 (fr)
WO (1) WO2009135511A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102012109096A1 (de) 2012-09-26 2014-03-27 Iqser Ip Ag Method for the sequential provision of data representing personalized information, in particular in the form of videos and the like, in particular for a personalized television program

Citations (1)

Publication number Priority date Publication date Assignee Title
US20050149873A1 (en) * 2003-12-15 2005-07-07 Guido Patrick R. Methods, systems and computer program products for providing multi-dimensional tree diagram graphical user interfaces

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US6999959B1 (en) * 1997-10-10 2006-02-14 Nec Laboratories America, Inc. Meta search engine
US7047242B1 (en) * 1999-03-31 2006-05-16 Verizon Laboratories Inc. Weighted term ranking for on-line query tool
US6411962B1 (en) * 1999-11-29 2002-06-25 Xerox Corporation Systems and methods for organizing text
US7284191B2 (en) * 2001-08-13 2007-10-16 Xerox Corporation Meta-document management system with document identifiers
US20060004732A1 (en) * 2002-02-26 2006-01-05 Odom Paul S Search engine methods and systems for generating relevant search results and advertisements
US7280957B2 (en) * 2002-12-16 2007-10-09 Palo Alto Research Center, Incorporated Method and apparatus for generating overview information for hierarchically related information
WO2005050471A2 (fr) 2003-11-22 2005-06-02 Wurzer Joerg System and device for processing data
US7698267B2 (en) * 2004-08-27 2010-04-13 The Regents Of The University Of California Searching digital information and databases
US7865495B1 (en) * 2004-10-06 2011-01-04 Shopzilla, Inc. Word deletion for searches
US7630980B2 (en) * 2005-01-21 2009-12-08 Prashant Parikh Automatic dynamic contextual data entry completion system
WO2009030246A1 (fr) * 2007-09-03 2009-03-12 Iqser Ip Ag Detection of correlations between data representing information

Non-Patent Citations (1)

Title
See also references of WO2009135511A1 *

Also Published As

Publication number Publication date
US8745069B2 (en) 2014-06-03
WO2009135511A1 (fr) 2009-11-12
US20110113043A1 (en) 2011-05-12

Similar Documents

Publication Publication Date Title
DE69433165T2 (de) Associative text search and retrieval system
DE69811066T2 (de) Data summarization device
DE69834386T2 (de) Text processing method and retrieval system and method
EP2188742A1 (fr) Detection of correlations between data representing information
WO2006018041A1 (fr) Speech and text analysis device and corresponding method
EP2193456A1 (fr) Detection of correlations between data representing information
DE69719641T2 (de) A method for presenting information on display devices of various sizes
DE10028624A1 (de) Method and device for document retrieval
EP2221735A2 (fr) Method for the automated classification of a text by a computer program
EP2193455A1 (fr) Detection of correlations between data representing information
EP1685505B1 (fr) Data processing system
WO2009135511A1 (fr) Creation of a category tree over the content of a data set
DE10218905A1 (de) Method and device for access control in knowledge networks
EP2193457A1 (fr) Detection of correlations between data representing information
EP1412875A2 (fr) Method for processing text in a computer, and computer
WO2005116867A1 (fr) Method and system for the automated generation of computer-assisted control and analysis devices
WO2012025439A1 (fr) Method for searching a plurality of data sets, and search engine
WO2021204849A1 (fr) Method and computer system for determining the relevance of a text
DE69132678T2 (de) A text management system
DE102006043158A1 (de) Method for determining elements of a search result assigned to a search query in an order, and search engine
DE60106209T2 (de) Process for extracting keywords
DE10331817A1 (de) Computer-aided method for automatically querying information from databases structured as a semantic network
WO2011044864A1 (fr) Method and system for classifying objects
DE102009028601A1 (de) Electronic research system
DE10229598A1 (de) Data processing system and method for carrying out data searches

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA MK RS

17P Request for examination filed

Effective date: 20101202

RIN1 Information on inventor provided before grant (corrected)

Inventor name: MAGNUS, CHRISTIAN

Inventor name: WURZER, JOERG

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20160527

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

APBK Appeal reference recorded

Free format text: ORIGINAL CODE: EPIDOSNREFNE

APBN Date of receipt of notice of appeal recorded

Free format text: ORIGINAL CODE: EPIDOSNNOA2E

APBR Date of receipt of statement of grounds of appeal recorded

Free format text: ORIGINAL CODE: EPIDOSNNOA3E

APAF Appeal reference modified

Free format text: ORIGINAL CODE: EPIDOSCREFNE

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: IQSER IP GMBH

APBT Appeal procedure closed

Free format text: ORIGINAL CODE: EPIDOSNNOA9E

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20210325