EP2277116A1 - Creation of a category tree over the content of a data set - Google Patents
Creation of a category tree over the content of a data set
- Publication number
- EP2277116A1 (application EP08758423A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- word
- words
- database
- word list
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Definitions
- Category tree over the content of a database comprising information objects, the information objects of the database being indexed in an index.
- the present invention relates to methods for automatically creating a category tree on the content of all the texts of a database.
- the subject matter of the present invention is furthermore a data processing system with data representing information in at least one dataset accessible via at least one data source, which is designed and / or set up to at least partially execute a method according to the invention.
- the subject of the present invention is furthermore a data processing device for the electronic processing of data, comprising a control and/or arithmetic unit, an input unit and an output unit, which is designed and/or configured to carry out a method according to the invention at least partially, preferably using at least part of a data processing system according to the invention.
- Methods, data processing systems and / or data processing devices of the type mentioned above are used in the context of search applications or routines, for example by operating systems and / or by so-called search engines, as well as in the context of the organization, provision and / or delivery of information.
- Methods, systems and devices for the electronic processing of data are known in the prior art in numerous embodiments, in particular from WO 2005/050471 A2, the disclosures of which are hereby explicitly referenced.
- contents, as data of a database that represent information, are processed by machine, in particular so that they can be made available to users and/or serve as a technical aid for solving tasks.
- data stocks are simple, universally usable, persistent information or data objects which, like files and/or documents in operating systems or databases in particular, contain structure, content and, where required, management information.
- data stocks are usually accessible to a data processing system and/or a data processing device via at least one data source, typically a data carrier that is present in a data processing system or can be connected via a communication network, such as a hard disk or similar data recording means.
- the invention is based on the object of enabling a user of methods, data processing systems and/or data processing devices to obtain, in a simple manner, an overview of the contents of data stocks, in particular of unstructured and/or poorly comprehensible data stocks.
- the present invention proposes a method for the automatic generation of a category tree via the content of a data object comprising information objects, wherein the information objects of the data are indexed in an index, which is characterized by the following method steps:
- An index or database index in the sense of the present invention is an index structure that is kept separate from the data structure in a database.
- the index advantageously accelerates the search and / or sorting for specific fields.
- An index advantageously consists of a collection of pointers that define an ordering relation on one or more columns of a table. If an indexed column is used as the search criterion in a query, the database management system (DBMS) or a similar system generally searches for the desired data records on the basis of these pointers or references.
- a list in the sense of the present invention is a dynamic data structure with a finite number of elements. It allows a number of interrelated values of simple and/or composite data types that is not determined in advance to be stored.
- Stop words in the sense of the present invention are words which are ignored in the case of full-text indexing, since they occur very frequently and are generally of no relevance for capturing the content of a document.
- Commonly used stop words in German-language documents are, for example, definite articles such as "der", "die" and "das". Stop words are distinguished in particular by the fact that they perform grammatical and/or syntactic functions and therefore generally do not allow conclusions to be drawn about the content of a document.
- filtering out stop words serves to increase the efficiency of search engines: if stop words were taken into account in a search, the result set would include almost every document in the data stock.
- a selection in the sense of the present invention is a selection of data objects from a data set, in particular in connection with relational databases or relational database systems.
- An advantageous embodiment of the invention provides that in method step 3, when determining a significance value for each word in the word list, the significance value is determined from the quotient of the word frequency of the word within the information object and the word frequency of the word within the entire index.
- a further embodiment of the invention provides that the predeterminable maximum number in method step 5 is limited to 50.
- An advantageous embodiment of the invention provides that in method step 6, when storing the reduced word list in a table, each word in the table is assigned its significance value and, where a word already exists in the table with a lower significance value, the higher significance value is used.
- a further advantageous proposal of the invention provides that in method step 8, when storing the co-occurrences in a database, the database comprises a table of co-occurrences (word 1 and word 2) with a frequency value in each table row, and that the frequency value is increased by 1 if a co-occurrence (word 1 and word 2) is already present in the table.
- the predefinable maximum number in method step 15 is limited to 20.
- the created category tree is at least partially reproduced by a display device of a computer system, preferably in graphical form.
- the present invention further proposes a method for the automatic generation of a category tree on the content of all texts of a data stock, which is characterized by the following method steps: 1. Creating word sets with a preferably predeterminable number of meaningful words for each text of the dataset;
- a further embodiment of the invention provides that the word list created in method step 3 is at least partially reproduced by a display device of a computer system, preferably in graphical form.
- a further advantageous embodiment of the invention is characterized in that the word list created in method step 3 is sorted in descending order according to the frequency of the respective words, so that the most important terms stand at the beginning of the word list.
- a further advantageous embodiment of the invention provides that in step 5, when determining co-occurrences in the stored word list, each word in the word list is compared bit by bit with the words of each word set.
- a further advantageous embodiment of the invention is characterized in that the word list stored in method step 6 is at least partially reproduced by a display device of a computer system, preferably in graphic form.
- the category tree is consolidated for display by a display device, preferably with a similarity check.
- a particularly advantageous proposal of the invention is characterized in that in the context of the similarity check words with different word endings but the same root word to the shortest possible variant (word version) are summarized.
- two words of different length are respectively compared by shortening the longer word by two letters, then bringing the shorter word to the length of the other word and then checking the two words for a match.
- a further advantageous embodiment of the invention is characterized in that, when determining co-occurrences in method step 5 and/or in method step 8, a similarity check is carried out, whereby words with different word endings but the same word stem are combined into the shortest possible variant (word version).
- the predeterminable number in method step 1 is limited to a maximum of 32.
- the present invention further proposes a method for the automatic generation of a category tree on the content of all texts of a data stock, which is characterized by the following method steps:
- the word list created in method step 3 is at least partially reproduced by a display device of a computer system, preferably in graphical form.
- the category tree for display by a display device is consolidated, preferably with a similarity check.
- words with different word endings but the same root word are combined into the shortest possible variant (word version).
- a further advantageous embodiment of the invention is characterized in that, in the context of the similarity check, two words of different length are compared by shortening the longer word by two letters, then bringing the shorter word to the length of the other word and then checking the two words for a match.
- a further advantageous embodiment of the invention is characterized by a graphical user interface for inputting and / or reproducing word lists, links and / or at least one level of at least one category tree.
- the graphical user interface is furthermore designed and/or set up for inputting, changing and/or reproducing data representing information in at least one database.
- the user interface advantageously provides a graphical user interface that enables action-oriented navigation.
- the category tree created according to the invention is implemented in the user interface as a tree structure in which the generic terms are reproduced or displayed first. The user can bring up the associated sub-terms by pointing to a button provided for this purpose by the user interface, which is displayed with the generic term and is selected or activated by a so-called click.
- the user can advantageously also move or navigate in further levels of the category tree.
- a search engine or a search engine system advantageously uses a full-text search via the index with all terms of the selected path in the category tree, for example a generic term, its sub-term and, in turn, that sub-term's sub-term. It is also advantageously possible to select only a generic term for the search.
- the reproduction takes place at least partially in selectable form; that is, the reproduced categories of the category tree created according to the invention are themselves, for example, linked as menu items for action options and/or in the manner of a link, and can be used accordingly by selection, for example by a so-called "click".
- the data processing system is used, preferably as software running on a computer, for the dynamic organization of information and/or processes.
- the data processing system according to the invention is part of a database application or at least usable together with a database application.
- the present invention furthermore relates to a data processing device for the electronic processing of data, comprising a control and/or arithmetic unit, an input unit and an output unit, which is designed and/or set up to carry out a method according to the invention at least partially, preferably using at least part of a data processing system according to the invention.
- a data processing device for the electronic processing of data, with a control and/or computing unit, an input unit and an output unit, is provided, which is characterized by the use of a data processing system according to the invention.
- the data processing device is designed as a mobile terminal, preferably as a mobile terminal that can be used or operated in mobile networks. Particularly preferred is an embodiment of the data processing device as a mobile phone.
- with a category tree according to the invention, the user gets an overview of the contents of a data stock, advantageously also of unstructured data stocks, which otherwise cannot easily be surveyed.
- facts and/or relationships become transparent, for example that the texts of one or more databases are about philosophy and that ethics is a discipline within philosophy.
- according to the invention it can be seen, for example, from a stock of philosophical publications who has published in the field of ethics and thus counts among the philosophers.
- the result of an automatic analysis of the terms in a database is a category tree according to the invention. At the top of the list are generally terms that form upper categories. The respective upper categories are assigned to subcategories, these in turn to further subcategories.
- the ramification of the category tree according to the invention can advantageously be continued arbitrarily until all significant terms from a database have experienced one or more assignments.
- the user can now according to the invention select categories and subcategories in the tree and receives a corresponding selection of the data.
- the selection is advantageously based on a search query that touches or affects the terms from the selected path of the category tree.
- a taxonomy is advantageously created on the basis of co-occurrence, that is, the simultaneous occurrence of words.
- FIG. 1 shows in a flowchart an embodiment of a creation of a category tree according to the invention over the content of a data stock.
- FIG. 2 shows in a flowchart a further embodiment of a creation of a category tree according to the invention over the content of a data stock.
- FIG. 3 shows in a flow chart a further embodiment of a creation of a category tree according to the invention over the content of a data stock.
- stop words are filtered out using a list and a word list is created. There is a significance value for each word. This results from the quotient of word frequency within the document and the word frequency in the entire index.
- the word list is sorted by significance and reduced to the top 50.
- This value 50 can be configured.
- the top 50 are stored in a table. Each word is assigned its significance value. If a word's value is higher than the one already stored for it, the higher value is taken.
- the co-occurrences are derived and stored in a database.
- there is a table of co-occurrences (word 1 and word 2) with a frequency value in each table row. If a co-occurrence is already in the table, the frequency value is increased by 1.
- the search is for words in the co-occurrence table that have the highest significance but do not co-occur with one another. They form the first level of the category tree. For all other levels of the category tree, the words determined for the first level are iterated step by step. For each word, the words that co-occur with it are selected from the co-occurrence table. From these, the words with an above-average frequency are selected. This list is limited to 20 and sorted by frequency.
- word sets with the 32 least significant words are created and stored in a database.
- the word set is stored in a relational database in the form of a word list whose words are each linked to an ID for the word set. From these word sets, a word list is created which can be displayed. It forms the first level of the term tree. It is possible to sort this list of words in descending order of frequency, so that the most important terms are at the beginning. It may happen that words with the same meaning but a different grammatical case or inflection form separate categories. The term tree can therefore be subsequently consolidated for display. Words with different endings but the same root are combined into the shortest variant. Two words of different lengths are compared by shortening the longer word by two letters. The shorter word is then brought to the length of the other word and the two are checked for a match.
- the word combinations can again be selected as the starting point.
- the 32 least words are extracted and stored in a database. From the word sets, a word list is extracted that corresponds to the first level of the category tree. As described in the first procedure, the list can be consolidated.
- the word list is iterated and each word is compared with all words of every word set. If two words match, taking the similarity check into account, a link with a weight of 0.1 is made between that word and all the other words of the word set. If this link already exists, the weight of the link is increased by 0.1. If the value exceeds 1, it is reset to 0.9 and all other links are reduced to 90% of their value.
- the links that are linked to both the first and second terms are selected.
- the links that are linked to both the first, second, and third terms are selected.
- the embodiments of the invention illustrated in the figures of the drawing and explained in connection with the description merely serve to illustrate the invention and do not limit it.
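The significance calculation described for the first embodiment (stop-word filtering, the quotient of the word frequency within the document and the word frequency in the entire index, reduction to the top 50, and keeping the higher value in the table) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation from the application: the tokenizer, the three-word stop list, and the names `significant_words`, `merge_into_table` and `index_freq` are hypothetical and introduced here for the example.

```python
from collections import Counter

# Illustrative stop list only; a real list would be far longer (assumption).
STOP_WORDS = {"der", "die", "das"}
PUNCT = '.,;:!?"()'


def tokenize(text):
    """Naive tokenizer (assumption): split on whitespace, strip punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


def significant_words(doc_text, index_freq, max_words=50):
    """Return up to max_words words of one document with their significance.

    significance = frequency within the document / frequency in the whole index.
    index_freq maps each word to its frequency in the entire index.
    """
    words = [w for w in tokenize(doc_text) if w not in STOP_WORDS]
    doc_freq = Counter(words)
    # Words that are missing from the index are skipped (assumption).
    scored = {w: n / index_freq[w] for w, n in doc_freq.items() if index_freq.get(w)}
    top = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:max_words]
    return dict(top)


def merge_into_table(table, word_scores):
    """Store words in the table; if a word exists already, keep the higher value."""
    for word, score in word_scores.items():
        if score > table.get(word, 0.0):
            table[word] = score
    return table
```

The limit of 50 is passed as `max_words` because the description states that this value can be configured.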
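The co-occurrence table and the derivation of tree levels from it, as described for the first embodiment, might look roughly like the sketch below. Several points are assumptions rather than facts from the description: a co-occurrence is taken to mean that two words appear in the same reduced word list, the first level is chosen greedily in order of descending significance, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def count_cooccurrences(word_lists):
    """Build the (word 1, word 2) -> frequency table.

    Each pair is stored once, in sorted order; a pair that is seen again
    has its frequency value increased by 1.
    """
    table = defaultdict(int)
    for words in word_lists:
        for w1, w2 in combinations(sorted(set(words)), 2):
            table[(w1, w2)] += 1
    return table


def cooccur(table, a, b):
    """Frequency with which a and b co-occur (0 if they never do)."""
    return table.get((min(a, b), max(a, b)), 0)


def first_level(table, significance):
    """Greedy reading (assumption): walk the words by descending significance
    and keep each word that does not co-occur with any word kept so far."""
    words = {w for pair in table for w in pair}
    level = []
    for w in sorted(words, key=lambda w: (-significance.get(w, 0.0), w)):
        if all(cooccur(table, w, chosen) == 0 for chosen in level):
            level.append(w)
    return level


def children(table, word, max_children=20):
    """Words co-occurring with `word` at above-average frequency,
    sorted by frequency and limited to max_children."""
    partners = {}
    for (w1, w2), n in table.items():
        if w1 == word:
            partners[w2] = n
        elif w2 == word:
            partners[w1] = n
    if not partners:
        return []
    mean = sum(partners.values()) / len(partners)
    above = sorted(((w, n) for w, n in partners.items() if n > mean),
                   key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in above[:max_children]]
```

Applying `children` to each first-level word, and then again to each result, yields the deeper levels. With a strict "above average" test, a word that has only one co-occurring partner gets no children; whether the application intends that edge case is not stated.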
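The similarity check used for consolidation (shorten the longer word by two letters, bring the shorter word to the same length, compare) leaves some details open. The sketch below is one possible reading and makes these assumptions explicit: words of equal length are compared directly, a shorter word that cannot reach the length of the shortened longer word counts as no match, and the names `similar` and `consolidate` are hypothetical.

```python
def similar(a, b):
    """Similarity check as read from the description (see assumptions above)."""
    if len(a) == len(b):
        return a == b  # assumption: equal-length words must match exactly
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    stem = longer[:-2]  # shorten the longer word by two letters
    if not stem or len(shorter) < len(stem):
        return False  # assumption: too short to be brought to the same length
    return shorter[:len(stem)] == stem


def consolidate(words):
    """Combine similar words, keeping the shortest variant of each group."""
    variants = []
    for word in words:
        for i, kept in enumerate(variants):
            if similar(word, kept):
                if len(word) < len(kept):
                    variants[i] = word  # a shorter variant replaces the longer one
                break
        else:
            variants.append(word)
    return variants
```

For example, `consolidate(["Kategorien", "Kategorie", "Ethik", "Ethiken"])` keeps only `"Kategorie"` and `"Ethik"`. A check this simple will also merge unrelated words that happen to share a prefix, which is presumably why the description treats consolidation as optional.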
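The link weighting of the further embodiment (a new link starts at 0.1, each repetition adds 0.1, and a link exceeding 1 is reset to 0.9 while the other links are reduced to 90%) can be illustrated as follows. The description does not say which "other links" are scaled; this sketch assumes it means the other links starting from the same word. The name `reinforce_links`, the `matches` parameter and the rounding used to keep the floating-point sums stable are likewise assumptions.

```python
STEP = 0.1


def reinforce_links(links, word_list, word_sets, matches=lambda a, b: a == b):
    """Update link weights in place and return the links dictionary.

    links maps (word, other_word) to a weight. `matches` stands in for the
    comparison including the similarity check; plain equality by default.
    """
    for word in word_list:
        for word_set in word_sets:
            if not any(matches(word, other) for other in word_set):
                continue  # the word does not occur in this word set
            for other in word_set:
                if matches(word, other):
                    continue  # no link from a word to itself
                key = (word, other)
                # New links start at 0.1, existing ones grow by 0.1.
                links[key] = round(links.get(key, 0.0) + STEP, 10)
                if links[key] > 1.0:
                    # Assumption: scale only the other links of the same word.
                    for k in links:
                        if k != key and k[0] == word:
                            links[k] *= 0.9
                    links[key] = 0.9
    return links
```

The reset keeps frequently confirmed links near the top of the range while letting links that are no longer confirmed decay gradually.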
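The selection of data by a path in the category tree, where the full-text search uses all terms of the selected path, amounts to intersecting the index entries of those terms. The following sketch assumes a simple in-memory inverted index; `build_index`, `select_by_path` and the whitespace tokenization are hypothetical and stand in for whatever index the data stock actually uses.

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def select_by_path(index, path):
    """Return the IDs of documents that contain every term of the path."""
    postings = [index.get(term.lower(), set()) for term in path]
    if not postings:
        return set()
    return set.intersection(*postings)
```

Selecting only a generic term corresponds to a path of length one; each further level of the path narrows the result set.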
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2008/003723 WO2009135511A1 (fr) | 2008-05-08 | 2008-05-08 | Creation of a category tree over the content of a data set |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2277116A1 (fr) | 2011-01-26 |
Family
ID=40010800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP08758423A (Withdrawn), EP2277116A1 (fr) | Creation of a category tree over the content of a data set | 2008-05-08 | 2008-05-08 |
Country Status (3)
Country | Link |
---|---|
US (1) | US8745069B2 (fr) |
EP (1) | EP2277116A1 (fr) |
WO (1) | WO2009135511A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102012109096A1 (de) | 2012-09-26 | 2014-03-27 | Iqser Ip Ag | Method for the sequential provision of data representing personalized information, in particular in the form of videos and the like, in particular for a personalized television program |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050149873A1 (en) * | 2003-12-15 | 2005-07-07 | Guido Patrick R. | Methods, systems and computer program products for providing multi-dimensional tree diagram graphical user interfaces |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6999959B1 (en) * | 1997-10-10 | 2006-02-14 | Nec Laboratories America, Inc. | Meta search engine |
US7047242B1 (en) * | 1999-03-31 | 2006-05-16 | Verizon Laboratories Inc. | Weighted term ranking for on-line query tool |
US6411962B1 (en) * | 1999-11-29 | 2002-06-25 | Xerox Corporation | Systems and methods for organizing text |
US7284191B2 (en) * | 2001-08-13 | 2007-10-16 | Xerox Corporation | Meta-document management system with document identifiers |
US20060004732A1 (en) * | 2002-02-26 | 2006-01-05 | Odom Paul S | Search engine methods and systems for generating relevant search results and advertisements |
US7280957B2 (en) * | 2002-12-16 | 2007-10-09 | Palo Alto Research Center, Incorporated | Method and apparatus for generating overview information for hierarchically related information |
WO2005050471A2 (fr) | 2003-11-22 | 2005-06-02 | Wurzer Joerg | System and device for processing data |
US7698267B2 (en) * | 2004-08-27 | 2010-04-13 | The Regents Of The University Of California | Searching digital information and databases |
US7865495B1 (en) * | 2004-10-06 | 2011-01-04 | Shopzilla, Inc. | Word deletion for searches |
US7630980B2 (en) * | 2005-01-21 | 2009-12-08 | Prashant Parikh | Automatic dynamic contextual data entry completion system |
WO2009030246A1 (fr) * | 2007-09-03 | 2009-03-12 | Iqser Ip Ag | Detection of correlations between data representing information |
-
2008
- 2008-05-08 EP EP08758423A patent/EP2277116A1/fr not_active Withdrawn
- 2008-05-08 WO PCT/EP2008/003723 patent/WO2009135511A1/fr active Application Filing
-
2010
- 2010-11-08 US US12/941,818 patent/US8745069B2/en active Active
Non-Patent Citations (1)
Title |
---|
See also references of WO2009135511A1 * |
Also Published As
Publication number | Publication date |
---|---|
US8745069B2 (en) | 2014-06-03 |
WO2009135511A1 (fr) | 2009-11-12 |
US20110113043A1 (en) | 2011-05-12 |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR |
AX | Request for extension of the european patent | Extension state: AL BA MK RS |
17P | Request for examination filed | Effective date: 20101202 |
RIN1 | Information on inventor provided before grant (corrected) | Inventor name: MAGNUS, CHRISTIAN; Inventor name: WURZER, JOERG |
DAX | Request for extension of the european patent (deleted) | |
17Q | First examination report despatched | Effective date: 20160527 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
APBK | Appeal reference recorded | Free format text: ORIGINAL CODE: EPIDOSNREFNE |
APBN | Date of receipt of notice of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA2E |
APBR | Date of receipt of statement of grounds of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA3E |
APAF | Appeal reference modified | Free format text: ORIGINAL CODE: EPIDOSCREFNE |
RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: IQSER IP GMBH |
APBT | Appeal procedure closed | Free format text: ORIGINAL CODE: EPIDOSNNOA9E |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
18W | Application withdrawn | Effective date: 20210325 |