BE1018334A5

BE1018334A5 - METHOD AND SYSTEM FOR INTELLIGENTLY INDEXING DOCUMENTS OR TEXT IN A COMPLEX DATABASE BY AVOIDING "NOISE".

Info

Publication number: BE1018334A5
Application number: BE2008/0604A
Authority: BE
Original assignee: Group Dado 13 Bv Met Beperkte
Priority date: 2008-11-04
Filing date: 2008-11-04
Publication date: 2010-09-07
Also published as: BE1018996A3

Abstract

Method for faster retrieval of documents or texts in a complex database, characterized in that the display of worthless information or "noise" is avoided by creating indexes based on one or more "causal" relationships between the text data of the documents or texts concerned.

Description

Method and system for intelligent indexing of documents or texts in a complex database.

This invention relates to a method for intelligently indexing documents or texts in a complex database, as well as to a system for carrying out this method.

The invention is intended primarily for use in data systems that hold documents or texts which must later be retrievable on the basis of search criteria. More generally, however, the invention can be applied to any complex database in which documents or texts are stored.

In particular, the invention aims at a method for quickly retrieving documents or texts by means of a combination of several search criteria.

Indexing and subsequently retrieving data from documents or texts is generally known in the setting up of databases.

In general, there are three methods for converting textual data into indexes.

The first method is the simple or automatic "non-intelligent" indexing method. According to this method, words are automatically selected from a text by means of an evaluation system, and these words are integrated into an index.

The second method is the "manual-intelligent" indexing method. In this method, the person who indexes the documents or texts assigns one or more labels to each document, by means of which the document can be retrieved afterwards.

The third method is the automatic or "intelligent" indexing method. Here, the manual assignment of labels used in the second method is replaced by an automatic system.

It is clear that the quality of the method by which the correct documents or texts can afterwards be retrieved in a minimum of time depends on the criteria used for indexing. Among these criteria, two basic criteria can mainly be distinguished.

A first basic criterion is exhaustivity, meaning the extent to which the content of a given document is fully captured by the index. A second basic criterion is specificity, which determines the precision with which the documents or texts sought can be retrieved.

After all, the time needed to retrieve the right documents or texts depends on the way in which the indexes are chosen. To reduce the search time, it is therefore necessary to strike an optimal balance between the ability to retrieve the documents or texts and the precision with which they can be retrieved. It is therefore very important not to create the indexes exhaustively, in order to avoid that a search for documents or texts on a particular subject brings up a lot of worthless information. In such a case, the retrieved documentation is said to contain a lot of "noise". High precision means that only useful information is indexed, by assigning very precise labels to it.

Indexing is usually done either by "single-term indexing", in which indexes are assigned to single terms, that is, words, or by "term relationship indexing", in which indexes are assigned that take into account relationships between different concepts.

The known systems for indexing documents or texts have the disadvantage that they are all based primarily on statistical formulas and that they do not add indexes that are based on knowledge.

The present invention aims at a method for automatically creating the indexes of documents or texts in such a way that documents or texts can be retrieved more efficiently, so that the user can obtain the information sought very quickly, with the additional advantage that the search is carried out with very high precision, without any appreciable useless information or "noise".

To achieve this goal, the present invention provides, in the first place, a method for retrieving documents or texts in a complex database, in which criteria are applied for the retrieval that establish one or more relationships between the text data of the documents or texts concerned, characterized in that the aforementioned relationships consist of causal relationships.

The aforementioned causal relationships are used to assign indexes to the documents or texts entered, and the same causal relationships are used when searching, where these indexes are used to search automatically for causal or other relationships.

By establishing relationships, and more particularly causal relationships, the advantage is obtained that the semantic richness of a thesaurus can be used optimally for indexing documents or texts and/or for selecting documents or texts from a database during a search.

Preferably, the present invention is used together with one or more subject-oriented thesauri, more particularly thesauri relating to a specific field.

In a preferred embodiment, in addition to the aforementioned thesaurus or thesauri, a file is also built up and/or used in which causal relationships are recorded. This helps end users to look for causes and/or relationships in certain contexts.

The basic concept of the aforementioned method of the invention can be realized in practice in various ways.

With a view to better demonstrating the characteristics of the invention, a practical and preferred embodiment is described below as an example. This preferred embodiment uses a structure in which essentially five basic components can be recognized.

The first component is a synonym database with synonyms and related words. This synonym database is adaptable and allows new words, as well as new synonyms and equivalent terms, to be included.

The second component is a language parser that makes a syntactic analysis possible. The purpose of this component is to parse new documents or texts in order to index them automatically on the basis of semantic relationships, as a function of the specific specialization of the documents or texts.

The language parser automatically generates relevant indexes for each document or text.

The third component is the interactive query component. This component allows the user to enter a number of queries. These query means enable the system to check each time how many documents or texts are retrieved when a particular index is entered, as well as how many hits are found when different queries are combined.

The fourth component is a search method based on "causal" or relationship indexes. Relationship indexes are search terms entered by a user on the basis of knowledge of the subject area. The relationship indexes are combinations of indexes that are linked by causal relationships. They make it possible to pose a more specific query and to eliminate useless documents or texts from the result. The search result thus contains less "noise".

The fifth component is the interactive query component based on relationships. This component allows the user to enter a number of queries, whereby these query means enable the system to check each time how many documents or texts are retrieved when a particular index is entered, as well as how many hits are found when different queries are combined. They ensure that the user can search, on the basis of relationships, among the documents or texts that have been indexed on the basis of relationships.

The known search algorithms, applied to this set of components, ensure that search operations can be carried out quickly and efficiently in the specific subject area. An application of the method is described in detail below to clarify the invention.

Since synonyms and related terms are language-bound, the thesaurus is drawn up in one specific language. It is therefore necessary to build a thesaurus for the subject area and in the predetermined language. This thesaurus stores important terms from the subject area and, for each term, a number of synonyms or related terms. As a result, a search only has to consult a limited set of search terms, which increases the search speed.
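A minimal sketch of such a thesaurus is given here for illustration. The patent does not specify a data model; the class and method names (`Thesaurus`, `add`, `index_term`) and the sample terms are assumptions chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str                                   # the index term itself
    synonyms: set = field(default_factory=set)  # synonyms of the index term
    related: set = field(default_factory=set)   # related terms


class Thesaurus:
    """Hypothetical single-language thesaurus for one subject area."""

    def __init__(self, language):
        self.language = language
        self.entries = {}   # index term -> ThesaurusEntry
        self._lookup = {}   # any known word form -> index term

    def add(self, term, synonyms=(), related=()):
        term = term.lower()
        entry = self.entries.setdefault(term, ThesaurusEntry(term))
        entry.synonyms.update(s.lower() for s in synonyms)
        entry.related.update(r.lower() for r in related)
        for form in (term, *entry.synonyms, *entry.related):
            self._lookup[form] = term

    def index_term(self, word):
        """Return the index term for a word, or None if the word is unknown."""
        return self._lookup.get(word.lower())


thesaurus = Thesaurus("en")
thesaurus.add("corrosion", synonyms=["rust"], related=["oxidation"])
```

Because every synonym and related term resolves to a single index term, a search only needs to consult the comparatively small set of index terms.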

The following processes are applied in succession to a text or document that is to be indexed using this invention: (a) indexation on the basis of the thesaurus, (b) intervention of the operator in the indexing, and (c) causal indexation.

The indexation on the basis of the thesaurus (a) is done automatically, by having a program run through the text or document, identify the terms that occur in the thesaurus, and store them as indexes. The synonyms or related terms are not stored as separate indexes.
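Step (a) could be sketched as follows. This is an illustration under assumptions, not the patented implementation: the `LOOKUP` table, the function name `index_by_thesaurus`, and the naive word splitting are all hypothetical.

```python
import re

# Hypothetical thesaurus lookup: word form -> index term.
# Synonyms and related terms map onto the index term they belong to.
LOOKUP = {
    "corrosion": "corrosion",
    "rust": "corrosion",
    "humidity": "humidity",
    "moisture": "humidity",
}


def index_by_thesaurus(text, lookup):
    """Return (index terms found in the text, words unknown to the thesaurus)."""
    indexes, unknown = set(), set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in lookup:
            # A synonym is stored under its index term, not as a separate index.
            indexes.add(lookup[word])
        else:
            unknown.add(word)
    return indexes, unknown


indexes, unknown = index_by_thesaurus("Moisture causes rust on steel", LOOKUP)
```

The `unknown` set collects the words that the thesaurus does not yet contain, which is what the operator would review in step (b).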

The indexes that have not yet been included in the thesaurus are shown to the operator and then require the operator's intervention in the indexing (b). The operator checks whether the indexes are relevant. Irrelevant terms are not used. The operator adds the relevant terms to the thesaurus, either as a synonym or related term of an existing index, or as a new index. In this way the thesaurus grows in number of indexes, but also in synonyms and related terms.
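The three possible operator decisions in step (b) can be illustrated with a short sketch. The function `apply_decision`, the action names, and the dictionary layout are assumptions made for this example; the patent does not describe how the operator's choice is recorded.

```python
def apply_decision(thesaurus, term, action, target=None):
    """Apply one operator decision to a thesaurus.

    thesaurus maps each index term to its set of synonyms and related terms.
    action is "discard", "attach" (to the existing index `target`) or "new".
    """
    if action == "discard":
        return                          # irrelevant terms are not used
    if action == "attach":
        thesaurus[target].add(term)     # synonym or related term of an existing index
    elif action == "new":
        thesaurus.setdefault(term, set())   # becomes a new index
    else:
        raise ValueError(f"unknown action: {action}")


thesaurus = {"corrosion": {"rust"}}
apply_decision(thesaurus, "on", "discard")
apply_decision(thesaurus, "oxidation", "attach", target="corrosion")
apply_decision(thesaurus, "steel", "new")
```

After these three calls the thesaurus has grown both in indexes (`steel`) and in related terms (`oxidation` under `corrosion`).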

In addition to the usual word-based indexing, a causal indexation (c) is added. This indexation starts from the relationships and links described in the text and identifies, in the text or document, links of the type "from A and B follows C" or "from A and not B follows D". This indexation later makes it possible to search on such links.

The indexation takes place, on the one hand, on the basis of the order of certain conjunctions and phrases in a sentence, from which a relationship can be recognized, such as:

- If / in case of / with ... and ... then ...

- When using ... and ... then ...

- When use is made of ... and ... then ...

- ... and ... has the consequence that ...

- ... and ... generates ...
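Recognition of such sentence patterns could be approximated with regular expressions, as in the sketch below. The patent names only the sentence shapes, not a matching technique; the regular expressions, the English trigger words, and the function `extract_relation` are assumptions, and negated conditions ("not B") are not handled here.

```python
import re

# Hypothetical patterns for sentences of the form "if A and B then C"
# and "A and B generates C". Real text would need a proper parser.
CAUSAL_PATTERNS = [
    re.compile(
        r"^(?:when using|if|when)\s+(?P<a>.+?)\s+and\s+(?P<b>.+?),?\s+then\s+(?P<c>.+)$",
        re.IGNORECASE,
    ),
    re.compile(
        r"^(?P<a>.+?)\s+and\s+(?P<b>.+?)\s+(?:has the consequence that|generates)\s+(?P<c>.+)$",
        re.IGNORECASE,
    ),
]


def extract_relation(sentence):
    """Return (a, b, c) for a recognized 'A and B leads to C' sentence, else None."""
    cleaned = sentence.strip().rstrip(".")
    for pattern in CAUSAL_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return tuple(match.group(name).lower() for name in ("a", "b", "c"))
    return None


first = extract_relation("If humidity and salt, then corrosion.")
second = extract_relation("Heat and oxygen generates fire")
third = extract_relation("The pump is blue.")
```

Sentences that match none of the patterns return `None` and contribute no causal index.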

Such relationships are stored as multi-part indexes in a relationship thesaurus. Each multi-part index is built from the indexes of the thesaurus, so that a search can also bring the synonyms and related terms of each part of the index into the result.
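One way to picture the relationship thesaurus is as a mapping from a multi-part key to the documents that carry it. The sketch below is an assumption about the layout; the names `LOOKUP`, `relation_index`, and `store_relation` are hypothetical.

```python
from collections import defaultdict

# Hypothetical thesaurus lookup: word form -> index term.
LOOKUP = {
    "corrosion": "corrosion",
    "rust": "corrosion",
    "humidity": "humidity",
    "moisture": "humidity",
    "salt": "salt",
}

# Relationship thesaurus: (set of cause index terms, effect index term) -> document ids.
relation_index = defaultdict(set)


def store_relation(doc_id, causes, effect):
    """Store a causal relationship under thesaurus index terms, not raw words."""
    key = (frozenset(LOOKUP[c] for c in causes), LOOKUP[effect])
    relation_index[key].add(doc_id)
    return key


key1 = store_relation("doc1", ["moisture", "salt"], "rust")
key2 = store_relation("doc2", ["humidity", "salt"], "corrosion")
```

Because both documents are normalized to the same index terms before storage, they end up under one multi-part index even though they use different wording.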

The operator also has the option of recognizing relationships personally and assigning them to a text or document.

When searching the database of texts or documents, the user can thus search for words from the text, with the synonyms and related terms also included in the result. In addition, the user has a search option for searching on relationships between terms. This option allows the user to pose a query of the type "if A and B, then C", ... The system then uses the thesaurus to extract the indexes and, on the basis of these indexes, looks up a result in the relationship thesaurus.
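A query of the type "if A and B, then C" could be resolved as in the following sketch. The data, the fallback for words missing from the thesaurus, and the function `search_relation` are assumptions made for illustration.

```python
# Hypothetical thesaurus lookup: word form -> index term.
LOOKUP = {"rust": "corrosion", "corrosion": "corrosion",
          "moisture": "humidity", "humidity": "humidity"}

# Hypothetical relationship thesaurus: (cause index terms, effect index term) -> documents.
RELATION_INDEX = {
    (frozenset({"humidity", "salt"}), "corrosion"): {"doc1", "doc2"},
    (frozenset({"heat", "oxygen"}), "combustion"): {"doc3"},
}


def to_index(word):
    """Map a query word to its index term; unknown words are used as typed (assumption)."""
    return LOOKUP.get(word.lower(), word.lower())


def search_relation(causes, effect):
    """Resolve the query words via the thesaurus, then look up the relationship."""
    key = (frozenset(to_index(c) for c in causes), to_index(effect))
    return RELATION_INDEX.get(key, set())


hits = search_relation(["Moisture", "salt"], "rust")
```

The query uses "moisture" and "rust", yet it finds the documents indexed under "humidity" and "corrosion", because both sides pass through the same thesaurus.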

When the result is displayed, the user can choose to show only the texts or documents that carry the complete multi-part index, or also the texts and documents to which the multi-part index has been only partly assigned. This secondary option is needed to enable the user to find, in addition to the texts and documents on a subject, answers to open questions in the database.
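The difference between the two display options can be sketched as below. What counts as a "partly assigned" multi-part index is not defined in the text; this example assumes it means sharing at least one cause or the effect with the query. All names and data are hypothetical.

```python
# Hypothetical relationship thesaurus: (cause index terms, effect index term) -> documents.
RELATION_INDEX = {
    (frozenset({"humidity", "salt"}), "corrosion"): {"doc1"},
    (frozenset({"humidity"}), "mold"): {"doc2"},
    (frozenset({"heat", "oxygen"}), "combustion"): {"doc3"},
}


def search(causes, effect, include_partial=False):
    """Return matching document ids; optionally include partial matches."""
    wanted_causes = frozenset(causes)
    hits = set()
    for (index_causes, index_effect), docs in RELATION_INDEX.items():
        exact = index_causes == wanted_causes and index_effect == effect
        # Assumption: a partial match shares at least one cause or the effect.
        partial = bool(index_causes & wanted_causes) or index_effect == effect
        if exact or (include_partial and partial):
            hits |= docs
    return sorted(hits)


exact_hits = search({"humidity", "salt"}, "corrosion")
all_hits = search({"humidity", "salt"}, "corrosion", include_partial=True)
```

With partial matches enabled, a document that mentions only one of the causes is also returned, which is the kind of result that may answer an open question.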

In the example described, the thesaurus is drawn up in one language. By translating the thesaurus into another language, the texts and documents also become searchable in other languages, without additional indexation of the texts and documents.
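The effect of a translated thesaurus can be illustrated with a small sketch: the document index stays untouched, and only a mapping from foreign-language words to the existing index terms is added. The Dutch sample words, `DOC_INDEX`, `TRANSLATED`, and `search_translated` are assumptions for this example.

```python
# Existing document index, built once: index term -> document ids.
DOC_INDEX = {"corrosion": {"doc1"}, "humidity": {"doc1", "doc2"}}

# Hypothetical translated thesaurus: Dutch word form -> existing index term.
TRANSLATED = {"corrosie": "corrosion", "roest": "corrosion", "vochtigheid": "humidity"}


def search_translated(word):
    """Search with a foreign-language word without re-indexing any document."""
    term = TRANSLATED.get(word.lower())
    return DOC_INDEX.get(term, set())


dutch_hits = search_translated("Roest")
```

Only the thesaurus had to be translated; the documents themselves were indexed once.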

This invention is not limited to the example described, but also applies to systems that make use of the method.

Claims

Method for faster retrieval of documents or texts in a complex database, characterized in that the display of worthless information or "noise" is avoided by creating indexes based on one or more "causal" relationships between the text data of the documents or texts concerned.

Method according to claim 1, characterized in that an additional index is built up in which the frequency of consultation of the information is recorded, so that the display can take place as a function of this frequency.

Method according to claim 1, characterized in that the manager of the information assigns an additional index to each document or text on the basis of importance, so that the information can be displayed as a function of this additional criterion.

Method according to claim 1, characterized in that the indexes and search criteria are based on the complete texts of said documents or texts.

Method according to claim 4, characterized in that at least a filtering of the text data is carried out by eliminating stop words and determining explicit index terms with the aid of the unigrams and/or bigrams and/or trigrams occurring in the text.

Method according to claim 5, characterized in that the bigrams and/or trigrams are built up by determining, after the stop words have been removed and starting from the retained unigrams, which terms are adjacent to them.

Method according to claim 2, characterized in that use is made of an indexation on the basis of the complete text excluding the stop words.

Method according to claim 7, characterized in that use is made of an additional index of the importance of the documents or texts.

Method according to claim 1, characterized in that an additional index is made on the basis of the complete documents or texts, but limited to explicit index terms by comparing them with the content of a thesaurus.

Method according to claim 9, characterized in that a list is created for updating the thesaurus of the terms that do not occur in the thesaurus.

Method according to one of claims 5 to 8, characterized in that the indexing uses implicit index terms that are added to the explicit index terms, which added terms are taken from the thesaurus and may be both narrower and broader terms.

Method according to claim 10, characterized in that use is made of means that allow an interactive update by the user.

Method according to one of claims 2 to 12, characterized in that when indexing a document, the number of index terms is limited to a maximum of five.

Method according to one of claims 1 to 13, characterized in that use is made of the combination of a thesaurus and a limitation of the search terms to keywords included in the thesaurus.