NL1015151C2

NL1015151C2 - Device and method for cataloging textual information.

Info

Publication number: NL1015151C2
Application number: NL1015151A
Authority: NL
Inventors: Barend Mons
Original assignee: Collexis B V
Priority date: 2000-05-10
Filing date: 2000-05-10
Publication date: 2001-12-10
Also published as: AU5686201A; WO2001086499A3; WO2001086499A2

Description

Device and method for cataloging textual information

The invention relates to a method for cataloging, or searching in, textual information, and to a device for carrying out that method.

Such a method is generally known. For example, it is known to catalog textual information by means of so-called "natural language processing". In that approach, the relationship between words is analyzed on the basis of the applicable rules of grammar, and in this way relationships between words that together form concepts are recognized. However, such an analysis requires a great deal of computing time, is complex and is language-dependent.

In addition, methods are known in which all words, with the exception of stop words, are indexed individually. However, a great deal of information is lost in the process, especially with regard to concepts composed of several words.

Other known methods, as described in, among others, US-5,931,907, US-5,754,938, EP-A-860,785 or WO-A-00/17781, make use of keywords. A disadvantage of these is that using the wrong keywords when searching can lead to information being missed. In addition, a keyword may be used in, for example, a document that has nothing to do with the documents sought. For example, searching with the keyword "xenotransplantation" can miss references in which the term "xenographic procedure" is used. Conversely, truncating the search term to "xeno" can produce far too many irrelevant hits.

Still other known methods, as described in, among others, WO-A-98/38560, make use of automatically generated word clusters and terms. In such methods, coherence and relationships between words are recognized by processing a very large number of texts. When certain words often occur together, these words can be recognized as belonging to a single concept.

As a result, the known methods are either too slow or too inaccurate to be used, for example, interactively and possibly by an inexperienced user.

It is an object of the invention to provide a method of the kind mentioned in the preamble that is suitable, among other things, for interactive use, for use by an inexperienced user, or for use in a distributed environment such as the internet or an intranet.

These objects are achieved, and other advantages are obtained, with a method for cataloging textual information or for generating a knowledge profile from it, wherein: a user enters textual information into a computer that is provided with software; the software is provided with a routine that divides the textual information into words, and with a routine that looks up the words of the textual information in at least one structured data file present in memory means of the computer, which structured data file comprises words with, for each word, references to concepts; the software is provided with a routine that, for each word in the textual information, searches for all corresponding words in the structured data file and then, for each word, links the related concepts from the structured data file; the software is provided with a routine that subsequently clusters concepts, by means of clustering, into a list of keywords or overarching keywords; and the software is provided with a routine that subsequently presents the list of keywords interactively to the user as a knowledge profile or category.

The invention further relates to a computer system provided with input means, output means and connection means for connection to other computer systems, wherein the computer system is arranged to carry out the method according to one or more of the preceding claims.

Because the method according to the invention makes use of structured data files when cataloging, it is possible to catalog textual information in a fast, efficient manner that is suitable, if desired, for interactive use by an inexperienced user, for example via a distributed environment such as the internet or an intranet.

According to the invention, cataloging is understood to include indexing textual information and making textual information findable. The concepts or keywords that are linked to the textual information may, for example, be abstract representations of several words. One example is the International Patent Classification (IPC) system, in which concepts, often several words long, are denoted by a code.

In one embodiment of the present invention, the concepts in the structured data file refer to keywords, and the clustering routine subsequently clusters those keywords. Several concepts in the structured data file may refer to the same keyword, and a single concept may refer to several keywords.

Preferably, the structured data file is a structured word list such as a thesaurus or metathesaurus. In the remainder of this text, wherever the word thesaurus is used, metathesaurus is implicitly meant as well. In thesauri, concepts are arranged according to a hierarchical system of overarching or generic concepts with increasingly specific concepts beneath them. This produces a kind of tree structure of higher-level, overarching concepts with branches to increasingly specific concepts. In the method according to the invention, thesauri from different fields of knowledge can be used if desired, and these thesauri can be combined into one large thesaurus.
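By way of illustration only, the tree structure of a thesaurus and the combining of thesauri from different fields could be sketched in Python as follows. The concept names, the single-parent representation and the function names `combine` and `broader_concepts` are assumptions made for this sketch; the description does not prescribe any particular data layout.

```python
# Hypothetical toy thesauri: each concept maps to its broader (parent) concept,
# or to None when it is a top-level concept. Assumes one parent per concept.
MEDICAL = {
    "xenotransplantation": "transplantation",
    "transplantation": "surgery",
    "surgery": None,
}
BIOLOGY = {
    "pig": "mammal",
    "mammal": "animal",
    "animal": None,
}


def combine(*thesauri):
    """Combine several thesauri into one large thesaurus (later entries win)."""
    combined = {}
    for thesaurus in thesauri:
        combined.update(thesaurus)
    return combined


def broader_concepts(thesaurus, concept):
    """Return the chain of increasingly generic concepts above `concept`."""
    chain = []
    parent = thesaurus.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = thesaurus.get(parent)
    return chain


if __name__ == "__main__":
    big = combine(MEDICAL, BIOLOGY)
    print(broader_concepts(big, "xenotransplantation"))
    print(broader_concepts(big, "pig"))
```

A real thesaurus may give a concept more than one broader concept; the single-parent dictionary is used here only to keep the sketch short.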

Preferably, clustering takes place on the basis of relationships within sentences. This allows a very fast method, while experiments have shown that very good accuracy is nevertheless achieved. During clustering, a search is made for common concepts, and possibly corresponding overarching concepts, to which words in a sentence refer. This is repeated until no more common concepts are found.

To determine which concepts in the list of concepts will be shown to a user, all concepts in the list are given weights that indicate their relative importance. Preferably, the weights comprise quantities relating to the frequency with which the concepts occur in the textual information, the specificity of the concepts, and a measure of the certainty that the concept occurs in the textual information (i.e. the sensitivity). The weights reflect statistical properties of each concept. These take into account, among other things, the specificity, the sensitivity, the number of alternatives that occur in the textual information, and the textual similarity. On the basis of the weights it can be determined, for example, which concepts from the list are shown to the user. To increase the accuracy still further, it is preferable for the user to be able to adjust the weights interactively, specifically the weights of the concepts shown to him relative to one another. To make the textual information easy to retrieve, the list of concepts includes, for each concept, information on where the textual information can be found. As a result, the list of concepts can be stored in a data file together with that location, and the textual information can easily be retrieved. Moreover, because references are stored, the complete textual information need not be stored. Preferably, the location information is a hyperlink. This produces a distributed data file in which the textual information may even be spread over a very large number of different computers.

The invention further relates to a method for building and maintaining knowledge and/or interest networks, wherein textual information is cataloged according to the method for cataloging textual information described above, and wherein the list of concepts is linked to information identifying the user, preferably a hyperlink or e-mail address.

In addition, the invention relates to a method for searching in textual data files, wherein textual input is cataloged according to the method for cataloging textual information described above, after which a search is made for the position in the textual data file that statistically shows the greatest agreement with the list of concepts.
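The description does not state which statistical measure of agreement is used. Purely as an assumption, the sketch below uses cosine similarity between weighted concept lists to find the stored entry that agrees best with a query. The function names and the example profiles are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two {concept: weight} dictionaries."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_profile, indexed_profiles):
    """Return the key of the stored profile that agrees best with the query."""
    return max(
        indexed_profiles,
        key=lambda key: cosine_similarity(query_profile, indexed_profiles[key]),
    )


if __name__ == "__main__":
    # Hypothetical index: location of a text -> its weighted list of concepts.
    index = {
        "doc1": {"transplantation": 1.0},
        "doc2": {"thesaurus": 1.0, "indexing": 0.5},
    }
    print(best_match({"indexing": 1.0}, index))
```

Any other measure of agreement between two weighted lists could be substituted for `cosine_similarity` without changing the rest of the sketch.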


The method according to the invention proves very suitable for use in interactive environments and interactive applications, and in particular for interactive internet or intranet applications. For interactive applications, and more specifically for internet and intranet applications, what matters is the speed and the amount of data that different computers, for example a server and a user's computer, must exchange to arrive at a desired result.

After a first piece of textual information has been analyzed, a user can process a second piece of textual information. The two resulting lists are then combined into one list by combining the concepts in the lists on the basis of their weights.
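How the weights of two lists are combined is not specified in the description. The sketch below assumes, purely for illustration, that the weights of concepts occurring in both lists are added and that the result is then rescaled so that all weights sum to 1. The function name `merge_concept_lists` is hypothetical.

```python
def merge_concept_lists(first, second):
    """Merge two {concept: weight} lists into one.

    Assumption for this sketch: weights of shared concepts are summed and
    the merged list is rescaled so that its weights add up to 1.
    """
    merged = dict(first)
    for concept, weight in second.items():
        merged[concept] = merged.get(concept, 0.0) + weight
    total = sum(merged.values())
    if total == 0:
        return merged
    return {concept: weight / total for concept, weight in merged.items()}


if __name__ == "__main__":
    list_one = {"transplantation": 1.0, "pig": 1.0}
    list_two = {"pig": 2.0}
    print(merge_concept_lists(list_one, list_two))
```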

One possible application for which the method according to the invention is particularly suitable is the interactive building and maintenance of knowledge and/or interest networks, in particular via an intranet or the internet. In that case, software for carrying out the method according to the invention is present on a server, and a user can access the software on that server. After entering personal details, the user is enabled to transfer textual files that he has selected to the server. These may be files he has written himself, such as a curriculum vitae or, better still, a longer text such as reports, a thesis, a dissertation, articles or the like. In that case a knowledge profile is created. The textual files may also be articles that reflect the user's area of interest. In that case the result is an interest profile. By storing the interest profiles or knowledge profiles of a very large number of people and making them searchable, for example by storing them in a database, a knowledge or interest network is created.

On the server, the software indexes the textual information according to the method of the invention and presents a list of concepts to the user. The user is then enabled to adjust the list, for example by changing the weight assigned to each concept. This adjustment can be made in various interactive ways. For example, spider-web diagrams can be used, in which the various concepts are arranged radially around a common center. By dragging a concept along its radial axis, for example using input means such as a mouse, keys, a trackball, a touchpad or the like, the relative weight of the selected concept can be changed. Another possibility is to plot the concepts on a bar chart and to enable the user to set the length of the individual bars using the input means already mentioned. The user can then save the list of concepts, together with the links to his textual information, locally on his own computer or on the server. If desired, the list can be supplied and added to a larger data file containing data from other users.

A user can also search a data file using the list of concepts and their weights. This search can be carried out interactively. The user then sees, for example, the number of hits. By interactively changing the weight of the various concepts, for example in the manner indicated above, the user immediately sees the number of hits change.
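The effect of changing weights on the number of hits can be illustrated with a small sketch. The scoring rule (sum of the weights of the query concepts present in a document) and the fixed threshold are assumptions chosen for the example. The description does not specify how a hit is decided.

```python
def score(document_concepts, query_weights):
    """Sum the weights of the query concepts that occur in the document."""
    return sum(
        weight
        for concept, weight in query_weights.items()
        if concept in document_concepts
    )


def count_hits(documents, query_weights, threshold=1.0):
    """Count documents whose score reaches the (assumed) hit threshold."""
    return sum(
        1
        for concepts in documents.values()
        if score(concepts, query_weights) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical data file: document id -> set of cataloged concepts.
    documents = {
        "d1": {"pig", "transplantation"},
        "d2": {"pig"},
        "d3": {"thesaurus"},
    }
    print(count_hits(documents, {"pig": 1.0, "transplantation": 0.5}))
    # The user lowers the weight of "pig"; the hit count is recomputed.
    print(count_hits(documents, {"pig": 0.5, "transplantation": 0.5}))
```

Because only a short list of weights has to be sent to the server and only a count has to come back, an arrangement of this kind would keep the data exchanged per adjustment small.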

A specific embodiment of the invention will be explained in more detail with reference to the figures. The figures serve to illustrate one or more embodiments of the invention and should not be construed as limiting it.

Figure 1 shows a schematic overview of a specific embodiment of the method according to the invention.


Figure 2 shows the relationship between users' identification data.

Figure 3 shows one possible way of building up a knowledge and/or interest network.

Figure 1 shows an example of an implementation of the method according to the invention. A text fragment, an entire article, or a series of articles is entered (1). This can be done, for example, by the user marking parts of a text or selecting text files on his computer using input means such as a mouse, and dragging the selected text files or marked text parts to an input screen that the software on the server displays on the user's computer screen.

The text fragment or text file, i.e. the textual information, is then first normalized (2). Stop words are removed (3), and words are reduced to their stem ("stemming"). This produces a list (4) of normalized words. This list is then compared (5) with concepts in a thesaurus or metathesaurus (6). It proved possible to make this comparison very fast by using a stem word list: an alphabetical list, derived from the thesaurus or metathesaurus, of the words in it reduced to their stem. Preferably, the stem words are arranged in an n-ary tree so that each word in the list of normalized words can be looked up quickly in the stem word list. For each word in the list of normalized words, all possible thesaurus concepts are looked up. In this way a list (7) is produced containing, for each word of the textual information, all possible thesaurus concepts. Words that do not occur in the thesaurus are omitted. The results are then analyzed (8). For each word in a sentence, it is checked whether there is a relationship with another word in the sentence, i.e. whether two or more words together form part of a concept that occurs in the thesaurus or metathesaurus. Higher-level thesaurus concepts are also looked up and, when several lower-level concepts refer to the same higher-level concept, they are replaced by that higher-level concept. This process is called clustering. Clustering is the search for common concepts to which adjacent words in the textual information refer. Clustering is in turn applied to the clusters found, until no more common concepts are found. If desired, clustering can first be carried out within sentences, that is, it is checked whether common concepts occur. This can be repeated until no further change occurs. Clustering can then optionally be carried out on the basis of adjacent sentences. After that, it can be analyzed whether concepts occur that are species of the same higher-level concept, i.e. that have a common genus. In that case the genus can be substituted. Optionally, this may be done only for the concepts that are presented to the user.
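The normalization, look-up and clustering steps of Figure 1 can be illustrated with a heavily simplified sketch. Everything concrete in it is an assumption: the stop-word set, the crude suffix-stripping `stem` function, the toy `STEM_INDEX` (a plain dictionary standing in for the stem word list, rather than an n-ary tree), and the single left-to-right pass over adjacent word pairs, which is only one step of the repeated clustering described above.

```python
import re

# Hypothetical stop words and stem word list (stem -> candidate concepts).
STOP_WORDS = {"the", "of", "a", "in", "and"}
STEM_INDEX = {
    "heart": {"heart", "heart transplantation"},
    "transplant": {"transplantation", "heart transplantation"},
    "pig": {"pig"},
}


def stem(word):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ations", "ation", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(sentence):
    """Lower-case, split into words, drop stop words, reduce words to stems."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]


def candidate_concepts(stems):
    """Per word, all concepts it may refer to; unknown words are omitted."""
    return [(s, STEM_INDEX[s]) for s in stems if s in STEM_INDEX]


def cluster_sentence(sentence):
    """One clustering pass: adjacent words sharing a concept are merged."""
    candidates = candidate_concepts(normalize(sentence))
    concepts = []
    i = 0
    while i < len(candidates):
        if i + 1 < len(candidates):
            shared = candidates[i][1] & candidates[i + 1][1]
            if shared:
                concepts.append(sorted(shared)[0])
                i += 2
                continue
        # No shared concept: keep the shortest candidate for the single word.
        concepts.append(min(candidates[i][1], key=lambda c: (len(c.split()), c)))
        i += 1
    return concepts


if __name__ == "__main__":
    print(cluster_sentence("The heart transplantation of pigs"))
```

In this sketch the words "heart" and "transplantation" both refer to the concept "heart transplantation", so they are clustered into that one concept, while "pigs" is kept as the separate concept "pig".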

Each concept found is given a weight. This weight is made up of, among other things, a value indicating where the concept lies on the scale from specific to general. These values are assigned to the concepts in the thesaurus in advance. The weight is also made up of the frequency with which the concept occurs in the textual information, and of a probability figure indicating how certain the software is that the concept corresponds to the words in the textual information. On the basis of the weights it is determined which concepts in the list of concepts are presented to the user. The selection criterion for this is adjustable.
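The description names the components of a weight but not the formula that combines them. The sketch below assumes, purely for illustration, that the three components are multiplied and that the adjustable selection criterion is a simple threshold. The class name `ConceptWeight`, the value ranges and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConceptWeight:
    specificity: float  # assigned in the thesaurus beforehand; assumed 0 (general) to 1 (specific)
    frequency: int      # number of occurrences in the textual information
    certainty: float    # assumed 0 to 1: how sure the match with the words is

    def combined(self):
        """Assumed combination rule: the product of the three components."""
        return self.specificity * self.frequency * self.certainty


def select_concepts(weights, threshold):
    """Return concepts at or above the adjustable threshold, heaviest first."""
    ranked = sorted(
        weights.items(), key=lambda item: item[1].combined(), reverse=True
    )
    return [concept for concept, w in ranked if w.combined() >= threshold]


if __name__ == "__main__":
    weights = {
        "heart transplantation": ConceptWeight(0.9, 3, 1.0),
        "surgery": ConceptWeight(0.2, 1, 0.5),
    }
    print(select_concepts(weights, threshold=0.5))
```

Lowering `threshold` shows more concepts to the user; raising it shows fewer, which is one simple way to make the selection criterion adjustable.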

A list of suggested concepts (9) is then presented to the user. The user is then able to adjust the list interactively (10). The user can subsequently send back the adjusted list (11). The adjusted list includes a link (12) to the original textual information, in the form of a hyperlink to the text, possibly on another computer, an address or e-mail address of the user, or in some other way. Preferably, however, it is a link to the textual information on another computer. As a result, not much data needs to be stored on the server. A relatively light server therefore suffices, so that the computer system according to the invention can be of light construction.

The method according to the invention can be used very advantageously for developing, maintaining and building up knowledge and interest networks of persons within organizations, between organizations and/or between persons. Figures 2 and 3 relate to this. To build and maintain such a network, knowledge and interest profiles of persons and organizations must be generated and linked to each other. The method according to the invention can support and implement this.

Figure 2 shows schematically which information items must be entered for this purpose, and what the interrelationships between the different information items are. For example, it is possible to include data concerning persons (20), such as the name, the organization where they work, an e-mail address and other data. In addition, data concerning the organization (21) can be included, such as contact details, but also an interest or knowledge profile (22). This profile may have been generated using the method according to the invention. In addition or instead, a knowledge or interest profile of the person (23) may be included, with links to textual information (24). This interest or knowledge profile may have been generated using the method according to the invention.
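The information items of Figure 2 could be modelled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, and only the reference numerals in the comments come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Source:
    """Textual information (24), held by reference rather than as a copy."""
    url: str


@dataclass
class Profile:
    """Interest or knowledge profile (22, 23): concepts with weights, plus links."""
    weights: Dict[str, float]
    sources: List[Source] = field(default_factory=list)


@dataclass
class Organization:
    """Data concerning the organization (21)."""
    name: str
    contact: str
    profile: Optional[Profile] = None


@dataclass
class Person:
    """Data concerning a person (20)."""
    name: str
    email: str
    organization: Organization
    profile: Optional[Profile] = None


org = Organization(name="Example Institute", contact="info@example.org")
person = Person(
    name="A. Researcher",
    email="a.researcher@example.org",
    organization=org,
    profile=Profile(weights={"malaria": 2.88}, sources=[Source("https://example.org/paper")]),
)
```

Both profiles are optional here because the description allows an organization profile, a personal profile, or both.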


Figure 3 shows one way in which a method for building and maintaining a knowledge network according to the invention can be implemented. A user (31) first enters textual information (sources) that, according to the user, relates to his expertise, such as articles and reports written by him, or to his interests, such as articles that bear directly on the area of interest. This textual information is processed in accordance with the method according to the invention, for example according to the scheme of Figure 1. This processing yields a knowledge or interest profile, linked to the user data and to the sources. The user adjusts the profile interactively (32). The profile is then placed in a queue (33). An authorization unit (34), being either an automated system or a person, checks the data and the profile for completeness and performs a validation before entering the data and the profile into a data file, i.e. a database (35). Upon entry into the database, the user receives a confirmation message (36). The database can be consulted by users.
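The flow of Figure 3 from queue to database can be sketched as below. The class `KnowledgeNetwork`, its method names and the completeness rule in `is_valid` are hypothetical; the description states only that an authorization unit checks completeness and validates before entry, and that a confirmation message follows.

```python
from collections import deque


class KnowledgeNetwork:
    def __init__(self):
        self.queue = deque()      # waiting queue (33)
        self.database = {}        # database (35), keyed by user name
        self.confirmations = []   # stand-in for confirmation messages (36)

    def submit(self, user, profile):
        """Place an interactively adjusted profile (32) in the queue."""
        self.queue.append((user, profile))

    @staticmethod
    def is_valid(user, profile):
        # Assumed rule for the authorization unit (34): both parts must be non-empty.
        return bool(user) and bool(profile)

    def process_queue(self):
        """Validate queued profiles, store the valid ones and confirm each entry."""
        while self.queue:
            user, profile = self.queue.popleft()
            if self.is_valid(user, profile):
                self.database[user] = profile
                self.confirmations.append(f"Profile for {user} stored")


net = KnowledgeNetwork()
net.submit("alice", {"malaria": 2.88})
net.submit("bob", {})  # incomplete: empty profile
net.process_queue()
```

In this sketch an invalid submission is silently dropped; a real system might instead return it to the user for correction, which the description does not specify.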


Claims

1. Method for cataloging textual information or generating a knowledge profile therefrom, wherein: a user enters textual information into a computer, further provided with software; the software includes a routine that divides the textual information into words, and a routine that looks up the words of the textual information in at least one structured data file contained in memory means in the computer, which structured data file comprises words having word-by-word references to concepts; the software is provided with a routine which searches per text in the textual information for all corresponding words in the structured data file and subsequently links the related concepts from the structured data file per word; the software is provided with a routine which then clusters concepts into a list of keywords or umbrella keywords by means of clustering; and the software is provided with a routine which subsequently presents the list of keywords to the user interactively as a knowledge profile or category.

2. The method of claim 1, wherein the terms in the structured data file refer to keywords, the clustering routine then clustering the keywords.

3. The method of claim 1, wherein the terms in the structured data file refer to keyword representations, the clustering routine then clustering the representations.

4. The method of claim 1, 2 or 3, wherein the structured data file is a thesaurus or metathesaurus.

5. A method according to any one of the preceding claims, wherein clustering takes place on the basis of relationships within sentences.

6. A method according to any one of the preceding claims, wherein the terms in the list of terms are provided with weights which indicate their relative importance.

7. Method according to claim 6, wherein the weights comprise quantities concerning the frequency with which the terms occur in the textual information, the specificity of the terms and a measure of the certainty of occurrence of the term in the textual information.

8. The method of claim 6 or 7, wherein the user can adjust the weights interactively.

9. Method as claimed in any of the foregoing claims, wherein the list of terms includes, per concept, information regarding the location of the textual information.

10. The method of claim 9, wherein the location information is a hyperlink.

11. A method for building and maintaining knowledge and/or interest networks, in which textual information is cataloged according to one or more of the preceding claims, and in which the list of terms is linked with information identifying the user, preferably a hyperlink or e-mail address.

12. Method for building up and maintaining knowledge and/or interest networks, in which a knowledge profile is determined from textual information according to any one of the preceding claims, and in which the list of keywords is coupled with information for identification of the user or the location of the textual information, preferably by means of a hyperlink or e-mail address.

13. Method according to claim 12, wherein a collection of knowledge profiles is stored in memory means of a computer which is remote from the user.

14. Method for searching in textual data files, wherein textual input is cataloged according to the method according to any one of the preceding claims, after which a position is searched in the textual data file which is statistically most similar to the list of terms.

15. Computer system comprising input means, output means and connection means for connection to other computer systems, wherein the computer system is provided with software provided with routines for performing the steps for the method according to any one of the preceding claims.

16. Software for controlling a computer, the software comprising routines for carrying out the method steps according to any one of the preceding claims.

17. Carrier provided with software for carrying out the method according to one or more of the preceding claims.

18. Device comprising one or more of the characterizing measures described in the description and/or shown in the drawings.

19. A method comprising one or more of the characterizing measures described in the description and/or shown in the drawings.