FI120807B

FI120807B - Filtering of data units

Info

Publication number: FI120807B
Application number: FI20065591A
Authority: FI
Inventors: Juha Makkonen
Original assignee: Whitevector Oy
Priority date: 2006-09-26
Filing date: 2006-09-26
Publication date: 2010-03-15
Also published as: EP2080126A1; WO2008037848A1; FI20065591A; EP2080126A4; FI20065591A0

Description

Data object filtering

FIELD OF THE INVENTION

The present invention relates to data processing, and in particular to a method for filtering data objects and to a corresponding system.

BACKGROUND OF THE INVENTION

Data here refers to information that can be processed by an information system. Correspondingly, a data object refers to a set of data that the information system can identify as interrelated and thus process as a separately handled unit. The operations applied to a data object include filtering, in which a specified property of the data object is analyzed and, based on the analysis, predetermined procedures are applied to the data object. Filtering is typically responsive to the similarity between one or more filtering criteria and the data object being analyzed. When the analysis indicates a match with a given filtering criterion, the data object is interpreted as relevant with respect to that filter.

In conventional systems, data objects are often text files containing words, and the filtering criteria consist of search terms given by users of information systems. The relevance of the text file being analyzed is determined by the matches between the search words and the words in the text file. With the abundance of digital information sources that are available or becoming available, such conventional systems become increasingly inefficient, especially in how well they suit continuous online monitoring of new topics.

In 2005 the number of publicly available web pages was estimated to approach 12 billion, up from an estimate of 800 million pages for 1999. The number of access-restricted non-public pages, such as intranet, extranet and subscribed material, was estimated to be 400-550 times that of the public web. The rate at which new pages are brought to the web is estimated at 8% per week, which in practice means that the size of the web doubles in less than three months. At the same time, pages disappear from the web, and after one year only 20% of them remain.
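The doubling figure can be checked with a short calculation. The sketch below assumes that the 8% weekly growth compounds and ignores the pages that disappear; the variable names are illustrative only.

```python
import math

weekly_growth = 0.08  # 8% new pages per week, as estimated in the text

# Under compound growth, (1 + g) ** n = 2 gives n = ln 2 / ln(1 + g).
doubling_weeks = math.log(2) / math.log(1 + weekly_growth)

# Convert to months using an average month length of about 4.35 weeks.
doubling_months = doubling_weeks / 4.35

print(f"doubling time: {doubling_weeks:.1f} weeks")
```

Under these assumptions the doubling time comes out at roughly nine weeks, which is consistent with "less than three months".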

Public material is accessed through portals and search engines, and indexing capabilities have therefore had a significant impact on the availability of information. Indexing, however, involves a delay that varies from operator to operator; the update interval is typically between one day and a few tens of days. At best, such delayed approaches work for archived information flows, but they do not meet the need for more dynamic monitoring, for example daily monitoring. Yet practically all news agencies and newspapers publish news online, and new stories are published around the clock. Some news material is sold specifically as a web feed, while most daily material is published on a web page, a digital analogue of the traditional newspaper. Some web portals contain news from news agencies and news sites, thus forming a collection of daily reports. Discussion forums are updated when participants add comments, which can happen several times a day.

It is therefore clear that information is produced faster than it is indexed. This means that relevant information goes unnoticed for some time before it is properly indexed, or that after indexing the relevant information may even be buried among a large number of less relevant and completely irrelevant documents.

The publication Makkonen J. et al.: "Simple semantics in topic detection and tracking information retrieval", from 2004, describes a preliminary procedure for finding out whether a single document relates to an earlier topic.

SUMMARY OF THE INVENTION

It is thus an object of the present invention to provide a solution that alleviates the above disadvantages. The objects of the invention are achieved by a method, a system, a computer program product and a computer program distribution medium, which are characterized by what is stated in the independent claims. Preferred embodiments of the invention are described in the dependent claims.

The invention is based on the idea that the semantic relevance of a target data object is determined by comparing it with a set of predetermined reference terms. The terms are associated with semantic classes, and the comparison is performed in all applicable semantic classes using a similarity function specific to each semantic class. The comparison produces a result value, which is used to select a function from a predetermined set of functions associated with particular result values.
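The idea can be sketched roughly as follows. This is a minimal illustration, not the claimed method: the class names, the aggregation rule (the mean of the best pairwise similarity per class), the threshold and the `forward`/`discard` actions are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]
Terms = Dict[str, List[str]]  # semantic class name -> terms in that class


def exact(a: str, b: str) -> float:
    """Simplest class-specific similarity: 1 for identical terms, else 0."""
    return 1.0 if a == b else 0.0


def result_value(target: Terms, reference: Terms, sims: Dict[str, Similarity]) -> float:
    """Compare target and reference terms class by class.

    Assumed aggregation: best pairwise similarity per applicable class,
    averaged over those classes.
    """
    scores = []
    for cls, sim in sims.items():
        t_terms = target.get(cls, [])
        r_terms = reference.get(cls, [])
        if not t_terms or not r_terms:
            continue  # class is not applicable to this comparison
        scores.append(max(sim(t, r) for t in t_terms for r in r_terms))
    return sum(scores) / len(scores) if scores else 0.0


def forward(obj_id: str) -> str:
    return f"forward {obj_id}"  # hypothetical action for relevant objects


def discard(obj_id: str) -> str:
    return f"discard {obj_id}"  # hypothetical action for irrelevant objects


def select_function(value: float, threshold: float = 0.5) -> Callable[[str], str]:
    """Pick the follow-up function associated with the result value."""
    return forward if value >= threshold else discard


sims = {"names": exact, "misc": exact}
target = {"names": ["Helsinki"], "misc": ["cat"]}
reference = {"names": ["Helsinki"], "misc": ["dog"]}
value = result_value(target, reference, sims)
print(select_function(value)("doc-1"))
```

Keeping one similarity function per class means a class such as locations could later use a graded measure without touching the rest of the comparison.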

An advantage of the invention is that it enables fast semantic filtering of data objects and improves the accuracy of filtering decisions compared with prior art solutions.

BRIEF DESCRIPTION OF THE FIGURES

In the following, the invention is described in more detail by means of preferred embodiments with reference to the following drawings, in which

Figure 1 shows an embodiment of a system according to the invention;

Figure 2 shows a term space comprising semantic classes;

Figure 3A shows the positioning of terms in a flat hierarchical structure;

Figure 3B shows the corresponding simplified hierarchical structure for the location terms of Figure 3A;

Figure 4 shows how the similarity decision is based on the sign of the distance from a hyperplane;

Figure 5 shows a procedure model of the operation of the embodied system;

Figure 6 shows the steps of an embodiment of the method according to the invention;

Figure 7 shows the dynamic adjustment of the filtering.

DETAILED DESCRIPTION OF THE INVENTION

The following embodiments are exemplary implementations of the present invention. Although the specification may refer to "one" embodiment or "some" embodiments, the reference is not necessarily to the same embodiment or embodiments, and/or a feature does not apply to a single embodiment only. Individual features of the different embodiments of this specification may be combined to provide further embodiments.

Figure 1 shows an embodiment of a system according to the invention. The structural block diagram of Figure 1 represents a system 10 suitable for receiving data objects, analyzing the received data objects and determining, on the basis of the analysis, an action for further processing of the data object. A data object here refers to a set of data that the information system can identify as interrelated and thus process as a separately handled unit. A data object may comprise, for example, a bit stream of one or more bits, a set of one or more bit streams, a record in a file, a set of records in a file, an entire file, or the like. The data may be of any media type, including text, audio, video and the like.

The system 10 comprises an interface unit 11 that provides an input interface 12 for inputting data objects to the system 10 for processing, and an output interface 13 for outputting data from the system 10 to connected systems and/or processes. The implementation of the interface unit 11 varies by application: it may be a simple interface, for example to a physical connection, or a versatile functional element that provides access to several network nodes over different types of communication interfaces. In networked systems, the client/server model offers a convenient way to interconnect sources of data objects and users distributed over different locations of the system. Preferred system architectures comprise at least client/server, master/slave and peer-to-peer structures.

The system 10 further comprises a control unit 14, an element that comprises an arithmetic logic unit, a number of special registers and control circuits. Connected to the control unit 14 is a memory unit 15, a data medium in which computer-readable data, programs or application data can be stored. The memory unit typically comprises memory that can be both read and written (RAM) and memory whose contents can only be read (ROM).

Preferably, the system 10 also comprises a user interface unit 16 with an input unit 17 for inputting user data to the internal processes of the system, and an output unit 18 for outputting user data from the processes of the system 10. Examples of such input units include a keyboard, a touch screen, a microphone, or the like, or a combination thereof. Examples of such output units include a display, a touch screen, a microphone, or the like, or a combination thereof.

The interface unit 11, the control unit 14, the memory unit 15 and the user interface unit 16 are electrically interconnected for systematic execution of operations on received and/or stored data according to predefined, essentially programmed procedures. In solutions according to the invention, these operations comprise the functionalities for implementing the operations and interfaces of the system 10. These operations are described in more detail with reference to Figures 2-6.

The operation of the system is based on semantic analysis of the input data objects on the basis of a selected semantic environment. A data object processed by the system comprises a set of one or more terms. A term is a unit of data that has a meaning: a message that the occurrence of the term signifies, expresses or emphasizes. The system is provided with a set of one or more reference terms. During operation of the system, the meaning of the terms in the data object, hereinafter called target terms, is compared with the meanings of the reference terms, and the subsequent routing action is determined on the basis of the comparison result.

In different semantic environments, terms may have different meanings. Using an analogy from card games, it is generally known that a card has a meaning determined by the rules of the game. The meaning of a card does not depend directly on its name but on its relation to the other cards in the game. The rules of the game define the relations between the cards, and the role of colours, suits and values may vary from game to game. The game thus forms a semantic environment, a hand of cards represents a data object, and a single card corresponds to a term. Clearly, because of the different rules, a good hand in one game can be a fiasco in another. Correspondingly, the interpretation of a term may be based on different criteria, and the meaning of a term depends on how it is interpreted in each semantic environment.

The exemplary system embodied in Figure 1 operates on data objects that are input to the system in the form of a bit stream. A data object may be loaded into the system through the interface unit from an external source, for example an external database or a removable data medium, or it may be retrieved from an internal database stored in the memory unit. The control unit interprets the bit stream of the data object, according to a data protocol predefined in the system, as an audio, video or text data object, or the like, and identifies the terms formed by these data objects.

For example, a bit stream corresponding to a text data file can typically be mapped to characters. The embodied system analyzes strings of one or more characters in the text data file and extracts the terms that correspond to recognizable strings. A space may be interpreted as a character, and a term may correspond to a string that contains one or more spaces. If a word equals a string without spaces, a term may correspond to one word or to a combination of words, spaces and other symbols used between them. In languages that do not use white space as a word boundary, for example Chinese, Japanese and Korean, words may be divided and combined into terms in different ways. Extraction of terms from documents is routinely used in document indexing and is generally familiar to a person skilled in the art. The procedure chosen for producing the terms is not, as such, relevant to the invention.
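As an illustration of terms that span several words, the following sketch extracts terms by greedy longest match against a known vocabulary. It is only one of many possible procedures; the function name, the vocabulary and the `max_words` limit are assumptions, and punctuation handling is deliberately left out.

```python
from typing import List, Set


def extract_terms(text: str, vocabulary: Set[str], max_words: int = 3) -> List[str]:
    """Return the vocabulary terms found in text, preferring longer terms."""
    words = text.split()
    terms: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a multi-word term wins
        # over a shorter term that it contains.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocabulary:
                terms.append(candidate)
                i += n
                break
        else:
            i += 1  # no known term starts at this word; skip it
    return terms


vocabulary = {"week", "last week", "St. Helena"}  # hypothetical term list
found = extract_terms("He visited St. Helena last week", vocabulary)
print(found)
```

Because the longest candidate is tried first, 'last week' is extracted as one term rather than as the shorter term 'week'.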

It should be noted that the examples given are intended to illustrate possible implementations and should not be used to limit the interpretation of the scope of protection. Data objects and terms need not be in the form of bit streams; they can be stored and transferred in any other form that the system recognizes. In addition to text data, any type of communication media, for example audio and video data, can also be utilized in implementing the system.

In the embodied system, which uses character strings, term extraction may use prior art parsers capable of syntactic, morphological and functional dependency parsing. For example, various term extractors are available for English-language documents.

The simplest comparison between two terms is a one-to-one comparison. In conventional search engines, a text document is parsed into words and, where necessary, the words are reduced to a base form (for example 'Helsinkiin', 'Helsingissä' → 'Helsinki'). Words and their combinations (for example 'Helena' and 'St. Helena') are mapped to applicable terms, and a match between a search term and a term in the target document can be determined by a simple comparison of the character strings of the two terms. This is a straightforward method, but it typically produces a considerable number of hits and thus also of documents for the user to analyze. On the other hand, in any semantic environment there are sets of terms (for example 'cat' and 'tabby') that cannot be associated by this kind of one-to-one string similarity but that have a similar meaning. Consequently, documents that deal with similar topics but are written using a different vocabulary do not come up appropriately in conventional searches.
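The limitation can be seen in a small sketch of base-form reduction followed by string comparison. The lookup table stands in for a real morphological analyzer and is purely hypothetical, as are the function names.

```python
# Hypothetical base-form table; a real system would use a morphological analyzer.
BASE_FORMS = {
    "Helsinkiin": "Helsinki",
    "Helsingissä": "Helsinki",
}


def base_form(word: str) -> str:
    """Reduce an inflected word to its base form when the table knows it."""
    return BASE_FORMS.get(word, word)


def one_to_one_match(search_term: str, document_term: str) -> bool:
    """Conventional comparison: identical base-form strings or no match."""
    return base_form(search_term) == base_form(document_term)


print(one_to_one_match("Helsinki", "Helsingissä"))
print(one_to_one_match("cat", "tabby"))
```

Inflected forms of the same word match, but 'cat' and 'tabby' do not, even though their meanings are close.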

When operating on text data, the meaning of a term is determined by how people use the term when they communicate and successfully convey their ideas. Communication between people is based on the semantic environment of languages; each language has an inherent logic that determines the correct and incorrect use of words. The rules of a language form a predefined agreement known to those who know the language. The grammar of the language used provides a set of syntactic combination rules that are used to some extent in conventional search engines. The embodied system also recognizes ontological similarity between terms.

As shown in Figure 2, a semantic environment is associated with a term space W that contains a number of terms t, which represent a particular meaning and are thus relevant in the selected semantic environment. A system applying the semantic environment must be able to detect and interpret these terms in incoming data objects. For example, if the system of Figure 1 is arranged to operate on English-language text documents, it is able to identify strings such as 'week' or 'last week' in a data object and map these strings to appropriate terms in W.

In the system according to the invention, the term space W is divided into one or more semantic classes. Typically, semantic classes are defined such that the terms in a class share a particular semantic property. In Figure 2 this is illustrated with four exemplary classes, including N (names), L (locations) and M (miscellaneous). Terms in a semantic class may be fully identical (cf. 'cat' and 'cat'), nearly similar (cf. 'cat' and 'kitten') or remotely similar (cf. 'cat' and 'tiger'), such that a pair of terms in the same semantic class can be given a real value that indicates the similarity of the meanings of the terms in this particular semantic class. A semantic class is associated with a specific similarity function that, in a manner appropriate to the semantic class, returns on the basis of two input terms a value representing the similarity between the two terms.
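A graded similarity function for one semantic class might look like the sketch below. The relatedness values are invented for illustration; the text does not specify how such values are obtained.

```python
from typing import Dict, FrozenSet

# Hypothetical relatedness values for unordered term pairs in one class.
RELATEDNESS: Dict[FrozenSet[str], float] = {
    frozenset({"cat", "kitten"}): 0.9,  # nearly similar
    frozenset({"cat", "tiger"}): 0.4,   # remotely similar
}


def class_similarity(a: str, b: str) -> float:
    """Return a real value expressing how similar two terms of the class are."""
    if a == b:
        return 1.0  # fully identical terms
    # frozenset keys make the lookup symmetric in a and b.
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


print(class_similarity("cat", "kitten"), class_similarity("cat", "tiger"))
```

Unlike plain string comparison, this function can rank 'kitten' as closer to 'cat' than 'tiger' is.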

Thus we can define the term-wise similarity measure of a semantic class $S_i$ as a function $\sigma_i : S_i \times S_i \to \mathbb{R}$. The term-wise similarity maps every pair of terms in the semantic class onto the real axis.

The more strongly two terms are related, the higher their mutual similarity. For example, the conventional one-to-one comparison could be expressed as

$$\sigma_i(a, b) = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{otherwise.} \end{cases}$$

If we denote by $\Sigma$ the language comprising the semantic classes $S_1, S_2, \ldots, S_n$ and the similarity functions $\sigma_1, \sigma_2, \ldots, \sigma_n$, a data object using these classes and functions can be assembled into a $\Sigma$-structure. This means that in the system a set of reference terms and a set of target terms are structured according to the same semantic environment ($\Sigma$-structure): the terms are classified according to the same unary relations ($S_1, S_2, \ldots, S_n$) and their similarity is determined by the same similarity functions ($\sigma_1, \sigma_2, \ldots, \sigma_n$). Data objects of two $\Sigma$-structures can therefore be compared, because they share the structure defined by the language $\Sigma$.
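The pairing of semantic classes with class-specific similarity functions can be sketched as follows. This is a minimal illustration, not the claimed implementation: the name `SIGMA` and the choice to use the identity comparison for every class are assumptions made only for this example.

```python
from typing import Callable, Dict

# A term-wise similarity function maps a pair of terms to a real value.
Similarity = Callable[[str, str], float]


def identity_similarity(a: str, b: str) -> float:
    """One-to-one comparison: 1 if the terms are equal, 0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical language Sigma: each semantic class (N, L, M) is paired with
# its own similarity function. Here all three use the identity comparison;
# a real system could plug in a different function per class.
SIGMA: Dict[str, Similarity] = {
    "N": identity_similarity,
    "L": identity_similarity,
    "M": identity_similarity,
}

same = SIGMA["N"]("cat", "cat")         # identical terms
different = SIGMA["N"]("cat", "kitten")  # related, but not identical
```

Because both the target terms and the reference terms are interpreted through the same mapping, a comparison within a class always uses the same function on both sides.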

For example, several different types of similarity functions can be used in the exemplary class L. Data objects relevant to a particular location in Siberia may contain the terms 'Russia', 'Lena', 'Vilyuy', 'Lensk' and 'Yakutsk'. These terms clearly have nothing in common in a one-to-one comparison; their relevance cannot be understood without a geographical ontology. In this case the similarity function is based on a predefined structure that can be used to map the terms to a value representing their mutual relationship in the semantic class.

The value range of the similarity functions and of the corresponding similarity measures varies by application. Some similarity functions are briefly described below, without limiting the scope to these particular exemplary measures and values.

TFIDF

A widely used method for computing term-wise similarities is term frequency - inverse document frequency (TFIDF), in which the frequency of a term in a data object is multiplied by the informativeness of the term. The informativeness, or specificity, of a term is expressed as the logarithm of the inverse of its document frequency. If we denote by A the vector of a data object containing the terms $a_1, a_2, \ldots, a_n$ in semantic class $S_i$, the weight of a term $a$ occurring in vector A is

$$w^{TF}_A(a) = f_A(a) \cdot \log\frac{N_d}{d(a)},$$

where $f_A(a)$ is the number of times the term $a$ occurs in vector A, $N_d$ is the total number of documents, and $d(a)$ is the number of documents in which the term $a$ occurs. The term-wise similarity function is simply the product of the term weights. TFIDF is based on the identity $\sigma_i$, so the term-wise similarity is positive if and only if $a = b$.
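The TFIDF weighting described above can be sketched as follows. The function names, the toy document frequencies and the use of the natural logarithm are assumptions for illustration; the description does not fix the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_weight(term, vector_terms, doc_freq, n_docs):
    """Weight of a term: f_A(a) * log(N_d / d(a)).

    vector_terms: the terms of one data object in one semantic class.
    doc_freq:     d(a), number of documents containing each term.
    n_docs:       N_d, total number of documents.
    """
    frequency = Counter(vector_terms)[term]
    return frequency * math.log(n_docs / doc_freq[term])


def tfidf_similarity(a, b, terms_a, terms_b, doc_freq, n_docs):
    """Product of the two term weights, non-zero only for identical terms."""
    if a != b:
        return 0.0
    return (tfidf_weight(a, terms_a, doc_freq, n_docs)
            * tfidf_weight(b, terms_b, doc_freq, n_docs))


# Hypothetical background statistics and two small term vectors.
DOC_FREQ = {"zidane": 10, "match": 50}
N_DOCS = 100
TERMS_A = ["zidane", "zidane", "match"]
TERMS_B = ["zidane"]

weight = tfidf_weight("zidane", TERMS_A, DOC_FREQ, N_DOCS)
```

A rare term such as 'zidane' (10 of 100 documents) receives a larger weight per occurrence than a common term such as 'match' (50 of 100 documents).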

COVER

Starting from a taxonomy of terms, COVER is a similarity measure based on the common path of two terms relative to the lengths of the terms' paths. COVER is the function

$$\sigma_i(a, b) = j(a, b) \cdot f_A(a) \cdot f_B(b),$$

where B is the vector of reference terms containing the terms $b_1, b_2, \ldots, b_n$ in semantic class $S_i$, $f_A(a)$ and $f_B(b)$ are the frequencies of the terms $a$ and $b$, and $j(a, b)$ is the Jaccard coefficient of the lengths of the two paths:

$$j(a, b) = \frac{l(a \cap b)}{l(a) + l(b) - l(a \cap b)}.$$

As a simplified example, the geographical locations of the embodied system can be divided into a multi-level ontology. Figure 3A shows a simplified example of how terms in the semantic class L are placed in a four-level hierarchy. When the system detects a term $t_j$ in class L, it also determines the type of the place and assigns the term to the correct level in the structure of Figure 3A. The first level corresponds to a single node, the world; the second level corresponds to continents, the third level to countries and the fourth level to cities. It is clear that any level of detail and other types of information structures can be used within the scope of protection.

Figure 3B shows a corresponding simplified hierarchical structure for the place terms of Figure 3A. Each node in the tree corresponds to a place.

Given the simple geographical taxonomy of Figure 3B, the common path of the terms 'Paris' and 'Lyon' starts at the root and passes through the nodes 'Europe' and 'France'. The length of the common path is thus 2, and the path length of each individual term measured from the root is 3. The Jaccard coefficient of the terms is therefore 2/(3+3-2) = 0.5. As further examples, the similarity of the terms 'China' and 'Paris' is 0/(2+3-0) = 0, the similarity of 'Paris' and 'Germany' is 1/(3+2-1) = 0.25, and the similarity of 'Paris' and 'France' is 2/(2+3-2) = 0.67. Clearly, the higher the Jaccard coefficient, the closer the terms are taxonomically; they cover the same path.
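The worked values above can be reproduced with a short sketch. The `PARENT` table is a hypothetical stand-in for the taxonomy of Figure 3B, containing only the nodes named in the text, and the function names are illustrative.

```python
# Hypothetical child -> parent table for the place taxonomy (root: 'World').
PARENT = {
    "Europe": "World", "Asia": "World",
    "France": "Europe", "Germany": "Europe", "China": "Asia",
    "Paris": "France", "Lyon": "France",
}


def path_from_root(term):
    """Nodes below the root down to the term itself, e.g. Europe/France/Paris.

    The root itself is not a valid input in this sketch (its path is empty).
    """
    path = []
    while term != "World":
        path.append(term)
        term = PARENT[term]
    return path[::-1]


def jaccard(a, b):
    """Common path length divided by l(a) + l(b) - common path length."""
    path_a, path_b = path_from_root(a), path_from_root(b)
    common = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        common += 1
    return common / (len(path_a) + len(path_b) - common)


def cover(a, b, freq_a=1, freq_b=1):
    """COVER similarity: Jaccard coefficient scaled by the term frequencies."""
    return jaccard(a, b) * freq_a * freq_b


paris_lyon = jaccard("Paris", "Lyon")
paris_germany = jaccard("Paris", "Germany")
```

With unit frequencies, `cover` reduces to the Jaccard coefficient, so the four values quoted in the text can be checked directly.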

Resnik

To determine the semantic similarity of terms in an 'is-a' taxonomy, it has been proposed to use information content, a notion that derives from information theory. The information content of a term is its negative log-probability, $-\log p(a)$. If a term is highly probable (occurs frequently), its information content is small, and vice versa.

In this context, for example, the probabilities of place terms are simple relative frequencies:

$$p(a) = \frac{f(a)}{T},$$

where $f(a)$ is the frequency of the place term in a background corpus and $T$ is the total number of place terms. When the probabilities are computed, each occurrence of a place term also counts as an occurrence of every place term above it on the way to the root. For example, when the term 'Paris' is observed, it is also counted as an occurrence of the terms 'France' and 'Europe'. In this way the frequency of the root node ('World') is $T$ and its information content is therefore zero. Moreover, the probability can only increase (and the information content decrease) when moving upwards in the taxonomy tree.

As with COVER, the similarity of two terms is based on the common node farthest from the root, that is, the common node with the lowest probability and thus the highest information content. If $S(a, b)$ comprises the common path of the place terms $a$ and $b$, the term-wise similarity is the function

$$\sigma_i(a, b) = \max_{c \in S(a, b)} \left[-\log p(c)\right] \cdot f_A(a) \cdot f_B(b),$$

where $f_A(a)$ and $f_B(b)$ are the raw term frequencies of $a$ and $b$.
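The information-content measure can be sketched as follows. The taxonomy table and the background counts in `DIRECT` are invented for illustration; only the propagation of counts towards the root and the maximum over the common path follow the description.

```python
import math

# Hypothetical child -> parent table (root: 'World') and direct mention
# counts of place terms in a background corpus.
PARENT = {
    "Europe": "World", "Asia": "World",
    "France": "Europe", "Germany": "Europe", "China": "Asia",
    "Paris": "France", "Lyon": "France",
}
DIRECT = {"Paris": 6, "Lyon": 2, "Germany": 2, "China": 10}


def self_and_ancestors(term):
    """The term followed by every node above it, up to the root."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def propagated_counts(direct):
    """Each occurrence of a term also counts for all of its ancestors."""
    counts = {}
    for term, n in direct.items():
        for node in self_and_ancestors(term):
            counts[node] = counts.get(node, 0) + n
    return counts


COUNTS = propagated_counts(DIRECT)
TOTAL = sum(DIRECT.values())  # T; also equals the count of the root


def information_content(term):
    """-log p(term), with p(term) = f(term) / T."""
    return -math.log(COUNTS[term] / TOTAL)


def resnik(a, b, freq_a=1, freq_b=1):
    """Highest information content on the common path, times frequencies."""
    common = set(self_and_ancestors(a)) & set(self_and_ancestors(b))
    return max(information_content(c) for c in common) * freq_a * freq_b
```

In this toy corpus 'Paris' and 'Lyon' meet at 'France' (p = 0.4), 'Paris' and 'Germany' meet at 'Europe' (p = 0.5), and 'Paris' and 'China' share only the root, whose information content is zero.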

As discussed above, the similarity function of a semantic class preferably reflects similarity in a way that is meaningful for the purpose and/or subject of the terms in the semantic class. For example, instead of the tree-like taxonomy presented above, the similarity of geographical names may be based on their absolute geospatial distance. A comparison of the terms 'New York' and 'Newark' would then yield a higher similarity than a comparison of 'New York' with 'York' or 'Toronto', because 'Newark' is located closer than the other two. Accordingly, similarity may be based, for example, on domain-specific ontologies (e.g. a taxonomy of bacteria or of music genres), physical measures (e.g. spatial distance, temporal distance), statistical models (e.g. bigram models, distributional similarity), or product names (e.g. EMT64T could be a processor that is 'almost' similar to IA64). On the basis of the above, implementations using different types of similarity functions are apparent to a person skilled in the art.
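A distance-based similarity for place names could look like the sketch below. The description does not specify how distance is turned into a similarity value; the exponential decay, the `scale_km` parameter and the approximate city coordinates are all assumptions chosen for illustration.

```python
import math

# Approximate (latitude, longitude) in degrees; illustrative values only.
COORDS = {
    "New York": (40.7128, -74.0060),
    "Newark": (40.7357, -74.1724),
    "Toronto": (43.6532, -79.3832),
    "York": (53.9600, -1.0873),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geo_similarity(a, b, scale_km=100.0):
    """Assumed mapping: similarity decays exponentially with distance."""
    return math.exp(-haversine_km(COORDS[a], COORDS[b]) / scale_km)
```

Under this mapping 'Newark' scores higher against 'New York' than either 'Toronto' or 'York' does, even though 'York' is the closest match as a character string.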

In the embodied system a data object may comprise identifiable terms in one or more semantic classes, and for the purposes of analysis the terms can be written as a combination of vectors, each vector being specific to one semantic class. The combination of vectors corresponding to a data object is hereinafter called a multivector. Since each semantic class has its own similarity measure, two multivectors are compared separately in each semantic class. Whereas a comparison of two conventional document vectors yields a single real-valued similarity, the result of a class-wise comparison of two multivectors is a similarity vector v.

The similarity $\Delta_k$ of two vectors $A_k$ and $B_k$ of semantic class $S_k$ can thus be determined from the equation

$$\Delta_k(A_k, B_k) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} \sigma_k(a_i, b_j)}{\lVert A_k \rVert \, \lVert B_k \rVert},$$

where $m$ is the number of terms in the k-th semantic class and $\sigma_k(a_i, b_j)$ is the class-specific pairwise similarity of the terms $a_i$ and $b_j$. If $\sigma_k$ is based on identity, the function $\Delta_k$ corresponds to the cosine of the two vectors.
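A class-wise vector comparison of this kind might be sketched as below. Representing a class vector as a term-to-weight dictionary, keeping the weights separate from the pairwise function `sigma`, and normalizing by the Euclidean norms are assumptions of this sketch; they are chosen so that the identity function yields the ordinary cosine.

```python
import math


def identity(a, b):
    """Pairwise similarity based on identity: 1 for equal terms, else 0."""
    return 1.0 if a == b else 0.0


def class_similarity(vec_a, vec_b, sigma):
    """Sum of weighted pairwise similarities over all term pairs,
    normalized by the Euclidean norms of the two weight vectors.

    vec_a, vec_b: dicts mapping term -> weight within one semantic class.
    sigma:        pairwise term similarity function for that class.
    """
    numerator = sum(
        weight_a * weight_b * sigma(term_a, term_b)
        for term_a, weight_a in vec_a.items()
        for term_b, weight_b in vec_b.items()
    )
    norm = (math.sqrt(sum(w * w for w in vec_a.values()))
            * math.sqrt(sum(w * w for w in vec_b.values())))
    return numerator / norm if norm else 0.0


cosine_like = class_similarity({"x": 1.0, "y": 1.0}, {"x": 1.0}, identity)
```

Replacing `identity` with a taxonomy-based function lets non-identical but related terms contribute to the numerator as well.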

Continuing the example of Figure 2, Figure 4 shows a simple representation of a multivector A of target terms and a multivector B of reference terms. The semantic environment comprises three classes N (names), L (places) and M (miscellaneous). According to the invention, the terms of the two multivectors are divided into these semantic classes and the two multivectors are compared class by class. A specific similarity function is used in each semantic class. For example, TFIDF could be used in semantic class N, COVER or Resnik in semantic class L, and TFIDF in semantic class M. If the similarity of names is assumed to be 0.39, the similarity of places 0.35 and the similarity of miscellaneous terms 0.73, the similarity vector would be v = (0.39; 0.35; 0.73).
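The class-by-class comparison of two multivectors can be sketched as follows. For brevity every class uses the same placeholder comparator (set overlap); the comparator, the dictionary layout and the example terms are assumptions, and in practice each class would use its own function such as TFIDF, COVER or Resnik.

```python
def overlap(terms_a, terms_b):
    """Placeholder class comparator: shared terms over all distinct terms."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# One comparator per semantic class; the key order fixes the order of the
# components in the similarity vector.
COMPARATORS = {"N": overlap, "L": overlap, "M": overlap}


def similarity_vector(multi_a, multi_b, comparators):
    """Compare two multivectors class by class and return the vector v."""
    return tuple(
        compare(multi_a.get(cls, []), multi_b.get(cls, []))
        for cls, compare in comparators.items()
    )


# Hypothetical target (A) and reference (B) multivectors.
A = {"N": ["Zidane", "Materazzi"], "L": ["Berlin"], "M": ["headbutt", "final"]}
B = {"N": ["Zidane"], "L": ["Paris"], "M": ["headbutt", "final"]}

v = similarity_vector(A, B, COMPARATORS)
```

The result has one component per semantic class, so the names, places and miscellaneous terms each contribute their own similarity value instead of being collapsed into a single score.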

To reach a decision based on comparisons in several semantic classes, the threshold is also essentially multidimensional. In an environment with n semantic classes, an n-dimensional threshold, a hyperplane, is needed to separate similarity from dissimilarity. The assumption is that within a semantic environment similar documents are fundamentally similar in the same way, and that similar and dissimilar vectors form learnable clusters in the given vector space. On the basis of training material containing relevant and non-relevant pairs of multivectors, a hyperplane can be determined for use in subsequent classifications of similarity vectors. The similarity vectors obtained from comparisons of multivector pairs are mapped into the space of similarity vectors, and the decision on similarity is based on the relationship between the hyperplane and the coordinates of the similarity vector.

For example, as shown in Figure 4, plus signs mark coordinates corresponding to similarity vectors computed from multivectors of target terms and reference terms that are considered sufficiently similar in terms of relevance. Correspondingly, minus signs mark coordinates corresponding to similarity vectors of multivectors that are considered dissimilar in terms of relevance. In the embodied case the similarity space is divided into parts, and the comparison with the hyperplane returns a result value R indicating the part of the similarity space to which the similarity vector belongs. In the embodied case the hyperplane divides the similarity space into two parts, and it is only necessary to know which of the two parts the similarity vector belongs to. It is clear to a person skilled in the art that other methods of dividing the similarity space are possible, and that more than two parts may be used.

In general, once a hyperplane w has been trained, the distance $\psi(v)$ between a new vector v and the hyperplane $(w, w_0)$ can be computed with their inner product:

$$\psi(v) = \langle w, v \rangle + w_0 = \sum_{i=1}^{n} w_i v_i + w_0,$$

where $w_1, w_2, \ldots, w_n$ are the weights of the semantic classes and $w_0$ is the bias. The weight of a semantic class can be determined by running comparisons of document pairs in the training data and taking class-wise random samples from several comparisons. The samples can be used to train a perceptron, and the trained perceptron can be applied to all other samples.

Cross-validation yields several candidate weight vectors, which can be averaged to obtain the weight vector w used when evaluating a similarity vector against the hyperplane.
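Training the hyperplane and applying it to a similarity vector can be sketched as below. The training samples, the learning rate, the number of epochs and the leave-one-out way of producing candidate hyperplanes are assumptions for illustration; the description only states that a perceptron is trained on samples and that candidates from cross-validation are averaged.

```python
def psi(v, w, w0):
    """psi(v) = <w, v> + w_0: signed score of v against hyperplane (w, w_0)."""
    return sum(wi * vi for wi, vi in zip(w, v)) + w0


def train_perceptron(samples, epochs=20, lr=0.1):
    """Classic perceptron rule on (similarity_vector, label) pairs,
    where label is +1 (relevant pair) or -1 (non-relevant pair)."""
    w = [0.0] * len(samples[0][0])
    w0 = 0.0
    for _ in range(epochs):
        for v, label in samples:
            if label * psi(v, w, w0) <= 0:  # misclassified: move hyperplane
                w = [wi + lr * label * vi for wi, vi in zip(w, v)]
                w0 += lr * label
    return w, w0


def average_hyperplanes(candidates):
    """Component-wise mean of several (w, w0) candidates."""
    k = len(candidates)
    n = len(candidates[0][0])
    w = [sum(c[0][i] for c in candidates) / k for i in range(n)]
    w0 = sum(c[1] for c in candidates) / k
    return w, w0


# Hypothetical labelled similarity vectors (names, places, miscellaneous).
SAMPLES = [
    ((0.9, 0.8, 0.7), +1), ((0.8, 0.9, 0.9), +1),
    ((0.1, 0.2, 0.1), -1), ((0.2, 0.1, 0.3), -1),
]

# One candidate per leave-one-out subset, then the averaged hyperplane.
candidates = [train_perceptron(SAMPLES[:i] + SAMPLES[i + 1:])
              for i in range(len(SAMPLES))]
w, w0 = average_hyperplanes(candidates)

# Result value R for the similarity vector of the running example.
R = 1 if psi((0.39, 0.35, 0.73), w, w0) > 0 else -1
```

The sign of `psi` tells on which side of the hyperplane the similarity vector lies, which is all that the two-part division of the similarity space requires.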

Figure 5 shows a procedural model of the operation of the embodied system. Process P1 denotes the step in which the multivectors A and B are compared in semantic classes that have specific similarity functions in order to compute the similarity vector v, as described above. Process P2 denotes the step in which the similarity vector v is compared with the hyperplane in order to determine the part of the vector space to which the similarity vector v belongs, as described above.

Process P3 denotes the following step, in which the function F to be executed by the system is determined on the basis of the obtained result value R. For this purpose the system is provided with a set of pre-programmed functions, each corresponding to a result value, so that when processes P1 and P2 yield a result value R, the system triggers the corresponding function F. It is clear to a person skilled in the art that the procedures of F may vary considerably from application to application.

In the embodied example the system is used to determine whether an incoming document is relevant with respect to a set of reference terms. The result value R can be determined from the similarity vector and the weights, as described above; a positive value is associated with relevant documents and a negative value with non-relevant documents. The system is thus provided with a function F1 that is triggered in response to positive values of R and a function F2 that is triggered in response to negative values of R. For example, in the positive case F1 may comprise a procedure that creates a record of the data object under analysis and stores the record in a list of relevant documents. The record comprises at least an address from which the data object can be retrieved, and preferably additional metadata suitable for organizing the relevant documents according to a given classification factor (date, source, etc.). Correspondingly, F2 may comprise a procedure for discarding the data object under analysis. The retrieval address can later be used as the routing address of a download request that the user triggers by clicking a hyperlink in the result view of relevant documents in the user interface. This enables a method that gives users straightforward access to a continuously updated list of documents that are relevant to the filtering criterion formed by the reference terms.

As another example, F1 may comprise a procedure for routing the data object under analysis to a first address, and F2 may comprise a procedure for routing the data object under analysis to a second address.
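The selection of a pre-programmed function from the result value R can be sketched as a simple dispatch table. The function names, the record fields and the document dictionary are hypothetical; the sketch follows the example in which a positive result stores a record of the document and a negative result discards it.

```python
relevant_documents = []  # continuously updated list of relevant records


def store_record(doc):
    """Positive case: keep the retrieval address plus optional metadata."""
    relevant_documents.append({
        "address": doc["address"],
        "date": doc.get("date"),
        "source": doc.get("source"),
    })


def discard(doc):
    """Negative case: the data object is simply dropped."""
    return None


# Hypothetical table mapping each result value R to its function F(R).
FUNCTIONS = {+1: store_record, -1: discard}


def dispatch(doc, psi_value):
    """Derive R from the sign of the hyperplane score and run F(R)."""
    r = 1 if psi_value > 0 else -1
    FUNCTIONS[r](doc)
    return r
```

For the routing variant mentioned above, the two entries of `FUNCTIONS` would instead forward the data object to a first or a second address; the dispatch step itself stays the same.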

This provides a method by which data objects can be routed according to filtering based on the semantic meaning of their content.

As discussed above, the reference terms in multivector B may correspond to any set of input terms. It is apparent to a person skilled in the art that the reference terms may be formed, for example, from a set of search terms entered by a user of the system, or extracted from one or more documents referred to by a user of the system.

The flow chart of Figure 6 shows a simplified embodiment of the method in an embodiment of the system according to the invention. A more detailed description of the steps can be found in the earlier descriptions of Figures 1-4.

Initially, a set of reference terms is arranged in the system (step 60). As discussed above, the set of reference terms may correspond, for example, to a set of search terms entered by a user of the system, or the terms may be extracted from a document or averaged from a set of documents. The analysis of a new data object begins with the reception of the data object DOC_A (step 61). The data object is parsed (step 62), and the terms that are relevant in the selected semantic environment are extracted from the data. These target terms extracted from DOC_A are assembled (step 63) into a multivector A and arranged into the semantic classes applicable in the selected semantic environment. A set of predetermined reference terms, arranged into a multivector B that applies the same semantic classes, is retrieved for processing. The comparison is performed in each semantic class according to a similarity function specific to that semantic class, and as a result a similarity vector v is determined (step 64), which describes the semantic similarity between the multivector A of target terms and the multivector B of reference terms. For the semantic environment there exists a hyperplane w, which has been produced in advance from a selected sample of training material. The parameters of the hyperplane w of the semantic environment are retrieved for processing, and the similarity vector v is compared with the hyperplane in order to determine a result value R (step 65), which indicates the part of the vector space to which the similarity vector v belongs. Each applicable result value R corresponds to a predetermined function of the system, and the function F(R) corresponding to the determined result value R is triggered (step 66). The system then moves to a standby state, in which the reception or input of a new data object is checked (step 67). When a new data object arrives (step 68), the procedure continues from step 61.
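By way of illustration only, one pass through steps 63 to 66 could be sketched in Python as follows. The description does not fix the similarity functions, the hyperplane parameters or the triggered functions, so the Jaccard overlap used for every class, the weights `w`, the offset `b`, the class names and the `store`/`discard` actions are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Set

Multivector = Dict[str, Set[str]]  # semantic class -> terms of that class
Similarity = Callable[[Set[str], Set[str]], float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Placeholder similarity function (an assumption): share of common terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_vector(A: Multivector, B: Multivector,
                      classes: List[str], sims: Dict[str, Similarity]) -> List[float]:
    """Step 64: one element per semantic class, each from that class's function."""
    return [sims[c](A.get(c, set()), B.get(c, set())) for c in classes]


def result_value(v: List[float], w: List[float], b: float) -> int:
    """Step 65: side of the hyperplane (w, b) on which v lies, as +1 or -1."""
    score = b + sum(wi * vi for wi, vi in zip(w, v))
    return 1 if score > 0 else -1


def run_filter(A, B, classes, sims, w, b, actions):
    """Steps 64-66: compare, determine R, trigger the function F(R)."""
    v = similarity_vector(A, B, classes, sims)
    R = result_value(v, w, b)
    return actions[R](A)


# Hypothetical semantic environment and data, for illustration only.
CLASSES = ["names", "places", "times"]
SIMS = {c: jaccard for c in CLASSES}
B = {"names": {"zidane", "materazzi"}, "places": {"berlin"}, "times": {"2006-07-09"}}
DOC_A = {"names": {"zidane"}, "places": {"berlin"}, "times": set()}
ACTIONS = {1: lambda doc: "store", -1: lambda doc: "discard"}

outcome = run_filter(DOC_A, B, CLASSES, SIMS, [1.0, 1.0, 1.0], -1.0, ACTIONS)
```

With these illustrative values the similarity vector is (0.5, 1.0, 0.0), the inner product with the assumed hyperplane is positive, and the sketch selects the `store` action.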

In a further embodiment of the invention, the filter is formed from a multivector of reference terms, and the multivector is calculated as an average of terms obtained from a set of documents that deal with a particular topic. For example, a set of documents can be collected that identifiably deal with a particular topic, for example 'Zidane headbutt'. From the documents, a set of terms can be extracted and determined that forms the mean vector of the documents relevant to that particular case. The mean vector thus does not represent any single reference document, but forms a combination of terms that brings together the semantic context around the topic in an optimal manner. The use of a mean vector also allows the averaged terms to be weighted in a selected manner. As a simplified example, all terms in the set of documents considered relevant to a particular event can be included in the set of reference terms. The weight of each term can then be set to the average of the number of times the term occurs in the documents. For example, if the term 'headbutt' occurs once in one document and seven times in another, the weight of the term in the mean vector would be the average of one and seven, that is, four. By using the mean vector as the vector of reference terms, the system is able to detect and capture new documents that deal with the case.
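The simplified weighting example above can be reproduced with a few lines of Python. This is only a sketch: it assumes that each document has already been reduced to a list of its terms, and that a document in which a term does not occur contributes a count of zero to that term's average. The function name `mean_vector` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def mean_vector(docs: List[List[str]]) -> Dict[str, float]:
    """Weight of each term = its average number of occurrences per document."""
    if not docs:
        return {}
    totals: Counter = Counter()
    for doc in docs:
        totals.update(doc)  # add this document's term counts
    return {term: count / len(docs) for term, count in totals.items()}


# Two hypothetical documents about the same event.
docs = [
    ["headbutt", "zidane"],
    ["headbutt"] * 7 + ["zidane", "final"],
]
weights = mean_vector(docs)
```

Here 'headbutt' occurs once and seven times, so its weight is 4.0, matching the example in the text.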

The use of a mean vector provides a further advantage in the form of dynamic adaptivity of the system. Adaptivity in this context means that the semantic interpretation of the terms can be adjusted through the results of the analysis performed. For example, the topic being monitored typically evolves, and the amount of information accumulates during the monitoring period. In the adjustment, after a relevant document has been detected, the mean vector is recalculated using the new relevant document as one of the relevant documents. Through the mean vector, new terms and expressions used in describing the monitored case are dynamically updated into the semantic interpretation of the data objects. Through the feedback from numerous iterations, the semantic interpretation of the topic can be continuously focused, which in turn improves the operation of the system.

The flow chart of Fig. 7 illustrates the dynamic adjustment of filtering. In the case used as the embodiment, the system is provided (step 70) with a set of reference terms that represents the average of a set of data objects known to be relevant to a given topic. Steps 71-75 correspond substantially to steps 61-65 of Fig. 6. When the result value R, which indicates the possible similarity or dissimilarity of the data with respect to the filtering criterion formed by the reference terms, is available, the system checks (step 76) whether the result is suitable for adjusting the mean vector in use. The condition for the adjustment decision may vary from application to application. For example, the adjustment can be made on the basis of a result value that indicates high similarity with the mean vector (adjustment based on a positive finding). Alternatively, the adjustment can be made on the basis of a result value that indicates very weak similarity with the mean vector (adjustment based on a negative finding). It is clear to a person skilled in the art that other criteria can also be used within the scope of protection. If the system decides (step 77) that the data object can be used to focus the mean vector, it recalculates the mean vector, after which the new, updated mean vector will be used in the comparison with the target terms (step 78). If the update is not considered necessary, the system continues through steps 79-81 as in steps 66-68 of Fig. 6.
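The adjustment of steps 76 to 78 might be sketched as follows. The sketch assumes that the mean vector is a mapping from terms to average occurrence counts over `n` selected documents, that the result value R is +1 or -1, and that adjustment happens on a positive finding by default; the function names and the `adjust_on` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def update_mean(mean: Dict[str, float], n: int,
                doc_terms: Iterable[str]) -> Tuple[Dict[str, float], int]:
    """Recalculate the mean over n + 1 documents, the new one included."""
    counts = Counter(doc_terms)
    terms = set(mean) | set(counts)
    updated = {t: (mean.get(t, 0.0) * n + counts[t]) / (n + 1) for t in terms}
    return updated, n + 1


def maybe_adjust(mean: Dict[str, float], n: int, doc_terms: Iterable[str],
                 R: int, adjust_on: int = 1) -> Tuple[Dict[str, float], int]:
    """Steps 76-78: update the mean vector only if R meets the adjustment condition."""
    if R == adjust_on:
        return update_mean(mean, n, doc_terms)
    return mean, n


mean, n = {"headbutt": 4.0, "zidane": 1.0}, 2
new_mean, new_n = maybe_adjust(mean, n, ["headbutt", "headbutt", "materazzi"], R=1)
same_mean, same_n = maybe_adjust(mean, n, ["weather"], R=-1)
```

Passing `adjust_on=-1` would instead model adjustment on a negative finding; other conditions would need a different test in `maybe_adjust`.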

In addition, the use of a mean vector also allows the semantic interpretation to be adjusted on the basis of computational criteria that are relatively easy to determine during the operation of the system, or that are even available as a by-product of the calculation of the similarity functions. For example, as discussed above, the informativeness of a term can be determined from its frequency in a series of consecutive data objects. Occasionally, a term that occurs quite rarely in data objects may start to appear very often, and at such times the term itself becomes less informative.

For example, the term 'Lebanon' may be used relatively rarely for a long time, and the informativeness of the term remains at a certain level. During times of unrest, the number of data objects containing the term 'Lebanon' increases, but the data objects may in fact be relevant to different semantic topics, for example 'oil damages', 'UN operations', 'tourism'. To improve the accuracy of the dynamic filtering performed to detect relevant data objects, the informativeness of the term 'Lebanon' can be decreased or increased in relation to the frequency at which the term occurs in consecutive data objects.
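One way to make informativeness depend on recent frequency is sketched below. The text does not specify a formula, so the sliding window of recent data objects and the smoothed logarithmic score (in the style of an inverse document frequency) are assumptions, as are the class and method names.

```python
import math
from collections import deque
from typing import Iterable


class Informativeness:
    """Tracks how often terms occur in the most recent data objects."""

    def __init__(self, window: int = 100) -> None:
        # Only the last `window` data objects are remembered.
        self.recent: deque = deque(maxlen=window)

    def observe(self, terms: Iterable[str]) -> None:
        self.recent.append(set(terms))

    def score(self, term: str) -> float:
        """Higher when the term is rare in the window, 0.0 when it is in every object."""
        n = len(self.recent)
        df = sum(1 for doc in self.recent if term in doc)
        return math.log((n + 1) / (df + 1))


tracker = Informativeness(window=4)
for doc in (["lebanon"], ["oil"], ["tourism"], ["un"]):
    tracker.observe(doc)
calm = tracker.score("lebanon")    # the term occurs in one of four objects

for _ in range(4):
    tracker.observe(["lebanon", "unrest"])
busy = tracker.score("lebanon")    # the term now occurs in every object
```

In this sketch the score of 'Lebanon' drops once the term appears in every data object of the window, which mirrors the loss of informativeness described above.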

As another example, if the mean vector represents a collection of the common terms in the averaged relevant documents, it is possible to adjust the filtering by focusing the average on the most common terms. This reduces the possibility of false positive interpretations ('noise') during filtering.

As a further example, it is also possible that, as time passes, the terms used in data objects relevant to the same topic change significantly. This can be accommodated by adjusting the weight of the terms in the mean vector according to their frequency in consecutive data objects. For example, if a term has not occurred in a certain number of consecutive data objects, or during a certain period of time, its weight can be reduced. Occurrences of the term, or consecutive occurrences, correspondingly refresh the term by increasing its weight in the mean vector.
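A minimal sketch of such weight ageing and refreshing is given below. The multiplicative `decay` and `boost` factors and the `max_idle` threshold are hypothetical parameters chosen for illustration; the text leaves the actual adjustment rule open.

```python
from typing import Dict, Iterable, Tuple


def adjust_weights(weights: Dict[str, float], idle: Dict[str, int],
                   doc_terms: Iterable[str], max_idle: int = 3,
                   decay: float = 0.5, boost: float = 1.25
                   ) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Refresh terms seen in the new data object, age the ones that were not."""
    present = set(doc_terms)
    new_weights: Dict[str, float] = {}
    new_idle: Dict[str, int] = {}
    for term, weight in weights.items():
        if term in present:
            new_idle[term] = 0                    # occurrence refreshes the term
            new_weights[term] = weight * boost
        else:
            new_idle[term] = idle.get(term, 0) + 1
            # Reduce the weight once the term has been absent long enough.
            stale = new_idle[term] >= max_idle
            new_weights[term] = weight * decay if stale else weight
    return new_weights, new_idle


w, idle = {"lebanon": 2.0, "oil": 1.0}, {}
w, idle = adjust_weights(w, idle, ["lebanon"], max_idle=2)
w, idle = adjust_weights(w, idle, ["lebanon", "tourism"], max_idle=2)
```

Terms that are not yet part of the mean vector, such as 'tourism' here, are ignored by this sketch; adding them would be a matter for the recalculation of the mean vector itself.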

These methods for adjusting the mean vector are merely exemplary and should not be construed as limitations of the scope of protection. It is obvious to a person skilled in the art that the mean vector can be adjusted in many different ways.

The space of the vectors is rarely linearly separable, and the number of semantic classes may sometimes be rather small. To improve learnability, in a preferred embodiment of the invention the vector v is expanded with a power set:

$$\Phi(v) = (1, v_1, v_2, \ldots, v_n, v_1 v_2, v_1 v_3, \ldots, v_1 v_2 \cdots v_n)$$

where

$$\Phi: \mathbb{R}^n \to \mathbb{R}^{2^n}$$

maps the vector v into a space in which the 2^n dimensions correspond to the subsets of the semantic classes.

For example, in a semantic environment that uses the semantic classes N, L, M, data objects A and B can be processed into multivectors A(n1, l1, m1) and B(n2, l2, m2), as described above. Denoting the similarity functions of the classes N, L, M by S_N, S_L, S_M, respectively, exemplary values of the similarity vector v could be:

V(N) = S_N(n1, n2) = 0.2
V(M) = S_M(m1, m2) = 0.3
V(L) = S_L(l1, l2) = 0.5

As described above, in order to determine whether the similarity vector v indicates sufficient semantic similarity (referred to here as relevance) between data objects A and B, a way of interpreting the similarity of vectors in the semantic environment in use is needed. Thus, several data objects that are known to be sufficiently similar in terms of relevance are compared in the same way, and a record of the calculated similarity vectors v and the corresponding parameter values, such as 1 and -1, is stored in the training material. This produces a set of vector/value pairs

(<v1, 1>, <v2, -1>, <v3, -1>, ..., <vn, 1>)

On the basis of statistical analysis or machine learning, it is possible to determine a weight vector w, by means of which newly determined similarity vectors can be detected as representing relevance or non-relevance. The decision can be based on the distance of v from w, calculated by means of the inner product. A positive value of the inner product can be interpreted as indicating relevance, and a negative value as indicating non-relevance.
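The text leaves the learning method open ("statistical analysis or machine learning"). As one possible illustration, the sketch below fits w to labelled vector/value pairs with a simple perceptron and then classifies by the sign of the inner product. The perceptron, the training pairs and the function names are assumptions of this sketch; the constant element 1 is prepended to each vector so that the first weight acts as the offset of the hyperplane.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], int]  # (similarity vector, label +1 or -1)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def train_perceptron(pairs: List[Pair], epochs: int = 100) -> List[float]:
    """Return weights w (w[0] is the offset) that separate the labelled vectors.

    Convergence is only guaranteed when the pairs are linearly separable.
    """
    w = [0.0] * (len(pairs[0][0]) + 1)
    for _ in range(epochs):
        errors = 0
        for v, label in pairs:
            x = [1.0] + list(v)
            if label * dot(w, x) <= 0:          # misclassified: move the hyperplane
                w = [wi + label * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            break
    return w


def is_relevant(w: Sequence[float], v: Sequence[float]) -> bool:
    """Positive inner product is read as relevance, negative as non-relevance."""
    return dot(w, [1.0] + list(v)) > 0


# Hypothetical training material: similarity vectors with known relevance.
pairs: List[Pair] = [
    ((0.9, 0.8, 0.7), 1),
    ((0.8, 0.9, 0.6), 1),
    ((0.1, 0.2, 0.1), -1),
    ((0.2, 0.1, 0.3), -1),
]
w = train_perceptron(pairs)
```

Any other linear classifier could be substituted for the perceptron here; the decision rule by the sign of the inner product stays the same.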


It has been found that three or four semantic classes, which could well be used for semantic analysis, do not necessarily provide a sufficient number of classes for machine learning. To improve the accuracy of the comparisons made with respect to the hyperplane, the vectors can be expanded. In this embodiment, the expansion is made with the dimensions N*M, N*L, M*L, N*M*L. In many cases, determining the distance between the similarity vector and the hyperplane requires that the offset of the hyperplane w be taken into account. To streamline the calculation, it is usually assumed that w[0]=b, and a neutral element v[0]=1 is included in v. The expanded vector thus becomes

V' = (1, 0.2, 0.3, 0.5, 0.06, 0.1, 0.15, 0.03)
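The expansion can be checked with a short Python sketch that multiplies together the elements of every subset of the similarity vector, starting from the empty subset (which yields the neutral element 1). The function name `expand` is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations
from math import prod
from typing import List, Sequence


def expand(v: Sequence[float]) -> List[float]:
    """Map v to the products over all subsets of its elements (2**n values)."""
    expanded: List[float] = []
    for size in range(len(v) + 1):
        for subset in combinations(v, size):
            expanded.append(prod(subset))   # the empty subset gives 1
    return expanded


# Similarity values in the order (N, M, L) used in the example above.
v_expanded = expand([0.2, 0.3, 0.5])
```

For the example values the result has eight elements, in the order 1, N, M, L, N*M, N*L, M*L, N*M*L, and agrees with V' up to floating-point rounding.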

It has been found that in the eight-dimensional space a weight vector w' that separates the positive and negative cases is easier to find. The above example is provided only to illustrate the described embodiments. The application of other expansion methods is obvious to a person skilled in the art.

In one aspect, the invention provides a computer program product encoding a computer program of instructions for executing a computer process.

In another aspect, the invention provides a computer-readable computer program distribution medium encoding a computer program of instructions for executing a computer process.

The distribution medium may include a computer-readable medium, a program storage medium, a recording medium, a computer-readable memory, a computer-readable software distribution package, a computer-readable signal, a computer-readable telecommunication signal, and/or a computer-readable compressed software package.

Embodiments of the computer process are shown and described in connection with Fig. 6. The computer program may be executed in the control unit of a computing node that receives the data objects.

Although the invention has been described above with reference to the example according to the accompanying drawings, it is clear that the invention is not restricted thereto but can be modified in many ways within the scope of the appended claims.


Claims

1. A method for filtering data objects from a data object stream, a data object comprising a plurality of terms, a term being a separately identifiable character string to which a meaning is assigned in a semantic environment, characterized in that the method comprises: dividing the terms in the semantic environment into semantic classes; determining (60) a set consisting of one or more reference terms; determining (63) from the data object a set of one or more target terms; calculating (64) a relevance vector between the reference terms and the target terms, wherein an element of the relevance vector is calculated with a predetermined relevance function specific to a semantic class, which function is arranged to return an element value corresponding to the semantic congruence of the semantic class between the reference terms and the target terms; comparing (65) the position of the relevance vector with a predetermined relevance hyperplane, which divides the relevance vector space into parts, each part of the relevance vector space enclosing a subspace of the relevance vector space and corresponding to a determined filtering function; determining (65) for the data object a filtering function according to the part in which the vector is positioned; and executing (66) the filtering function determined for the data object; the method further comprising: forming the set of reference terms from a mean vector, which is determined from a set of data objects selected from the data object stream and whose components correspond to the mean values of the terms of the selected data objects; determining a part of the relevance vector space that relates to positions significant for the mean value calculation; and, in response to the position of the relevance vector of the data object being significant for the mean value calculation, determining (78) the mean vector again such that said data object is included in the set of selected data objects.

2. A method according to claim 1, characterized in that a weight is added to one or more components of the mean vector; and the weights of the components of the mean vector are adjusted on the basis of a statistical analysis relating to the occurrence of the terms in a series of successive data objects.

3. A method according to claim 1 or 2, characterized in that the comparison step comprises calculating the distance between the relevance vector and the hyperplane with an inner product, and that positive values are considered to indicate that the vector is in a first part and negative values are considered to indicate that the vector is in a second part.

4. A method according to any of the preceding claims 1-3, characterized in that the semantic classes comprise at least one of the following classes: places, times and names.

5. A method according to any of the preceding claims 1-4, characterized in that the predetermined semantic relevance in the semantic class relating to places is calculated on the basis of a hierarchical matrix, each level of which corresponds to a type of hierarchical location determination.

6. A method according to claim 5, characterized in that the predetermined semantic relevance is calculated by dividing the length of the common path in the hierarchical matrix by the sum of the paths of the elements.

7. A method according to any of the preceding claims 1-6, characterized in that the comparison step comprises the use of additional dimensions generated from combinations of the semantic classes.

8. A system, comprising: an interface unit (11) for receiving a stream of data objects comprising a plurality of terms, a term being a separately identifiable character string to which a meaning is assigned in a semantic environment; a control unit (14) electrically connected to the interface unit, which control unit can be controlled at least in part with program code, characterized in that the program code comprises: program code which causes said system to divide the terms in the semantic environment into semantic classes; program code which causes said system to determine a set consisting of one or more reference terms (B); program code which causes said system to determine from the data object a set of one or more target terms (A); program code which causes said system to calculate (P1) a relevance vector (v) between the reference terms and the target terms, wherein the elements of the relevance vector are calculated with a predetermined relevance function specific to a semantic class, which function is arranged to return an element value corresponding to the semantic congruence of the semantic class between the reference terms and the target terms; program code which causes said system to compare (P2) the position of the relevance vector with a predetermined relevance hyperplane (w), which divides the relevance vector space into parts, each part of the relevance vector space enclosing a subspace of the relevance vector space and corresponding to a determined filtering function; program code which causes said system to determine (P3) for the data object a filtering function (F(R)) according to the part in which the vector is positioned and to execute the filtering function determined for the data object; program code which causes said system to form the set of reference terms from a mean vector (Bave), which is determined from a set of data objects selected from the data object stream and whose components correspond to the mean values of the terms of the selected data objects; program code which causes said system to determine from the relevance vector space a part that relates to positions significant for the mean value calculation; and program code which causes said system, in response to the position of the relevance vector of the data object being significant for the mean value calculation, to determine the mean vector again such that said data object is included in the set of selected data objects.

9. A system according to claim 8, characterized by program code which additionally causes the system to: add a weight to one or more components of the mean vector; and adjust the weights on the basis of a statistical analysis relating to the occurrence of the terms in a series of successive data objects.

10. A system according to claim 8 or 9, characterized in that the comparison (P2) comprises calculating the distance between the relevance vector and the hyperplane with an inner product, and that positive values are considered to indicate that the vector is in a first part and negative values are considered to indicate that the vector is in a second part.

11. A system according to any of the preceding claims 8-10, characterized in that the comparison step comprises the use of additional dimensions generated from combinations of the semantic classes.

12. A system according to any of the preceding claims 8-11, characterized in that the system is a network server.

13. A computer program product encoding a computer program of instructions for executing a computer process for filtering, from a data object stream, data objects comprising a plurality of terms, a term being a separately identifiable data unit to which a meaning is assigned in a semantic environment, characterized in that the process comprises: dividing the terms in the semantic environment into semantic classes; determining a set consisting of one or more reference terms; determining from the data object a set of one or more target terms; calculating a relevance vector between the reference terms and the target terms, wherein an element of the relevance vector is calculated with a predetermined relevance function specific to a semantic class, which function is arranged to return an element value corresponding to the semantic congruence of the semantic class between the reference terms and the target terms; comparing the position of the relevance vector with a predetermined relevance hyperplane, which divides the relevance vector space into parts, each part of the relevance vector space enclosing a subspace of the relevance vector space and corresponding to a determined filtering function; determining for the data object a filtering function according to the part in which the vector is positioned, and executing the filtering function determined for the data object; forming the set of reference terms from a mean vector, which is determined from a set of data objects selected from the data object stream and whose components correspond to the mean values of the terms of the selected data objects; determining a part of the relevance vector space that relates to positions significant for the mean value calculation; and, in response to the position of the relevance vector of the data object being significant for the mean value calculation, determining (78) the mean vector again such that said data object is included in the set of selected data objects.

14. A computer program distribution medium readable by a computer and encoding a computer program of instructions for executing a computer process for filtering, from a data object stream, data objects comprising a plurality of terms, a term being a separately identifiable data unit to which a meaning is assigned in a semantic environment, characterized in that the process comprises: dividing the terms in the semantic environment into semantic classes; determining a set consisting of one or more reference terms; determining from the data object a set of one or more target terms; calculating a relevance vector between the reference terms and the target terms, wherein an element of the relevance vector is calculated with a predetermined relevance function specific to a semantic class, which function is arranged to return an element value corresponding to the semantic congruence of the semantic class between the reference terms and the target terms; comparing the position of the relevance vector with a predetermined relevance hyperplane, which divides the relevance vector space into parts, each part of the relevance vector space enclosing a subspace of the relevance vector space and corresponding to a determined filtering function; determining for the data object a filtering function according to the part in which the vector is positioned, and executing the filtering function determined for the data object; forming the set of reference terms from a mean vector, which is determined from a set of data objects selected from the data object stream and whose components correspond to the mean values of the terms of the selected data objects; determining a part of the relevance vector space that relates to positions significant for the mean value calculation; and, in response to the position of the relevance vector of the data object being significant for the mean value calculation, determining the mean vector again such that said data object is included in the set of selected data objects.

15. The computer program distribution medium according to claim 14, wherein the computer program distribution medium comprises a computer-readable medium, a program storage medium, a recording medium, a computer-readable memory, a computer-readable software distribution package, a computer-readable signal, a computer-readable telecommunication signal, and/or a computer-readable compressed software package.