NO327318B1

NO327318B1 - Method for improving the efficiency of a search engine

Info

Publication number: NO327318B1
Application number: NO20080836A
Authority: NO
Inventors: Johannes Gehrke; Robbert Van Renesse; Fred Schneider
Original assignee: Fast Search & Transfer Asa
Priority date: 2008-02-15
Filing date: 2008-02-15
Publication date: 2009-06-08
Also published as: NO20080836A

Abstract

In a method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, the search engine comprises a group of search nodes hosted on one or more servers. An index of the stored documents is created. The search engine processes a search query from a user and returns a result set of documents matching the query. The search engine's index is configured on the basis of one or more document properties and is partitioned, replicated and distributed over the group of search nodes. Search queries are processed on the basis of the distributed index. The method realizes a framework for distributing the index of a search engine across a number of hosts in a computer cluster and is based on three orthogonal mechanisms for index distribution, namely index partitioning, index replication and assignment of the replicas to the hosts. This yields different ways of configuring the index of a search engine and provides greatly improved resource utilization combined with any desired degree of fault tolerance.

Description

The invention relates to a method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, wherein an indexing system of the search engine crawls the stored documents and generates an index for them, wherein applying a user's search query to the index returns to the user a result set with at least some of the documents matching the query, and wherein the search engine comprises a group of search nodes hosted on one or more servers.

In particular, the invention shows how a new framework for index distribution in a search engine, and more particularly in an enterprise search engine, can be formed.

Building a search engine is challenging for several reasons:

• Performance. The latency of computing a query response must be very low, and the enterprise search engine must support a high query rate.
• Scalability. Performance must scale with the number of documents and with the arrival rate of queries.
• Fault tolerance. The search engine must maintain high availability and high performance even in the presence of hardware failures.

To meet these three requirements, search engines use sophisticated methods for distributing their indexes over a possibly large cluster of hosts.

Prior art

An overview and discussion of the prior art relevant to the present invention will now be given. All literature references are identified by abbreviations in brackets at the appropriate place in the following. A complete bibliography is given in an appendix at the end of the description.

To improve the efficiency of search systems, much research has recently been carried out on the distribution of search engine indexes. Earlier work was concerned with how posting lists should be distributed and examined the trade-off between distributing posting lists by index term (here also called keyword) and by document [Bad01, MMR00, RNB98, TGM93, CKE+90, MWZ06]. The present invention starts from the insight that a global choice between these two alternatives is suboptimal because the statistical properties of keywords and documents vary in a typical search environment. That "one size does not fit all" can be illustrated by two examples:

• For a keyword k whose posting list fits on a single memory page, distributing k's posting list over several hosts will actually increase the response time of queries involving k, because many hosts will be involved in retrieving the posting list even though a single host could retrieve it with a single memory access. For a keyword k' whose posting list does not fit on a single memory page, however, distributing the posting list of k' over the set of hosts will reduce the response time, since different parts of the posting list can be retrieved in parallel.
• For an unpopular keyword k that occurs in only a few queries, replicating its posting list is a waste of resources: there is little opportunity for parallelism in query execution, so few queries will ever read k's posting list in parallel from different hosts. The posting list of a popular keyword k' will, however, be accessed by many queries and should therefore be replicated to enable parallelism.
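The two examples above amount to a per-keyword decision rather than a global one. The following Python sketch illustrates that idea only; the page size, the popularity threshold, the replica cap and all names (`KeywordStats`, `choose_layout`) are assumptions made for illustration and do not come from the patent text.

```python
from dataclasses import dataclass

# All three constants are illustrative assumptions, not values from the patent.
PAGE_SIZE_BYTES = 4096          # assumed size of one memory page
POPULAR_QUERY_FRACTION = 0.01   # assumed share of queries that makes a keyword "popular"
MAX_REPLICAS = 3                # assumed cap on replicas of one posting list


@dataclass
class KeywordStats:
    keyword: str
    posting_list_bytes: int   # size of the keyword's posting list
    query_fraction: float     # fraction of queries that contain the keyword


def choose_layout(stats: KeywordStats, num_hosts: int) -> dict:
    """Choose partition and replica counts for a single keyword."""
    if stats.posting_list_bytes <= PAGE_SIZE_BYTES:
        # One host can fetch the list with a single page access,
        # so splitting it would only add coordination cost.
        partitions = 1
    else:
        # Spread the list so that its parts can be read in parallel.
        pages = -(-stats.posting_list_bytes // PAGE_SIZE_BYTES)  # ceiling division
        partitions = min(num_hosts, pages)

    if stats.query_fraction >= POPULAR_QUERY_FRACTION:
        # Many concurrent queries read this list, so replicas add parallelism.
        replicas = min(num_hosts, MAX_REPLICAS)
    else:
        # Few queries ever read this list concurrently.
        replicas = 1

    return {"partitions": partitions, "replicas": replicas}
```

A short, rarely queried posting list stays on one host in one copy, while a long and popular one is both split and replicated.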

To better understand the prior art, a brief discussion will be given of a search engine architecture as known in the art and currently in use, with reference to fig. 1, which shows a block diagram of a search engine as it will be known to persons skilled in the art, its most important subsystems, and its interfaces respectively to a content domain, i.e. the document repository that may be subjected to a search, and a client domain comprising all users posing search queries to the search engine for retrieval of query-matching documents from the content domain.

The search engine 100 according to the present invention, as known in the art, comprises various subsystems 101-107. The search engine can access document or content repositories located in a content domain or space, from which content can either be actively pushed into the search engine or pulled into the search engine by means of a data connector. Typical repositories include databases, sources made available via ETL (Extract-Transform-Load) tools such as Informatica, any XML-formatted repository, files from file servers, files from web servers, document management systems, content management systems, e-mail systems, communication systems, collaboration systems, and rich media such as audio, images and video. The retrieved documents are submitted to the search engine 100 via a content API (Application Programming Interface) 102. The documents are then analyzed in a content analysis stage 103, also termed a content pre-processing subsystem, in order to prepare the content for improved search and discovery operations. Typically, the output of this content analysis stage 103 is an XML representation of the input document. The output of the content analysis is used to feed the core search engine 101.

The core search engine 101 can typically be deployed across a server farm in a decentralized manner in order to allow processing of large document sets and high query loads. The core search engine 101 accepts user requests and produces lists of matching documents. The document ordering is usually determined according to a relevance model that measures the likely importance of a given document relative to the query. In addition, the core search engine 101 can produce additional metadata about the result set, such as summary information for document attributes. The core search engine 101 itself comprises further subsystems, namely an indexing subsystem 101a for crawling and indexing content documents and a search subsystem 101b for carrying out the actual search and retrieval. Alternatively, the output of the content analysis stage 103 can be fed into an optional alert engine 104. The alert engine 104 will have stored a set of queries and can determine which queries would have been satisfied by the given document input.

A search engine can be accessed from many different clients and applications, typically mobile or computer-based client applications. Other clients include PDAs and game devices. These clients, located in a client space or domain, submit requests to a query or client API 107 of the search engine. The search engine 100 will typically have a further subsystem in the form of a query analysis stage 105 for analyzing and refining the query in order to construct a derived query that can extract more meaningful information. Finally, the output of the core search engine 101 is typically further analyzed in another subsystem, namely a result analysis stage 106, in order to produce information or visualizations used by the clients. Both stages 105 and 106 are connected between the core search engine 101 and the client API 107, and, where the alert engine 104 is present, it is connected in parallel with the core search engine 101 and between the content analysis stage 103 and the query and result analysis stages 105; 106.

To improve the search speed of a search engine, international published patent application WO00/68834 proposes a search engine with a two-dimensional, linearly scalable parallel architecture for searching a collection of text documents D, wherein the documents can be divided into a number of partitions d_1, d_2, ..., d_n, wherein the collection of documents D is preprocessed in a text filtering system such that a preprocessed document collection D_p and corresponding preprocessed partitions d_p1, d_p2, ..., d_pn are obtained, wherein an index I can be generated from the document collection D such that for each preprocessed partition d_p1, d_p2, ..., d_pn a corresponding index i_1, i_2, ..., i_n is obtained, wherein searching a partition d of the document collection D takes place with a partition-dependent data set d_p,k, where 1 ≤ k ≤ n, and wherein the search engine comprises data processing units that form sets of nodes connected in a network. A first set of nodes comprises dispatch nodes N_α, a second set of nodes comprises search nodes N_β, and a third set of nodes comprises indexing nodes N_γ. The search nodes N_β are grouped in columns which via the network are connected in parallel between the dispatch nodes N_α and an indexing node N_γ. The dispatch nodes N_α are adapted for processing search queries and search answers, the search nodes N_β are adapted to contain search software, and the indexing nodes N_γ are adapted for generating indexes I for the search software. Optionally, acquisition nodes N_δ can be provided in a fourth set of nodes and adapted for processing search queries so that the dispatch nodes are relieved of this task. The two-dimensional scaling takes place respectively with a scaling of the data volume and a scaling of the search engine performance through a respective adaptation of the architecture.

The schematic layout of this scalable search engine architecture is shown in fig. 2, which illustrates the principle of two-dimensional scaling. An important advantage of this architecture is that the query response time becomes essentially independent of catalog size, since each query is executed in parallel on all search nodes. Furthermore, the architecture is inherently fault-tolerant, so that failures of individual nodes will not result in a system breakdown, only in a temporary reduction of performance.
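The row-and-column arrangement described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the architecture of WO00/68834 itself: it assumes that each column serves one document partition, that the nodes within a column are interchangeable replicas, and that a dispatcher contacts one live node per column. The names `SearchNode`, `build_grid` and `dispatch` are hypothetical.

```python
import random
from typing import Dict, List, Set

PartitionIndex = Dict[str, Set[int]]  # term -> ids of documents in one partition


class SearchNode:
    """One search node holding the index of a single partition."""

    def __init__(self, partition_index: PartitionIndex):
        self.index = partition_index
        self.alive = True

    def search(self, term: str) -> Set[int]:
        return self.index.get(term, set())


def build_grid(partition_indexes: List[PartitionIndex], rows: int) -> List[List[SearchNode]]:
    """Create one column per partition, with `rows` replica nodes in each column."""
    return [[SearchNode(idx) for _ in range(rows)] for idx in partition_indexes]


def dispatch(grid: List[List[SearchNode]], term: str) -> Set[int]:
    """Send the query to one live node in every column and merge the answers."""
    results: Set[int] = set()
    for column in grid:
        live = [node for node in column if node.alive]
        if not live:
            raise RuntimeError("a partition has no live search node")
        results |= random.choice(live).search(term)
    return results
```

In this model, adding columns accommodates more data and adding rows accommodates more queries; a failed node reduces capacity but leaves every partition reachable as long as one node in its column is alive.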

Although the architecture shown in fig. 2 provides multi-level data and functional parallelism, so that large volumes of data can be searched efficiently and very quickly by a large number of users simultaneously, it is encumbered with certain drawbacks and is therefore far from optimal. This is because the row-and-column architecture is based on a mechanical and rigid partitioning scheme that does not take into account modalities in the keyword distribution and in user behaviour, as expressed by the frequency distributions of search terms or keywords and by access patterns.

Furthermore, it is known from US Patent No. 7293016 B1 (Shakib et al., assigned to Microsoft Corporation) to arrange indexed documents in an index according to a static ranking and to partition the index according to that ranking. The index partitions are scanned progressively, starting with the partition containing the documents with the highest static rank, in order to locate documents containing a search term, and a score is computed on the basis of the set of documents found so far in the search and on the basis of the range of static ranks in the next partition to be scanned. The next partition is scanned to locate documents containing a search term when the computed score is above a target score. A search can be stopped when no more relevant results would be found in the next partition.
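The progressive scan with early stopping can be sketched as follows. This is only an illustration in the spirit of the description above, not the algorithm of US 7293016 B1: it assumes that a document's score is its static rank plus a query-dependent part bounded by `MAX_DYNAMIC_SCORE`, that the "computed score" for the next partition is the best score any of its documents could reach, and that the "target score" is the weakest score among the k hits held so far. All names are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Tuple

MAX_DYNAMIC_SCORE = 1.0  # assumed upper bound on the query-dependent part of a score


@dataclass
class Partition:
    max_static_rank: float
    # doc id -> (static rank, {term: query-dependent score for that term})
    docs: Dict[int, Tuple[float, Dict[str, float]]]


def progressive_search(partitions: List[Partition], term: str, k: int):
    """Scan partitions in descending static-rank order; stop when no later hit can enter the top k.

    Returns (hits as (score, doc id) in descending score order, number of partitions scanned).
    """
    top: List[Tuple[float, int]] = []  # min-heap, weakest of the current top k first
    scanned = 0
    for part in partitions:  # assumed to be ordered by descending static rank
        best_possible = part.max_static_rank + MAX_DYNAMIC_SCORE
        if len(top) == k and best_possible <= top[0][0]:
            break  # nothing in this or any later partition can improve the result
        scanned += 1
        for doc_id, (static_rank, term_scores) in part.docs.items():
            if term not in term_scores:
                continue
            score = static_rank + term_scores[term]
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))
    return sorted(top, reverse=True), scanned
```

The stopping test relies on the partitions being ordered by static rank: once the bound for one partition falls below the current k-th best score, the same holds for every partition after it.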

US Published Patent Application No. 2008/033943 A1 (Richards et al., assigned to BEA Systems, Inc.) concerns a decentralized search system with a central queue of document-based records, wherein a group of nodes is assigned to different partitions, indexes for a group of documents are stored in each partition, and the nodes in the same partition independently process document-based records from the central queue in order to build the indexes.

The prior art does not provide a design framework based on general notions of the distributional properties of keywords and queries, and thus does not achieve the flexibility of a design that does, with the resulting performance improvements and reduction in resource requirements.

In particular, the growth of indexes has been a concern, and a number of specific methods that elegantly handle online index construction have been developed [BCL06]. These techniques are orthogonal to the framework obtained by applying the method according to the present invention, as will become apparent from its detailed description.

The present invention does not consider specific ranking algorithms, as it is assumed that the user always wants all answers to the query. However, the approach can be extended in a straightforward manner to some of the recently developed ranking algorithms [RPB06, AM06, LLQ+07] and algorithms for new query models [CPD06, LT1T07, ZS07, DEFS06, TKT06, JRMG06, YJ06, KCMK06]. Algorithms for finding the best-matching answers to a query when matching functions are combined have also been the focus of much research [PZSD96, Fag99, MYL02]. These methods are, however, orthogonal to an index distribution framework as realized by the method of the present invention, and they can also be readily incorporated.

The techniques used by the present invention for processing queries over partitioned posting lists build on basic ideas from parallel database systems [DGG+86]; however, parallel database systems were developed for database management systems that store structured data, whereas the focus of the present invention is enterprise and Internet search, where queries are executed over collections of often unstructured or semi-structured documents.

There is also prior art concerning text query processing in peer-to-peer systems, where the goal is to coordinate loosely coupled hosts, with emphasis on finding search results without broadcasting a query to all hosts in the network [RV03, LLH+03, ODODg02, SMwW+03, CAN02, KRo02, SL02, TXM03, TXD03, BJR03, TD04]. The main assumption in these publications concerns the degree of coupling between the hosts and differs from the starting point of the present invention, which assumes that all hosts are tightly coupled and under the control of a single entity, e.g. in a cluster in an enterprise data center, which is the dominant architecture today. The conceptual framework on which the present invention is built maps directly onto this architecture by assuming a tightly coupled set of hosts.

In view of the shortcomings and drawbacks of the above-mentioned prior art, it is a primary object of the present invention to provide a method that significantly improves the performance of a search engine.

Another object of the present invention is to configure the index of a search engine, and in particular of an enterprise search engine, on the basis of the recognition that keywords and documents differ with respect to intrinsic as well as extrinsic properties, e.g. as given by modalities in search and access patterns.

Finally, it is an object of the present invention to optimize an index configuration with respect to inherent features of the search system itself as well as of its operating environment.

The above objects, as well as further features and advantages, are realized according to the present invention with a method characterized by configuring the index of the search engine on the basis of one or more document properties and at least one of a fault-tolerance level, a desired search performance, document meta-properties, and an optimal resource utilization; partitioning the index; replicating the index; distributing the partitioned and replicated index over the group of search nodes, such that the index partitions and their replicas are assigned to said one or more servers on which the group of search nodes is hosted; and processing search queries on the basis of the distributed index.

Further features and advantages of the present invention will be apparent from the appended dependent claims.

The present invention will be better understood with reference to the following detailed discussion of the general background and of actual embodiments, read in conjunction with the appended drawing, in which

fig. 1 shows a simplified block diagram of a search engine as known in the art and discussed above,

fig. 2 a diagram of a scalable search engine architecture as used in the prior-art search service AllTheWeb and discussed above,

fig. 3 the concept of a mapping function,

fig. 4 the concept of host assignment,

fig. 5 the concept of mapping functions for rows and columns, and

fig. 6 the concept of keyword classification.

To describe the present invention in full, some assumptions and preliminary considerations are discussed first. The new framework for index distribution made possible by the method according to the present invention is then described.

For the present invention, a simplified model of a search engine is introduced. The notation used is summarized in Table 1.

Table 1: Notation used in this patent application

There is a set of keywords $K = \{k_1, \ldots, k_n\}$ and a set of documents $D = \{d_1, \ldots, d_m\}$. Each document $d$ is a list of keywords and is identified by a unique identifier called a URL. An occurrence is a tuple $(k, u)$ indicating that the document associated with URL $u$ contains the keyword $k$. A document record is a tuple $(u, \mathit{date})$ indicating that the document associated with URL $u$ was created on the given date.

In practice, an occurrence contains other data as well, for example the position of the keyword in the document, or data useful for determining the rank of the document in the result of a query. A document likewise has other associated metadata besides the document record, e.g. an access control list. None of these matters affects the aspects of the index discussed in the following.

The index of a search engine consists of sets of occurrences and a set of document records. There is one set of occurrences for each keyword $k$, hereinafter called the posting set of keyword $k$. The posting set of keyword $k$ contains all occurrences of $k$, and only occurrences of $k$. To be consistent with the prior art, posting sets are assumed to be kept in a fixed order (for example, lexicographically by URL), and the ordered posting set of a keyword $k$ is called the posting list $PL(k)$ of keyword $k$ in the following. The set of document records contains exactly one document record for each document, and it contains only document records.
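As a rough illustration of this model, the following Python sketch groups occurrences into posting lists ordered lexicographically by URL. The function name `build_posting_lists` and the sample URLs are assumptions made for the example; they are not part of the described method.

```python
from collections import defaultdict

def build_posting_lists(occurrences):
    """Group (keyword, url) occurrences into posting lists sorted by URL."""
    posting_sets = defaultdict(set)
    for keyword, url in occurrences:
        posting_sets[keyword].add((keyword, url))
    # A posting list is the posting set kept in a fixed order,
    # here lexicographic order on the URL.
    return {k: sorted(s, key=lambda occ: occ[1]) for k, s in posting_sets.items()}

# Hypothetical sample data.
occurrences = [
    ("search", "http://b.example"),
    ("index", "http://a.example"),
    ("search", "http://a.example"),
]
posting_lists = build_posting_lists(occurrences)
```

Each posting list contains only occurrences of its own keyword, which mirrors the definition of a posting set above.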

Search queries and query processing are now discussed in some detail. Users issue queries, and a query $q$ consists of a set of keywords $q = \{k_1, \ldots, k_l\} \subseteq K$. The present invention assumes a query model in which the user wants to find every document that contains all keywords of the query. It can be assumed that the interarrival time of each query $q$ follows an exponential distribution and can thus be characterized by a single parameter $\lambda_q$, the arrival rate of query $q$. Note that this probabilistic query model implies that queries are independent. A query workload $\omega$ is a function that associates each query $q \in 2^K$ with an arrival rate $\lambda_\omega(q)$. From a query workload, the arrival rate $\lambda_\omega(k)$ of each keyword $k$ can be computed by summing over all queries that contain $k$, formally

\[ \lambda_\omega(k) = \sum_{q \in 2^K,\; k \in q} \lambda_\omega(q). \]
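The per-keyword arrival rate described above, obtained by summing over all queries that contain the keyword, can be sketched as follows. The representation of a workload as a dictionary from frozensets of keywords to rates, and the sample rates, are assumptions made for the example.

```python
from collections import defaultdict

def keyword_arrival_rates(workload):
    """Sum query arrival rates over all queries containing each keyword.

    `workload` maps a query (a frozenset of keywords) to its arrival rate.
    """
    rates = defaultdict(float)
    for query, rate in workload.items():
        for keyword in query:
            rates[keyword] += rate
    return dict(rates)

# Hypothetical workload with three distinct queries.
workload = {
    frozenset({"search", "engine"}): 2.0,
    frozenset({"search"}): 3.0,
    frozenset({"index"}): 0.5,
}
rates = keyword_arrival_rates(workload)
```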

The following simplified way of logically processing a query $q = \{k_1, \ldots, k_l\}$ is assumed. For each keyword $k_i$, $i \in \{1, \ldots, l\}$, its posting list $PL(k_i)$ is retrieved, and the intersection of the posting lists on their URL fields is computed. Formally, the following relational algebra expression computes the query result $\mathrm{QueryResult}(q)$ for a query $q = \{k_1, \ldots, k_l\}$:

\[ \mathrm{QueryResult}(q) = \pi_{\mathrm{URL}}\, PL(k_1) \cap \cdots \cap \pi_{\mathrm{URL}}\, PL(k_l) \]

There are more sophisticated ways of defining $\mathrm{QueryResult}(q)$; for example, the user may want to see only a subset of $\mathrm{QueryResult}(q)$, and may want to see that subset in ranked order.
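The basic, unranked form of the query result (project each posting list onto its URL field, then intersect) can be sketched in a few lines. The function name `query_result` and the sample posting lists are assumptions for illustration; ranking and subsetting are deliberately left out.

```python
def query_result(query, posting_lists):
    """Intersect the URL projections of the posting lists of all query keywords."""
    url_sets = [
        {url for _, url in posting_lists.get(keyword, [])}
        for keyword in query
    ]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))

# Hypothetical posting lists of (keyword, url) occurrences.
posting_lists = {
    "search": [("search", "u1"), ("search", "u2"), ("search", "u3")],
    "engine": [("engine", "u2"), ("engine", "u3")],
}
result = query_result({"search", "engine"}, posting_lists)
```

A keyword without any posting list yields an empty URL set, so the whole query result is then empty.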

Physical execution

The present invention assumes that a cluster of workstations is modeled as a set of hosts $H = \{h_1, h_2, \ldots\}$ [ACPtNt95]. Each host $h$ is further assumed to have a single disk with a fixed amount of storage of DiskSize units. Note that, to simplify the presentation, the translation of the abstract storage unit into a concrete unit such as bytes is omitted from this model. Each occurrence is assumed to have a fixed size of 1 unit. For a keyword $k$ and its posting list $PL(k)$, the size of the posting list, $|PL(k)|$, is defined as the number of occurrences in $PL(k)$.

Each host $h$ is assumed to deliver an associated overall performance that allows it to retrieve $\mathrm{buc}(h)$ units of storage within latencyBound milliseconds; this number is an aggregate measure that incorporates CPU speed, the amount of available main memory, and the latency and transfer rate of the host's disk. In the following, all hosts are furthermore assumed to have identical performance, so the dependence of $\mathrm{buc}(h)$ on $h$ can be dropped, and buc alone denotes the number of units that any host can retrieve within latencyBound milliseconds.

A framework for index distribution is now discussed. Specifically, the framework or architecture realized by the method according to the present invention comprises three aspects, namely partitioning, replication, and host assignment, as set out below.

Partitioning

For each keyword, its posting list is partitioned into one or more components. Posting lists are partitioned into components so that a posting list can be distributed over several hosts and all of its components retrieved in parallel.

Replication

For each keyword, each of its components is replicated a certain number of times, yielding several component replicas for each component. Component replicas are created for several reasons. The first is fault tolerance: if a host storing a component fails, the component can be read from another host. The second is improved performance: a query can retrieve a component from any of the hosts on which the component is replicated, so the load can be balanced.

Host assignment

After partitioning and replication, each component replica of a posting list is assigned to a host, subject to the restriction that no two component replicas of the same component (i.e. of the same partition of a posting list) are assigned to the same host. Host assignment allows the placement of components to be optimized globally across keywords. For example, components of keywords that commonly occur together in queries could be placed together to reduce the cost of query processing.

The three corresponding parts of the index distribution framework according to the method of the present invention are now introduced.

1. Partitioning the posting lists into components.

2. Replicating the components.

3. Mapping the components to hosts.

For the first part, a function numPartitions(·) is chosen that takes a keyword $k$ as input and returns the number of components into which the posting list $PL(k)$ is partitioned; the resulting components are $C_1(k), C_2(k), \ldots, C_{\mathrm{numPartitions}(k)}(k)$. In addition, a function occLoc(·) is chosen that takes an occurrence as input and outputs the number of the component in which that occurrence is located. Thus, if $\mathrm{occLoc}((k, u)) = i$, then $(k, u) \in C_i(k)$. Note that if $(k, u) \in PL(k)$, then $1 \le \mathrm{occLoc}((k, u)) \le \mathrm{numPartitions}(k)$.
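A minimal sketch of this first part follows. The text does not prescribe how occLoc(·) picks a component; hashing the URL with SHA-256 is an assumption made here purely for illustration, as are the names `occ_loc` and `partition` and the sample URLs.

```python
import hashlib

def occ_loc(occurrence, num_partitions):
    """Return the 1-based component number for a (keyword, url) occurrence.

    Assumption: the component is chosen by hashing the URL only.
    """
    _, url = occurrence
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions + 1

def partition(posting_list, num_partitions):
    """Split a posting list into components numbered 1..num_partitions."""
    components = {i: [] for i in range(1, num_partitions + 1)}
    for occ in posting_list:
        components[occ_loc(occ, num_partitions)].append(occ)
    return components

# Hypothetical posting list with eight occurrences, split into four components.
posting_list = [("k", f"http://example.com/{n}") for n in range(8)]
components = partition(posting_list, 4)
```

Every occurrence lands in exactly one component, so the components together reproduce the posting list.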

For the second part, a function numReplicas(·) is chosen that takes a keyword $k$ as input and returns the number of component replicas of each partition of the posting list of $k$. The original component is included in this count. For a keyword $k$ there are thus $\mathrm{numReplicas}(k) \cdot \mathrm{numPartitions}(k)$ component replicas in total. If a suitable selection of $\mathrm{numPartitions}(k)$ component replicas, one per component, is combined, together they comprise $PL(k)$; and for every component $C_i(k)$ there are $\mathrm{numReplicas}(k)$ identical component replicas. In particular, if keyword $k$ has an arrival rate $\lambda_\omega(k)$ in a workload $\omega$, and the load is balanced evenly over the $\mathrm{numReplicas}(k)$ component replicas, then the arrival rate of this keyword at each component replica is

\[ \frac{\lambda_\omega(k)}{\mathrm{numReplicas}(k)}. \]

For the third part, a function $\mathrm{hostAssign}(k, i, j)$ is chosen that takes as input a keyword $k$, a replica number $i$ and a component number $j$, and returns the host that stores component replica $i$ of component $j$ of the posting list $PL(k)$. Note that two identical component replicas (i.e. replicas of each other) must be mapped to different hosts. Formally, $\mathrm{hostAssign}(k, i_1, j) \ne \mathrm{hostAssign}(k, i_2, j)$ must hold for all $j \in \{1, \ldots, \mathrm{numPartitions}(k)\}$ and all $i_1, i_2 \in \{1, \ldots, \mathrm{numReplicas}(k)\}$ with $i_1 \ne i_2$.

Figures 3 and 4 show an example instance of the framework according to the present invention for a keyword $k$ whose posting list has eight occurrences: A, B, C, D, E, F, G and H. In the example, $\mathrm{numPartitions}(k) = 4$ (i.e. the posting list of $k$ is partitioned into four components) and $\mathrm{numReplicas}(k) = 3$ (i.e. each component has three component replicas). Five hosts $h_1, h_2, h_3, h_4$ and $h_5$ are given. The function hostAssign gives, for example, $\mathrm{hostAssign}(k, 1, 2) = h_1$, $\mathrm{hostAssign}(k, 2, 2) = h_2$ and $\mathrm{hostAssign}(k, 3, 1) = h_5$.
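The constraint that replicas of the same component must sit on different hosts can be checked mechanically. The sketch below uses the dimensions of the example (four components, three replicas, five hosts). Only the three host assignments quoted in the text are taken from the example; every other table entry, and the name `is_valid_assignment`, is invented for illustration.

```python
from itertools import combinations

def is_valid_assignment(host_assign, num_replicas, num_partitions):
    """Check that replicas of the same component sit on distinct hosts."""
    for j in range(1, num_partitions + 1):
        for i1, i2 in combinations(range(1, num_replicas + 1), 2):
            if host_assign[(i1, j)] == host_assign[(i2, j)]:
                return False
    return True

# Keys are (replica number i, component number j) for one keyword.
# Only (1, 2) -> h1, (2, 2) -> h2 and (3, 1) -> h5 come from the example;
# the remaining entries are hypothetical.
host_assign = {
    (1, 1): "h3", (2, 1): "h4", (3, 1): "h5",
    (1, 2): "h1", (2, 2): "h2", (3, 2): "h3",
    (1, 3): "h2", (2, 3): "h4", (3, 3): "h5",
    (1, 4): "h1", (2, 4): "h3", (3, 4): "h4",
}
valid = is_valid_assignment(host_assign, num_replicas=3, num_partitions=4)
```

A host may hold replicas of different components of the same keyword; only two replicas of the same component on one host violate the restriction.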

An instantiation of the three functions numPartitions(·), numReplicas(·) and hostAssign(·, ·, ·) is called an index configuration for a search engine.

Given the framework above, the physical model for processing a query $q$ can now be introduced. Processing a query $q$ involves three steps.

1. For each keyword $k \in q$, a set of hosts is identified such that the union of the component replicas stored on these hosts comprises $PL(k)$. If $\mathrm{numReplicas}(k) > 1$, there is more than one such set, and the choice among them can be based on other characteristics, e.g. the load on a host.

2. For each keyword $k \in q$, the selected component replicas are retrieved from all selected hosts.

3. $\mathrm{QueryResult}(q)$ is computed, which requires intersecting the different posting lists.

These three steps are now addressed in turn.

In the first step, note that for each keyword $k$ the function $\mathrm{hostAssign}(k, i, j)$ encodes the set of hosts on which all component replicas of the posting list of $k$ are stored.

In the second step, each host involved in processing query $q$ (as selected in the first step) retrieves all of its local component replicas for all keywords of the query.

In the third step, each host first intersects its local component replicas of all keywords. The results of the local intersections are then processed further to complete the computation of $\mathrm{QueryResult}(q)$.
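The three steps can be sketched end to end under simplifying assumptions. For brevity this sketch performs the intersection centrally rather than first per host, picks the least-loaded host for each component as one possible selection criterion, and uses invented names (`choose_hosts`, `process_query`) and sample data throughout.

```python
def choose_hosts(keyword, num_partitions, num_replicas, host_assign, load):
    """Step 1: for each component pick the least-loaded host holding a replica."""
    chosen = {}
    for j in range(1, num_partitions + 1):
        candidates = [host_assign[(keyword, i, j)] for i in range(1, num_replicas + 1)]
        chosen[j] = min(candidates, key=lambda h: load[h])
    return chosen

def process_query(query, config, storage, load):
    """Steps 2 and 3: fetch the chosen replicas and intersect on URL."""
    url_sets = []
    for keyword in query:
        parts, replicas, host_assign = config[keyword]
        chosen = choose_hosts(keyword, parts, replicas, host_assign, load)
        urls = set()
        for j, host in chosen.items():
            urls.update(url for _, url in storage[host][(keyword, j)])
        url_sets.append(urls)
    return sorted(set.intersection(*url_sets)) if url_sets else []

# Hypothetical configuration: "a" has 2 components and 1 replica,
# "b" has 1 component and 2 replicas, spread over two hosts.
config = {
    "a": (2, 1, {("a", 1, 1): "h1", ("a", 1, 2): "h2"}),
    "b": (1, 2, {("b", 1, 1): "h1", ("b", 2, 1): "h2"}),
}
storage = {
    "h1": {("a", 1): [("a", "u1")], ("b", 1): [("b", "u2"), ("b", "u3")]},
    "h2": {("a", 2): [("a", "u2")], ("b", 1): [("b", "u2"), ("b", "u3")]},
}
load = {"h1": 5, "h2": 1}
result = process_query(["a", "b"], config, storage, load)
```

Because keyword "b" is replicated on both hosts, step 1 is free to read it from the less loaded host, which is the load-balancing benefit of replication described earlier.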

The index design problem can now be defined as follows. A set of hosts with associated storage space DiskSize and performance buc is given. Also given are a set of keywords with posting lists $PL(k_1), \ldots, PL(k_n)$ of sizes $|PL(k_1)|, \ldots, |PL(k_n)|$, and a query workload $\omega$.

The index design problem is then to find functions numPartitions(·), numReplicas(·) and hostAssign such that the expected latency of answering a query $q$ is below latencyBound, where the expectation is taken over the set of all possible query sequences.

In the following, some embodiments are discussed by way of specific and exemplary instantiations of the framework.

1. AllTheWeb Rows and Columns

The AllTheWeb Rows and Columns architecture (named in tribute to the AllTheWeb search system described in the introduction above) is a trivial instantiation of the framework, cf. fig. 5, which shows the mapping functions for host assignment. In this architecture there is an array of hosts consisting of $r$ rows and $c$ columns. The array can be visualized as an $r \times c$ grid with one host per cell.
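One way to picture this array, assuming the host in row $i$ and column $j$ is labelled $h_{i,j}$ (a label introduced here for illustration only), is:

\[
\begin{array}{cccc}
h_{1,1} & h_{1,2} & \cdots & h_{1,c} \\
h_{2,1} & h_{2,2} & \cdots & h_{2,c} \\
\vdots  & \vdots  & \ddots & \vdots  \\
h_{r,1} & h_{r,2} & \cdots & h_{r,c}
\end{array}
\]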

By applying a hash function on URLs that is independent of the keyword, the postings of any keyword are partitioned approximately evenly into $c$ components. Each component is then replicated within its column, one component replica per row, which yields $r$ component replicas. To reconstruct the posting list of a keyword, one host from each column must be accessed, but these hosts need not all be chosen from the same row; this flexibility facilitates query load balancing between hosts and improves fault tolerance.

To connect this to the notation of the framework realized by the method according to the present invention, the three functions above must be instantiated. The rows-and-columns scheme implies that $\mathrm{numPartitions}(k) = c$ and $\mathrm{numReplicas}(k) = r$ hold for all keywords $k \in K$, and that for all URLs $u$ and all $k_1, k_2 \in K$ the following must hold:

\[ \mathrm{occLoc}((k_1, u)) = \mathrm{occLoc}((k_2, u)), \]

i.e. for every URL $u$, the function $\mathrm{occLoc}((k, u))$ is independent of the keyword $k$. The function hostAssign is also very simple: let $\mathrm{hostAssign}(k, i, j) = (i, j)$, where $i$ is the row and $j$ the column of the host in the $r \times c$ array. Note that if the number of columns $c$ is chosen appropriately, all components of any single keyword $k$ can be read in parallel within latencyBound. Since each host retrieves at most buc units within latencyBound and a component holds about $|PL(k)|/c$ occurrences, the smallest such $c$ is

\[ c = \left\lceil \frac{\max_{k \in K} |PL(k)|}{\mathrm{buc}} \right\rceil. \]

When query processing is performed in the AllTheWeb rows-and-columns architecture shown in fig. 2, only one host in each column needs to be involved, even for queries with several keywords, since the function $\mathrm{occLoc}((k, u))$ is independent of $k$.
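Why one host per column suffices can be seen in a small sketch: when the column of an occurrence depends only on its URL, a URL shared by several keywords always lands in the same column, so intersecting column by column and merging gives the same result as one global intersection. The use of CRC32 as the URL hash and the names `column_of` and `columnwise_intersection` are assumptions for illustration.

```python
import zlib

def column_of(url, c):
    """Keyword-independent occLoc: a column in 1..c chosen from the URL alone."""
    return zlib.crc32(url.encode("utf-8")) % c + 1

def columnwise_intersection(posting_urls, query, c):
    """Intersect within each column, then merge the per-column results."""
    result = set()
    for col in range(1, c + 1):
        local = [
            {u for u in posting_urls[k] if column_of(u, c) == col}
            for k in query
        ]
        result |= set.intersection(*local)
    return result

# Hypothetical URL sets for two keywords, overlapping on u20..u39.
posting_urls = {
    "search": {f"u{n}" for n in range(0, 40)},
    "engine": {f"u{n}" for n in range(20, 60)},
}
merged = columnwise_intersection(posting_urls, ["search", "engine"], c=4)
```

If occLoc(·) depended on the keyword, the same URL could sit in different columns for different keywords, and the per-column intersections would miss it.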

However, AllTheWeb Rows and Columns has a number of drawbacks. First, the number of hosts accessed for a keyword $k$ is independent of the length of $k$'s posting list; $c$ hosts must always be accessed, even for keywords with very short posting lists. Second, AllTheWeb Rows and Columns does not take the popularity of a keyword in the query workload into account; every component is replicated $r$ times even if the associated keyword is accessed relatively rarely. Third, changes to the physical layout of AllTheWeb Rows and Columns are restricted to adding hosts in multiples of $c$ or $r$ at a time, resulting in an additional row or an additional column in the architecture.

Adding a new row (of $c$ hosts) is relatively straightforward; adding a new column (of $r$ hosts), however, is non-trivial. To illustrate this point, consider an instance of AllTheWeb Rows and Columns with $r$ rows and $c$ columns that uses an associated function occLoc(·) with range $\{1, \ldots, c\}$. When another column is added, a new function occLoc'(·) with range $\{1, \ldots, c+1\}$ must be chosen, and in general

\[ \mathrm{occLoc}((k, u)) \ne \mathrm{occLoc}'((k, u)), \]

so that all posting lists need to be repartitioned according to occLoc'(·), which essentially amounts to rebuilding the entire index.
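The cost of adding a column can be made concrete with a small experiment. The sketch assumes occLoc(·) is a URL hash taken modulo the number of columns (CRC32 here, chosen only for illustration) and counts how many occurrences change component when going from $c$ to $c + 1$ columns; the names and sample URLs are hypothetical.

```python
import zlib

def occ_loc(url, c):
    """Keyword-independent component number in 1..c (hypothetical hash choice)."""
    return zlib.crc32(url.encode("utf-8")) % c + 1

def moved_fraction(urls, c):
    """Fraction of URLs whose component changes when going from c to c + 1 columns."""
    moved = sum(1 for u in urls if occ_loc(u, c) != occ_loc(u, c + 1))
    return moved / len(urls)

urls = [f"http://example.com/doc{n}" for n in range(10_000)]
fraction = moved_fraction(urls, c=4)
```

For an idealized uniform hash, going from 4 to 5 components leaves only about one occurrence in five in place, because a hash value $x$ satisfies $x \bmod 4 = x \bmod 5$ only when $x \bmod 20 < 4$. Most of the index therefore has to be moved.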

2. Fully Adaptive Rows and Columns

A solution according to the present invention is now described that takes into account both the differences in posting list sizes and the differences in keyword popularity in the query workload. The essence of this new solution is that AllTheWeb Rows and Columns is instantiated differently for each keyword: each keyword can have a different number of rows and columns. In other words, the method according to the present invention yields a solution with fully adaptive rows and columns.

Consider a keyword k. Start with an instantiation of numPartitions(k). Since each host can retrieve only bue units while still satisfying the global query latency requirement latencyBound, the posting list PL(k) is partitioned into components, each sized so that it can be read from a single host within the query latency requirement. Note that a keyword with a very short posting list yields one (or very few) component(s), whereas a keyword with a long posting list yields many components.

The question is now how many component replicas should be created for a keyword k. Recall that component replicas are created for fault tolerance and to distribute the query load across the hosts. To tolerate f unavailable hosts, numReplicas(k) ≥ f is enforced. To balance the query load, posting lists of keywords that are popular in the query workload are replicated more often than posting lists of rare keywords, so the number of replicas grows in proportion to the arrival rate of the keyword in the workload.

By making numPartitions(k) and numReplicas(k) different for each keyword k, a number of rows and columns specific to each keyword is obtained. The number of columns still gives the number of partitions, and the number of rows gives the number of replicas of each partition. However, keywords with long posting lists have many columns and keywords with short posting lists have few columns; popular keywords have many rows and unpopular keywords have few rows. Compared to AllTheWeb Rows and Columns, Fully Adaptive Rows and Columns results in less imbalance in component size across keywords. Each component replica is thus normalized in the sense that every component replica has approximately the same size (up to a difference of bue) and approximately the same arrival rate.
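A minimal sketch of how a per-keyword matrix shape could be derived follows. The text gives no closed-form rules, so both formulas are assumptions made for illustration: the number of partitions is taken as the posting list length divided by a per-host capacity, and the number of replicas as the arrival rate divided by a per-replica rate, never below f. The names `host_capacity` and `rate_per_replica` and all numbers are hypothetical.

```python
import math

def num_partitions(posting_list_len: int, host_capacity: int) -> int:
    """Assumed rule: enough components that each fits one host's capacity."""
    return max(1, math.ceil(posting_list_len / host_capacity))

def num_replicas(arrival_rate: float, rate_per_replica: float, f: int) -> int:
    """Assumed rule: scale with arrival rate, but never fewer than f replicas."""
    return max(f, math.ceil(arrival_rate / rate_per_replica))

# keyword -> (posting list length, arrival rate); made-up workload
keywords = {
    "rare_short": (800, 2.0),
    "rare_long": (45_000, 3.0),
    "hot_short": (900, 120.0),
    "hot_long": (60_000, 200.0),
}

# shape[keyword] = (rows, columns) = (numReplicas, numPartitions)
shape = {
    k: (num_replicas(rate, rate_per_replica=25.0, f=2),
        num_partitions(length, host_capacity=10_000))
    for k, (length, rate) in keywords.items()
}
for k, (rows, cols) in shape.items():
    print(f"{k}: {rows} rows x {cols} columns")
```

Long posting lists widen the matrix and popular keywords deepen it, so each keyword ends up with its own shape.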

There are several ways to assign component replicas to hosts. Given o hosts, one option is to hash each keyword k onto the numbers 1 to o and then assign its components sequentially (mod o) to hosts. Conceptually, this embeds the keyword-specific matrix of k, with numPartitions(k) columns and numReplicas(k) rows, sequentially into the o hosts. Formally, the assignment function has the following form. Let keywHash(·) be a function from K to {1, ..., o} with a property (equation not reproduced in this text) that holds for i ∈ {1, ..., o} and k ∈ K. The submatrix for keyword k can then be laid out row by row as follows:

$$\mathrm{hostAssign}(k, i, j) = \bigl(\mathrm{keywHash}(k) + (i - 1) \cdot \mathrm{numPartitions}(k) + (j - 1)\bigr) \bmod o,$$

where i ∈ {1, ..., numReplicas(k)} and j ∈ {1, ..., numPartitions(k)}.
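The row-by-row layout can be sketched as follows. The hash function here is a stand-in (SHA-256 reduced to 1..o), chosen only for illustration, since the text does not define keywHash concretely; `layout` is a hypothetical helper. Note that the formula as written yields host indices in 0..o-1.

```python
import hashlib

def keyw_hash(keyword: str, o: int) -> int:
    """Hypothetical keywHash: map a keyword to a value in 1..o."""
    digest = hashlib.sha256(keyword.encode()).digest()
    return int.from_bytes(digest[:8], "big") % o + 1

def host_assign(keyword: str, i: int, j: int, num_partitions: int, o: int) -> int:
    """Host index for replica row i and partition column j (both 1-based)."""
    return (keyw_hash(keyword, o) + (i - 1) * num_partitions + (j - 1)) % o

def layout(keyword: str, num_replicas: int, num_partitions: int, o: int):
    """Return the keyword's matrix of host indices, one inner list per row."""
    return [[host_assign(keyword, i, j, num_partitions, o)
             for j in range(1, num_partitions + 1)]
            for i in range(1, num_replicas + 1)]

grid = layout("example", num_replicas=2, num_partitions=3, o=10)
for row in grid:
    print(row)
```

Read row by row, the printed host indices are consecutive modulo o, which is the sequential embedding described above.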

With this instantiation of hostAssign, the question is how many component replicas will be assigned to a host. For Fully Adaptive Rows and Columns, the following simple theorem shows that there will be no large imbalance between any two hosts in the number of component replicas.

Theorem 1

Let s be the total number of component replicas created over all keywords k, formally

$$s = \sum_{k \in K} \mathrm{numPartitions}(k) \cdot \mathrm{numReplicas}(k).$$

Let o be the number of hosts, assume that hostAssign is defined as in the preceding paragraph, and assume that s = Ω(o). Then the maximum number of component replicas on any host h ∈ H is of order s/o, i.e. the maximum number of component replicas assigned to a host is of the order of the average number of component replicas assigned to a host.

Proof

Follows from bounds on balls into bins [MR95].
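The balance claim can be checked empirically with a small simulation. This is a sketch under assumptions: keywHash is replaced by a SHA-256 based stand-in, and the keyword shapes are random made-up data. It illustrates the claim and is not a proof.

```python
import hashlib
import random

def keyw_hash(keyword: str, o: int) -> int:
    """Hypothetical keywHash: map a keyword to a value in 1..o."""
    digest = hashlib.sha256(keyword.encode()).digest()
    return int.from_bytes(digest[:8], "big") % o + 1

def host_loads(shapes: dict, o: int) -> list:
    """Count component replicas per host under the row-by-row hostAssign layout."""
    loads = [0] * o
    for keyword, (num_replicas, num_partitions) in shapes.items():
        start = keyw_hash(keyword, o)
        # offset enumerates (i - 1) * numPartitions + (j - 1) over the whole matrix
        for offset in range(num_replicas * num_partitions):
            loads[(start + offset) % o] += 1
    return loads

rng = random.Random(0)  # fixed seed so the run is repeatable
o = 50
# keyword -> (numReplicas, numPartitions), drawn at random for illustration
shapes = {f"kw{n}": (rng.randint(1, 4), rng.randint(1, 6)) for n in range(5000)}
loads = host_loads(shapes, o)
s = sum(rows * cols for rows, cols in shapes.values())
mean_load = s / o
print(f"s={s}, mean={mean_load:.1f}, max={max(loads)}")
```

In this setup the most loaded host stays close to the average s/o.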

To retrieve the posting list of a keyword k when processing a search query, one host is selected from each of the numPartitions(k) "virtual columns" of k's matrix. Since each column offers numReplicas(k) hosts to choose from, the number of different ways of choosing this set is numReplicas(k)^numPartitions(k).
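A small enumeration makes the count concrete: each of the numPartitions(k) columns independently offers numReplicas(k) hosts, one per row. The helper name `host_choices` is hypothetical.

```python
from itertools import product

def host_choices(num_replicas: int, num_partitions: int):
    """Enumerate every way to pick one replica row for each partition column."""
    rows = range(1, num_replicas + 1)
    return list(product(rows, repeat=num_partitions))

# A keyword with 3 rows (replicas) and 2 columns (partitions).
choices = host_choices(num_replicas=3, num_partitions=2)
for choice in choices:
    print(choice)  # (row chosen for column 1, row chosen for column 2)
```

Each tuple names the row whose host serves the corresponding column, so the list length is the number of ways to serve one retrieval.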

Processing multi-keyword queries is much more expensive in Fully Adaptive Rows and Columns than in AllTheWeb Rows and Columns. For example, consider a query q = {k1, k2} where numPartitions(k1) ≠ numPartitions(k2). Since keywords k1 and k2 are partitioned differently, the posting list of, say, k1 must be repartitioned to match the partitioning of k2, which is an expensive operation. In addition, there is no guarantee that any components of k1 and k2 are co-located on the same host.

3. Two-Class Rows and Columns

A third instantiation of the structure realized by the method according to the present invention is a special case of Fully Adaptive Rows and Columns that results in much simpler (and cheaper) query processing. As in AllTheWeb Rows and Columns, it is assumed that the r · c hosts are arranged in the usual matrix of hosts.

For Two-Class Rows and Columns, keywords are classified along two axes. The first axis is the size of the posting list: keywords are divided into short and long keywords based on the size of their posting lists. The second axis is the arrival rate of the keywords in the query workload: keywords are divided into popular and unpopular keywords based on their arrival rate. This results in four different classes of keywords.

• Short, unpopular (SU) keywords. The posting list of an SU keyword k is not partitioned, and the minimum number of component replicas is created to achieve the desired fault tolerance level. For an SU keyword k, numPartitions(k) = 1 and numReplicas(k) = f.

• Long, unpopular (LU) keywords. The posting list of an LU keyword k is partitioned into c components, and f component replicas are created for each component to achieve fault tolerance. For an LU keyword k, numPartitions(k) = c and numReplicas(k) = f.

• Short, popular (SP) keywords. The posting list of an SP keyword k is not partitioned, and r component replicas of k's posting list are created to distribute k's arrival rate over the hosts. For an SP keyword k, numPartitions(k) = 1 and numReplicas(k) = r.

• Long, popular (LP) keywords. The posting list of an LP keyword k is partitioned into c components, and each component is replicated r times. For an LP keyword k, numPartitions(k) = c and numReplicas(k) = r.

The instantiation of the two functions numPartitions(·) and numReplicas(·) of the present structure for the different classes of keywords is shown in Table 2. Note that, in contrast to Fully Adaptive Rows and Columns, Two-Class Rows and Columns has only four different types of matrices, as shown in Fig. 6 for a fault tolerance level f = 2.

Table 2

Classification of keywords: functions (numPartitions(k), numReplicas(k))

Class                     numPartitions(k)   numReplicas(k)
Short, unpopular (SU)     1                  f
Long, unpopular (LU)      c                  f
Short, popular (SP)       1                  r
Long, popular (LP)        c                  r
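The four classes and their matrix shapes can be sketched as a lookup. The thresholds that separate short from long and unpopular from popular are not given in the text, so `length_threshold` and `rate_threshold` are hypothetical parameters, as are the names `Shape`, `classify` and `shape_for`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shape:
    num_partitions: int
    num_replicas: int

def classify(posting_list_len: int, arrival_rate: float,
             length_threshold: int, rate_threshold: float) -> str:
    """Return 'SU', 'LU', 'SP' or 'LP' (thresholds are assumed cut-offs)."""
    size = "L" if posting_list_len > length_threshold else "S"
    popularity = "P" if arrival_rate > rate_threshold else "U"
    return size + popularity

def shape_for(keyword_class: str, r: int, c: int, f: int) -> Shape:
    """Map a keyword class to (numPartitions, numReplicas) as listed above."""
    return {
        "SU": Shape(1, f),
        "LU": Shape(c, f),
        "SP": Shape(1, r),
        "LP": Shape(c, r),
    }[keyword_class]

cls = classify(posting_list_len=50_000, arrival_rate=3.0,
               length_threshold=10_000, rate_threshold=50.0)
print(cls, shape_for(cls, r=4, c=3, f=2))
```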

Let keywHash(·) be as defined in the discussion of Fully Adaptive Rows and Columns above. Let rowHash(·,·) be a function from K × {1, ..., f} to {1, ..., r} such that rowHash(k, i1) ≠ rowHash(k, i2) for distinct i1, i2 ∈ {1, ..., f}. How the component replicas of a keyword are assigned to hosts depends on the class of the keyword.

• For an SU keyword k_SU, define

$$\mathrm{hostAssign}(k_{SU}, i, 1) = \bigl(\mathrm{rowHash}(k_{SU}, i),\ (\mathrm{keywHash}(k_{SU}) \bmod c) + 1\bigr),$$

for i ∈ {1, ..., f}.

• For an LU keyword k_LU, define

$$\mathrm{hostAssign}(k_{LU}, i, j) = \bigl(\mathrm{rowHash}(k_{LU}, i),\ j\bigr),$$

for i ∈ {1, ..., f} and j ∈ {1, ..., c}.

• For an SP keyword k_SP, define

$$\mathrm{hostAssign}(k_{SP}, i, 1) = \bigl(i,\ (\mathrm{keywHash}(k_{SP}) \bmod c) + 1\bigr),$$

for i ∈ {1, ..., r}.

• For an LP keyword k_LP, define

$$\mathrm{hostAssign}(k_{LP}, i, j) = (i, j),$$

for i ∈ {1, ..., r} and j ∈ {1, ..., c}.
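The four class-specific assignments can be sketched together. The hash functions are stand-ins chosen for illustration: keywHash is SHA-256 reduced to 1..o with o = r · c assumed, and rowHash walks consecutive rows so that distinct replica indices map to distinct rows, which requires f ≤ r. The helper name `placements` is hypothetical.

```python
import hashlib

def _hash(text: str) -> int:
    """Deterministic non-negative integer hash of a string."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

def keyw_hash(keyword: str, o: int) -> int:
    """Hypothetical keywHash: keyword -> 1..o."""
    return _hash(keyword) % o + 1

def row_hash(keyword: str, i: int, r: int) -> int:
    """Hypothetical rowHash: distinct rows in 1..r for distinct i (needs f <= r)."""
    return (_hash("row:" + keyword) + (i - 1)) % r + 1

def placements(keyword: str, keyword_class: str, r: int, c: int, f: int):
    """Return the (row, column) host of every component replica of a keyword."""
    column = keyw_hash(keyword, r * c) % c + 1  # single column used by short keywords
    if keyword_class == "SU":
        return [(row_hash(keyword, i, r), column) for i in range(1, f + 1)]
    if keyword_class == "LU":
        return [(row_hash(keyword, i, r), j)
                for i in range(1, f + 1) for j in range(1, c + 1)]
    if keyword_class == "SP":
        return [(i, column) for i in range(1, r + 1)]
    if keyword_class == "LP":
        return [(i, j) for i in range(1, r + 1) for j in range(1, c + 1)]
    raise ValueError(f"unknown keyword class: {keyword_class}")

r, c, f = 4, 3, 2
for cls in ("SU", "LU", "SP", "LP"):
    print(cls, placements("example", cls, r, c, f))
```

Short keywords occupy a single column, unpopular keywords occupy only f rows, and an LP keyword covers the whole r × c matrix.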

As in the previous section, the following theorem can be proved.

Theorem 2

Let s be the total number of component replicas created over all keywords k (regardless of the class of k), formally

$$s = \sum_{k \in K} \mathrm{numPartitions}(k) \cdot \mathrm{numReplicas}(k).$$

Let o be the number of hosts and assume that hostAssign is defined as in the preceding paragraph. Then the maximum number of component replicas on any host h ∈ H is of order s/o, i.e. the maximum number of component replicas assigned to a host is of the order of the average number of component replicas assigned to a host.

Proof

Follows from bounds on balls into bins [MR95].

Given the function hostAssign(k, ·, ·) for the different classes of keywords, query processing in Two-Class Rows and Columns is not very complicated, and there are usually many possibilities for how a search query can be processed. Table 3 describes how a query q = {k1, k2} with two keywords can be processed; query processing for queries with more than two keywords is analogous.

Given this discussion, the set of possible sets of hosts to select for query processing follows directly.

Based on the structure provided by the method according to the present invention and discussed above, a few practical considerations are outlined in the following.

First, the method according to the present invention allows an extension of the search system's structures in which each keyword can have more than a single row and column instance. This is described immediately below.

In the embodiments above it has so far been assumed that each keyword k has two functions numPartitions(k) and numReplicas(k). For performance reasons, however, a keyword k can be partitioned in more than one way and may have a different number of replicas for each partitioning. For example, in Two-Class Rows and Columns the posting list of an SP keyword k_SP is replicated across a column. In addition to this replication, the posting list of k_SP can be partitioned across a row because k_SP often occurs together with an LP keyword k_LP, which is partitioned across all rows.

This extension can be characterized by associating sets of functions from the resulting structure with each keyword when using the method according to the present invention; for example, a keyword k could have two sets of functions {numPartitions_1(k), numReplicas_1(k)} and {numPartitions_2(k), numReplicas_2(k)}. The number of sets could be keyword-dependent. This greatly increases the number of options for query processing. However, this extension is not introduced formally here, since it is conceptually straightforward.

Second, those skilled in the art will appreciate that using the method according to the present invention to build structures for real search systems, including enterprise search systems, may allow various optimizations of these systems. Such optimizations take the method according to the present invention as their starting point, but their reduction to practice is considered to lie outside the scope of the present invention, and they are therefore not discussed further here.

The method according to the present invention realizes a structure for distributing the index of a search engine over several hosts in a computer cluster. The structures shown distinguish between three orthogonal mechanisms for distributing a search index: index partitioning, index replication, and assignment of replicas to hosts. Instantiations of these mechanisms give different ways of distributing the index of a search engine, including popular methods known from the literature and new methods that outperform the prior art in resource usage and performance while achieving the same level of fault tolerance.

Furthermore, the method according to the present invention recognizes for the first time that different keywords and different documents in a search engine can have different properties (such as length or access frequency). The structure realized by using the method according to the present invention configures the index of a search engine according to these properties. The structure also serves to outline how search queries are processed for the configuration space made possible by realizations of the structure. Instantiations of this structure moreover lead both to existing index configurations known in the art and to new index configurations that are not possible with the prior art.

Claims

1. Method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, where an indexing system in the search engine collects the stored documents and generates an index for them, where applying a user's search query to the index returns to the user a result set with at least some documents matching the search query, where the search engine comprises a group of search nodes located on one or more servers, and where the method is characterized by configuring the search engine's index on the basis of one or more document properties and at least one of a fault tolerance level, a desired search performance, document meta-properties and an optimal resource utilization; partitioning the index; replicating the index; distributing the thus partitioned and replicated index over the group of search nodes, so that the index partitions and the replicas thereof are assigned to said one or more servers on which the group of search nodes is located; and processing search queries on the basis of the distributed index.

2. Method according to claim 1, characterized by distributing the index so that the search query latency is below a user-specified latency limit.

3. Method according to claim 1, characterized by processing a search query with posting lists that have different classifications.

4. Method according to claim 3, characterized by classifying posting lists for search queries on the basis of length and popularity, the latter being determined by the user access pattern.

5. Method according to claim 4, where the group of search nodes comprises r rows and c columns, characterized by using a number of partitions different from the number of replicas for each query keyword, so that the number of rows and columns is different for each query keyword, and distributing the index taking into account either the size differences of the posting lists or the popularity differences of the posting lists, or both.

6. Method according to claim 4, where the group of search nodes comprises r rows and c columns, characterized by classifying a search query in two dimensions, a first dimension being the posting list size and a second dimension being the arrival rate of each query keyword, so that a query keyword is classified in the first dimension as short or long and in the second dimension as popular or unpopular, and distributing the index taking into account at least one of the size differences of the posting lists, the popularity differences of the posting lists, and the cost of processing a search query.

7. Method according to claim 3, characterized by dividing the posting lists of the search query into components to balance the query processing load between the search nodes.

8. Method according to claim 7, characterized by replicating posting list components so that identical replicas of them are created in order to increase the fault tolerance level.

9. Method according to claim 8, characterized by assigning the component replicas to the search nodes so as to balance the search query processing load between them.

10. Method according to claim 1, characterized by distributing the index of the search engine over a two-dimensional, linearly scalable group of search nodes, where the scaling itself is used to handle variations either in the data volume or in the frequency of search queries, or both.