WO2002021325A1 - Method and system for searching stored information on one or more computers - Google Patents

Method and system for searching stored information on one or more computers

Info

Publication number
WO2002021325A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
database
documents
list
document
Prior art date
Application number
PCT/AU2001/001111
Other languages
French (fr)
Inventor
Phillip André BERTOLUS
Timothy Grant Lewis
Original Assignee
Web Wombat Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AUPQ9868A (AUPQ986800A0)
Priority claimed from AUPR6308A (AUPR630801A0)
Application filed by Web Wombat Pty Ltd
Priority to AU2001287351A (AU2001287351A1)
Publication of WO2002021325A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Definitions

  • the present invention relates generally to a method and system for searching stored information on one or more computers and, more particularly, to a method and system for use in searching stored information on a networked system of computers, such as the Internet.
  • search engines to locate information on the Internet.
  • Such search engines generally perform incremental scans of the Internet to build substantial indexes that can later be searched in response to user queries.
  • Many of the largest search engines index hundreds of millions of web pages, comprising a comparable number of discrete terms.
  • the size and scope of these indexes continue to increase, to the extent that search queries for relatively common terms will now often return hundreds of thousands, if not millions, of results.
  • the ability of users to look at documents remains unchanged, and most will only look at the first ten or twenty results generated in response to their search query.
  • a basic index format provides record identifications ('document IDs') against particular index terms, and when a word query is entered by a user this is simply matched against the index terms to return a set of documents that have at least one occurrence of the query term.
  • the set can be ordered according to a conventional relevancy score based on, say, the combined frequency of occurrence of all query terms in a document, or the number of times any of the query terms occurs in a document.
  • the present invention provides in one aspect a computer search engine for searching information relating to a plurality of documents, configured to return, to a user submitting a search query comprising one or more search terms, a list of search results referencing a plurality of said documents, the list ordered in dependence on the proximity of at least one of said search terms to a beginning position of the respective documents.
  • said beginning position of a document, relative to which said proximity is measured, is offset from the actual beginning of that document.
  • the method may include the step of comparing a plurality of documents to determine whether they share a common initial content portion greater than a prescribed size, and using said common initial content portion in order to determine said offset.
  • a computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, comprising the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer.
  • a system for searching stored information representing searchable documents on one or more computers comprising: means for generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; means for receiving, from a remote computer, a search query relating to the stored information; means for using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; means for using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and means for sending the list of search results to the remote computer.
  • a method for searching stored information representing searchable documents on one or more computers comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value; and sending the list of search results to the remote computer.
  • Said hashing algorithm may be a 16-bit cyclic redundancy checking algorithm.
  • said step of adding each hash value to a hash table is carried out before said search query is received.
  • said stored information includes a respective document content abstract for each entry, each document content abstract being converted into the corresponding hash value.
  • a computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, the method comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value;
  • a system for searching stored information representing searchable documents on one or more computers comprising: means for compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; means for using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; means for adding each hash value to a hash table with a pointer to its corresponding entry in said database; means for receiving, from a remote computer, a search query relating to the stored information; means for searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; means for identifying, from said set of entries, entries in the hash table which have an identical hash value; means for generating a list of search results corresponding with said set of entries, means for omitting all but one of the entries corresponding to said duplicated hash value;
  • Figure 1 is an illustration of a computer network for practicing methods and systems consistent with the present invention
  • Figure 2 is a diagram illustrating the general process by which a search query is processed by a search engine consistent with the present invention
  • Figure 6 is a diagram illustrating one way in which the method can be used to generate multiple results lists in response to search queries consistent with the present invention
  • Figure 7 is a diagram illustrating one way in which multiple tiers of results may be used to meet the requirements of a search query consistent with the present invention
  • Figure 8 is a diagram depicting in detail a process by which the method of the invention meets a typical search query through the generation of multiple results lists consistent with the present invention
  • Figure 9 is a diagram depicting a process by which the method of the invention meets a typical search query through the use of page segments consistent with the present invention.
  • Figure 10 is a diagram illustrating a process by which a hash function transforms variable-length inputs in the form of document information into fixed-length outputs in the form of hash values consistent with the present invention.
  • One embodiment of the present invention is a method for processing information retrieved from computers connected to a communications network.
  • the basic arrangement of a computer network for practicing methods and systems consistent with the present invention is shown in Figure 1.
  • the computer network 100 includes a central computer 110, a remote computer 120, and a plurality of pages of information 140 which are to be searched, stored on computer system 130. All of the computers in Figure 1 are connected, either directly or indirectly, via a communications network 150.
  • Other embodiments of the present invention may involve more than one central computer 110, a plurality of remote computers 120, and a plurality of computer systems 130 on which pages of information 140 are stored.
  • the communications network 150 is the Internet, a Transmission Control Protocol/Internet Protocol ("TCP/IP") based network, and the computers are connected to communication network 150 using technology in common use.
  • communications network 150 can be any device or arrangement that allows the computers to communicate with each other, including a wireless communication network.
  • communications network 150 might take a different form for different pairs of computers.
  • central computer 110 might communicate with a computer system 130 via the Internet, and computer system 130 might communicate with remote computer 120 via a local area network.
  • FIG. 2 is a diagram illustrating the general process by which a search query is processed by a search engine consistent with the present invention.
  • a remote computer 200 submits a search query 210 to a central computer 220 on which a search engine program 230 resides in memory. That search query may consist of a single search term, a plurality of search terms, or any number of search terms and search operators which establish the parameters within which the search is to be conducted.
  • the central computer 220 then consults a database 240, which in turn contains a word list 250 and positional data 260 relating to each entry in word list 250.
  • Central computer 220 extracts from database 240 the relevant data 270 required to process search query 210.
  • Central computer 220 processes the retrieved data 270, ranks the results in order of relevance to search query 210, and sends the final search results 280 to remote computer 200.
  • the data 270 retrieved from database 240 is already ranked, at least to some extent. Such pre-calculation of rankings speeds up the query process.
  • the central computer 220 performs all the ranking calculations as the raw data 270 is retrieved from database 240.
  • Although components of the system are described as being stored in memory, it will be appreciated that at least some of these may instead be stored on or read from other computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or CD-ROM; a carrier wave from a network, such as the Internet; or other forms of RAM or ROM either currently known or later developed.
  • Although a number of the software components are described as being located on the same machine, one skilled in the art will appreciate that these components may be distributed over a plurality of machines.
  • word list 250 and positional data 260 are generated (automatically in most cases) typically by data-collection programs called "spiders."
  • In FIG. 3, the generation of a word list and positional data from a page of stored information is diagrammatically illustrated.
  • a given page of information 300 with docID 'Document 1' contains a string of four words 310, here represented as "a," "b," "c" and "d."
  • a data-collection agent such as a conventional indexing spider reads the information comprising page 300 and extracts data such as the title of the page, an abstract describing the content of the page, and a list of words contained on the page, along with information such as the position at which each word occurs.
  • index entries 320 comprising a word list 330 of the four characters "a," "b," "c," "d," with positional data indicating the position of each occurrence of each of the unique entries in the word list 330.
  • the positional data associated with each entry in word list 330 comprises a document list 340 and a word position list 350.
  • For each occurrence of each word on the page, the search engine will refer to word list 330, and if there is an entry for that word, the search engine will add to that entry a pointer to the location of this additional occurrence. If the search engine encounters a word which does not have an entry in word list 330, it will create an entry, and will add to that entry a pointer to the location at which the word occurred.
  • Document list 340 lists each document in which each item in word list 330 has occurred. For each entry in document list 340, there is a corresponding entry in word position list 350 which specifies the position(s) within that document at which the word in question occurred. There is a one-to-one correlation between the entries in document list 340 and word position list 350.
  • Data 320 is then sent to central computer 110 to be added into a larger index.
  • the spider program may in fact be running on the central computer itself.
  • a basic pre-ranking of the index entries can be carried out by the search engine independently of any search query, to speed up any subsequent search queries by performing part of the ranking in advance.
  • When a user subsequently submits a search query for the word "c," it is a trivial matter for the search engine to scan word list 330 for word "c" and, in this case, return a hit for document 1. According to the methodology of the present invention, the closer the word "c" appears to the beginning of the page, the higher that page will be ranked in the list of records returned to the user. If a user submits a search query comprising more than one word, such as the boolean query ("b" AND "c" AND "d"), the search engine scans word list 330 for each of the words "b," "c" and "d", then examines the positional data associated with each character to determine whether they occurred in the same document.
  • those instances can be differentiated by looking at further criteria such as the position of those words on each page.
  • the closer the search terms are to the top of an identified document, the higher the rank (presumed pertinence) of that document.
  • the number of occurrences of those words on each page can be applied to filter the results (the more occurrences, the higher the rank), as well as the proximity of those words to one another on each page (the closer they are, the higher the rank).
  • Figure 4 is a diagram showing in detail a preferred way in which basic positional data is recorded for each unique indexable item on a given page of stored information. As described in more detail below, this approach involves dividing a page into page segments for the purpose of recording positional data. Although a somewhat coarse indicator of position - as the positional data is not a unique indicator of position - this approach has been found by the present inventors to provide a very practicable relevance indicator, at least in respect of certain categories of searchable stored pages of information.
  • the first word on the page may be regarded as segment 1, the first five words as segment 2, the first ten words as segment 3, the first twenty-five words as segment 4 (respectively referenced as 400, 410, 420, 430), and so forth.
  • the positional information can instead be represented by a much smaller number of separate page segment codes.
  • each document needs only be nominally split into four such segments, allowing the information to be conveyed using just two bits of data (ie. 00, 01, 10 and 11).
  • When each indexed term is first added to the index, it comprises two lists. For each indexed term, one list 520 specifies the documents in which that term occurs, whilst a second list 530 specifies the position(s) within each of those documents at which that term occurs, with a correspondence between entries in the two lists.
  • the retrieval of search results can be considerably accelerated, as the search engine is able to retrieve the required information for a given search term 510 with a single scan of the condensed list 550, rather than having to scan document list 520 and word position list 530 separately.
  • This is implemented by simply appending to the document list a page segment code 540 (such as the two-bit code referred to above), effectively simplifying the word position list into page segments.
  • the page segment encoding of word position data is preferably generated when the data is first added to the index, but alternatively this may be done at the time a search is conducted.
  • this inefficiency is magnified in the case of many modern search engines, to which thousands of search queries may be submitted every minute.
  • this situation may be ameliorated by applying a "just in time" approach, producing only the level of results which are actually required by the user. This is accomplished by the ability to generate successive tiers of search results in response to a specific search query.
  • the first tier is generally the smallest tier of results, as it contains only those results which most closely match the search query (for example only those instances where the search term was found in the first page segment - the first word on the page).
  • Each subsequent tier includes results with a lower level of relevance than results in the preceding tiers.
  • the second tier may contain only those results where the search term was found in the second page segment - the first five words of the respective page, from which the results in the first tier are subtracted.
  • the number of such parallel tiers of results generated depends on the nature of the search terms and the requirements of the user submitting the query. Rather than automatically generating an exhaustive list of every occurrence of a search term in its database, then, this approach can be used to minimise the search time and the load on the computational resources of the search engine, by generating tiers of results as they are required, beginning with only the most relevant results and moving to decreasing levels of relevance if they are required.
  • search query 600 results initially in the generation of a first tier of results 610. Only on the specific request from the user (eg by clicking on a <NEXT RESULTS> command) is a second tier of results 620 generated. Subsequent tiers, with each one to a greater depth than the preceding tier, can then be selectively generated on further instruction received from remote computer 120.
  • the first tier 610 of returned results may therefore be a very small number of documents, in which the search term occurs at the very beginning of those documents.
  • the system may be configured only to generate further tiers of results if the user specifically requests them, or, alternatively, if there are no instances (or an insufficient number of instances) in which the search term occurs at the beginning of the respective pages.
  • This methodology - of generating a series of results to varying depths, as they are required - is illustrated in more detail in Figure 7.
  • a search for the word "Internet" first generates a tier of results 700, containing the instances in which the word "Internet" occurs as the first word on the page.
  • the search engine only delves deeper into the index database if an insufficient number of instances is identified.
  • the search engine moves to the next segment level and generates another tier of results 720, comprising a list of all instances where the word "Internet" was, for example, one of the first five words on the respective page.
  • In this second tier of results 720, those instances where "Internet" is the first word on the page are flagged as having already been retrieved as part of first tier 700, leaving only those instances where "Internet" is the second word on the page to be added to the results list 710.
  • The search engine repeats the process to a further level, generating another tier of results 730 in a similar manner.
  • the search engine displays the remaining four results from third tier 730, and then generates additional results from subsequent tiers. This process of generating tiers of responses can be performed in 'real time', as the results are required, as the process can be conducted entirely within the RAM (Random Access Memory) of the computer on which the search engine operates.
  • Figure 8 depicts a detailed example of the above in respect of a search query 800 for a single term.
  • Database 810 to be searched comprises a word list 820 and positional data 830.
  • the positional data 830 is represented here in the format (docID, word position), and it can be seen from the data contained in database 810 that the search engine has indexed three documents X, Y and Z, comprising a total of 14 occurrences of five discrete words (words 1-5).
  • the search engine scans index database 810, locates the entry for "word 3" in word list 820, and examines the positional data 830 associated with that word.
  • the search engine first retrieves all instances in which "word 3" occurs as the first word on the respective page (documents X and Z), and both these occurrences therefore constitute a first tier of results 850.
  • the search engine can then apply further intrinsic criteria, such as the number of times the search term appears on each page, along with extrinsic criteria such as the relative "popularity" of the page.
  • Some search engines use indicia such as the number of links to the page to help determine its ranking, with each external link in effect counting as a "vote" for that page. In this case, "word 3" appears twice on document Z, but only once on document X, and so, within the first tier of results, document Z would probably be ranked higher.
  • a second tier of results 860 can then be selectively generated, consisting of document Y, in which "word 3" occurred as the fourth word in the document, documents X and Z automatically being removed from the second tier, having already been retrieved in first tier 850.
  • the second tier may retrieve all instances where the search term occurred within the first five terms of the document.
  • the third tier may then retrieve all instances where the search term occurred within the first ten terms of the document, and so on.
  • Figure 9 depicts a process by which the invention meets a typical search query in the condensed index form, with the page segment codes appended to the document list to allow the search engine to retrieve all the necessary positional data with a single scan of a single list, through the use of page segments.
  • the database 910 to be searched comprises a word list 920 and positional data 930 relating to each entry in word list 920.
  • The search engine scans database 910, finds the entry for "word X" in word list 920, and examines positional data 930 associated with that word.
  • the search engine first retrieves all instances where "word X" occurs as the first word on the page (ie. in page segment 1), on the assumption that there is generally a direct correlation between the proximity of the search term to the beginning of the page and the relevance of that page to the search query.
  • a page segment code 940 indicates that the search term occurs in the first segment of document 5, the third segment of document 1, the fourth segment of document 2 and the fifth segment of document 3.
  • search results are therefore displayed to the user in this order, based on the assumption that documents with the search term occurring closer to the beginning of the page are likely to be more relevant. In other embodiments of the present invention, the proximity of the search term to the beginning of the page may be combined with or moderated by other indicia of relevance to determine the final ranking of search results.
  • the page segmented approach described above effectively includes the results associated with the first segment with those associated with the second, and so on (Figure 4). This can present a disadvantage from the point of view of index memory. However, when multiple search terms are included in a search query, and those terms are widely separated in a particular document, this approach helps address the potential problem that this type of relevance algorithm might otherwise produce.
  • An alternative mechanism for indexing the pages according to a page segmented approach is to exclude the terms in one portion from those in the others.
  • the first word on a page may be regarded as segment 1, the second to fifth words as segment 2, the sixth to tenth words as segment 3, the eleventh to twenty-fifth words as segment 4, and so forth.
  • the index database can be processed to simply replicate the entries of all indexed terms associated with a particular segment as entries of those terms but associated with all subsequent regions, which has the effect of copying all indexed terms from that particular segment into all subsequent segments. Again, the memory load associated with this technique can be offset by the practical advantages to which it gives rise.
  • a problem can arise if a plurality of pages open with a repeated component, such as a 'boilerplate portion', or other common text element.
  • Many websites now use template formats from within content management systems, which can result in the first portion of every webpage at that site being identical. This situation clearly has the potential to interfere with the relevance algorithm, by effectively offsetting the unique content further down the respective page, and thus out of proximity with the actual beginning of that page.
  • the present invention addresses this potential problem by employing a 'virtual' page beginning for operation of the relevance algorithm in respect of documents of this type, thus effectively ignoring such common portions.
  • all the pages of that site are first sorted into alphabetical order of the abstracts.
  • Figure 10 sets out a method used by the present invention to increase the relevance of the results, by eliminating duplicate records from the search results.
  • Each entry (or at least a selected field of that entry, generally the content abstract) in the search engine's list of documents 1000 is treated as if it were one large binary number.
  • a hash function C- ⁇ (1010) is applied to that value to produce a list of corresponding hash values 1020.
  • the list of hash values 1020, each with a pointer to the corresponding entry in the database, is stored in a RAM in the computer system on which the search engine operates.
  • Hash function 1010 is an algorithm that turns variable-sized inputs 1000 into fixed-sized outputs 1020.
  • the list of hash values 1020 therefore comprises fixed-length numerical representations of each document indexed by the search engine. In the list of hash values 1020, it is intended that each unique document will have a unique hash value.
  • Hash function 1010 may be the CRC-16 (16-bit Cyclic Redundancy Checking) algorithm, which is most commonly used to check for errors in data transmission.
  • the CRC-16 algorithm essentially treats each indexed document as a large binary number, which it divides by the polynomial x^16 + x^15 + x^2 + 1 to produce a hash value of fixed length.
  • Other embodiments of the present invention use different hash functions which also produce unique fixed-length numerical representations of each unique document that is indexed by the search engine.
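
The duplicate-suppression steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `crc16`, `build_hash_values` and `deduplicate`, the `abstract` field, and the sample entries are hypothetical. The CRC variant shown (zero initial value, bit-reflected form of the polynomial x^16 + x^15 + x^2 + 1, commonly known as CRC-16/ARC) is one common choice; the source does not say which variant is intended.

```python
def crc16(data: bytes) -> int:
    """CRC-16/ARC: polynomial x^16 + x^15 + x^2 + 1 (0x8005), bit-reflected as 0xA001."""
    crc = 0x0000
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def build_hash_values(entries):
    """Hash every entry's abstract before any query arrives.

    Assumption: the list index doubles as the 'pointer' back to the database entry.
    """
    return [crc16(entry["abstract"].encode("utf-8")) for entry in entries]


def deduplicate(matching_pointers, hash_values):
    """Keep only the first matching entry for each hash value."""
    seen = set()
    kept = []
    for pointer in matching_pointers:
        value = hash_values[pointer]
        if value not in seen:
            seen.add(value)
            kept.append(pointer)
    return kept


# Hypothetical entries: the first two are mirrors with identical abstracts.
entries = [
    {"url": "http://example.com/page", "abstract": "Guide to wombats"},
    {"url": "http://mirror.example.org/page", "abstract": "Guide to wombats"},
    {"url": "http://example.net/other", "abstract": "Guide to koalas"},
]
hash_values = build_hash_values(entries)
for pointer in deduplicate([0, 1, 2], hash_values):
    print(entries[pointer]["url"])
```

A 16-bit value has only 65,536 possible outcomes, so two unrelated abstracts can occasionally share a hash value; a sketch like this could compare the abstracts themselves before discarding an entry, or use a wider hash.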

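The 'virtual' page beginning described in the Definitions can be sketched in Python along the following lines. This is only an illustration under stated assumptions: the names `common_prefix_length` and `virtual_offsets` and the `min_shared_words` threshold are hypothetical, the sketch sorts on the page text itself rather than on a separate abstract, and comparing only neighbouring pages in the sorted order is an assumed way of finding a shared opening portion.

```python
def common_prefix_length(words_a, words_b):
    """Count how many leading words two pages have in common."""
    count = 0
    for a, b in zip(words_a, words_b):
        if a != b:
            break
        count += 1
    return count


def virtual_offsets(pages, min_shared_words=4):
    """Return {doc_id: offset in words} for the pages of one site.

    Pages are sorted alphabetically so that pages opening with the same
    boilerplate end up next to each other; a shared opening longer than
    min_shared_words (an assumed 'prescribed size') becomes the offset.
    """
    ordered = sorted((text.split(), doc_id) for doc_id, text in pages.items())
    offsets = {doc_id: 0 for doc_id in pages}
    for (words_a, id_a), (words_b, id_b) in zip(ordered, ordered[1:]):
        shared = common_prefix_length(words_a, words_b)
        if shared > min_shared_words:
            offsets[id_a] = max(offsets[id_a], shared)
            offsets[id_b] = max(offsets[id_b], shared)
    return offsets


# Hypothetical site: two pages share a six-word template header.
pages = {
    "a": "Acme Corp home products news contact widgets are on sale",
    "b": "Acme Corp home products news contact our staff directory",
    "c": "completely different page about gardening",
}
print(virtual_offsets(pages))
```

A ranking routine could then subtract a page's offset from each word position before measuring proximity to the beginning, so that the first word after the shared header counts as the first word of the page.
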
Abstract

The present invention relates generally to a method and system for searching stored information on one or more computers and, more particularly, to a method and system for use in searching stored information on a networked system of computers, such as the Internet. The method comprises the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer. The proximity may be defined in accordance with the precise position of a term in a document, or with a particular segment of the document. The search results may be generated from a series of successive sweeps of the index, the first such sweep providing only the most pertinent documents. Additional relevance algorithms may also be used, and an offset technique employed to set the beginning position of particular documents, to allow for documents which incorporate initial common portions. Additionally, the method involves suppression of duplicated documents found by a query submission, using a hash table containing hash values for index entries, and omitting from the results returned to the user all but one of the entries corresponding to a single hash value.
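
The ranking method summarised above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the patented implementation: `build_index`, `tiered_search`, the sample documents, and the final catch-all tier are hypothetical, and ordering within a tier by first position is an assumed tie-break. The cumulative segment boundaries (first word, first 5, first 10 and first 25 words) follow the example given in the Definitions.

```python
from collections import defaultdict

# Cumulative segment limits in words; the trailing infinity is an assumed
# catch-all so that every matching document is eventually reachable.
SEGMENT_LIMITS = [1, 5, 10, 25, float("inf")]


def build_index(documents):
    """Map each term to {doc_id: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def tiered_search(index, term, wanted):
    """Return up to `wanted` doc IDs, sweeping one segment (tier) at a time."""
    postings = index.get(term.lower(), {})
    first_hit = {doc_id: min(positions) for doc_id, positions in postings.items()}
    results = []
    for limit in SEGMENT_LIMITS:
        # Documents already returned by an earlier tier are skipped.
        tier = sorted(
            (position, doc_id)
            for doc_id, position in first_hit.items()
            if position <= limit and doc_id not in results
        )
        results.extend(doc_id for _, doc_id in tier)
        if len(results) >= wanted:
            break  # deeper tiers are generated only when more results are needed
    return results[:wanted]


# Hypothetical documents for demonstration.
documents = {
    "X": "internet news for today",
    "Y": "the latest internet news",
    "Z": "a guide to using the internet at home",
}
index = build_index(documents)
print(tiered_search(index, "Internet", wanted=2))
```

Stopping as soon as enough results have been collected mirrors the "just in time" idea: a query for a common term that is satisfied by the first tier never touches the deeper segments.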

Description

METHOD AND SYSTEM FOR SEARCHING STORED INFORMATION ON ONE OR MORE COMPUTERS
Background of the Invention
The present invention relates generally to a method and system for searching stored information on one or more computers and, more particularly, to a method and system for use in searching stored information on a networked system of computers, such as the Internet.
It has become increasingly common for computers to be networked as part of their everyday operation. In particular, millions of computers around the world are connected daily to the most well-known wide area network, the Internet.
A recent study (Steve Lawrence and C. Lee Giles, "Accessibility of information on the Web" (1999) 400 Nature 107) found that around 85% of Internet users use search engines to locate information on the Internet. Such search engines generally perform incremental scans of the Internet to build substantial indexes that can later be searched in response to user queries. Many of the largest search engines index hundreds of millions of web pages, comprising a comparable number of discrete terms. The size and scope of these indexes continue to increase, to the extent that search queries for relatively common terms will now often return hundreds of thousands, if not millions, of results. Despite this increase in the ability to generate results, the ability of users to look at documents remains unchanged, and most will only look at the first ten or twenty results generated in response to their search query. Even if the user were patient enough to sift through thousands of results, it is likely that only a small fraction would be relevant to their initial query.
In addition, it is now common practice to "mirror" heavily used web sites, i.e., to store the same content on a number of servers across a range of geographical locations. This spreads the access load across several servers instead of focusing it on one, and allows users to optimize their connection speed by selecting a server which is closer to them. One consequence of this practice, however, is that the results produced by a search engine will often contain duplicate entries. They will have different URLs (Universal or Uniform Resource Locators) but the content of the sites will be identical.
The main challenge for search engines is now a matter of how to present users with the most relevant results in a timely fashion — distilling the mass of information to ensure that the handful of results which the user actually examines are all relevant to their initial search query. In doing so, there is generally a trade-off between speed and quality of results. It would be possible for a computer to process search results for hours, applying a multitude of criteria to come up with incredibly pertinent results, but this is impractical for most users and search providers. High levels of relevancy are expected in a matter of seconds, while minimizing the load on the computational resources of the search engine.
Most search engines attempt this with their own ranking algorithms, which analyze search results based on a set of criteria and provide those results to users in order of ranking. The intention is that the highest ranked results are those which most closely match the user's search query. As mentioned above, this process generally involves a balance being struck between accuracy and speed. This balancing act occurs within the parameters set by the computational resources available, for example a faster or more efficient algorithm may allow room for greater accuracy, as more computation may occur within a given amount of time. The indicia used to infer relevance are an integral part of any ranking algorithm, and until computers are capable of true semantic analysis of stored information, it is necessary to devise easily computable yet accurate means for approximating relevance. A basic index format provides record identifications ('document IDs') against particular index terms, and when a word query is entered by a user this is simply matched against the index terms to return a set of documents that have at least one occurrence of the query term. The set can be ordered according to a conventional relevancy score based on, say, the combined frequency of occurrence of all query terms in a document, or the number of times any of the query terms occurs in a document.
In addition, it is known to utilise the relative proximity of search terms in individual documents in order to order or to filter the set of located documents, and a revised relevancy ranking can thus be generated, based on the assumption that the closer the proximity of multiple search terms in a document, the higher the relevance of that document to the user query.
There still exists a need for a system that efficiently searches stored information on one or more computers, such as on the Internet, to produce meaningful and relevant results in a timely manner while minimizing the load on the computational resources of the search engine by using simple yet accurate indicia of relevance.
Summary of the invention
In broad terms, the present invention provides in one aspect a computer search engine for searching information relating to a plurality of documents, configured to return, to a user submitting a search query comprising one or more search terms, a list of search results referencing a plurality of said documents, the list ordered in dependence on the proximity of at least one of said search terms to a beginning position of the respective documents.
According to the present invention in a further aspect, there is provided a method for searching stored information representing searchable documents on one or more computers, comprising the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer.
Said stored information may be stored as a plurality of webpages on a network of computers such as the Internet, each document being one webpage.
Preferably, the search query comprises a plurality of search terms and at least one operator, and the method includes the step of parsing the search query into search terms and operator(s) relating to the search terms.
In a preferred form, the search query also includes one or more search operators which alter the parameters of the search to be conducted by the central computer.
According to a preferred embodiment, the list of search results is arranged in order of relevance, a closer proximity of the at least one term from a beginning position of each respective document associated with a higher relevance for that document. Said positional data representing the position of respective terms in respective documents may be arranged to provide, in each case, a non-unique indicator of the proximity of the at least one term from a beginning position of each respective document. In one embodiment, the method includes the step of allocating a first proximity code in respect of terms within a predetermined distance A from said beginning position, and a second proximity code in respect of terms within a distance B from said beginning position, wherein B>A, and wherein said list of search results is arranged in order of relevance, said first proximity code associated with a higher relevance than said second proximity code. Said step of generating a list of search results may include generating a first tier of results, sending said first tier of results to the remote computer, wherein, in response to a predetermined instruction from the remote computer, an additional tier of search results is generated, extending to a lower level of relevance to the search query than the preceding tier, said level of relevance being determined in accordance with the proximity of the at least one term from the beginning position of each respective document, wherein the additional tier of search results is then sent to the remote computer. In accordance with this last form of the invention, a plurality of additional tiers may be successively generated in response to subsequent predetermined instructions, each to a lower level of relevance to the search query than the preceding tier. Preferably, said beginning position of a document, relative to which said proximity is measured, is offset from the actual beginning of that document.
To this end, the method may include the step of comparing a plurality of documents to determine whether they share a common initial content portion greater than a prescribed size, and using said common initial content portion in order to determine said offset. According to the invention in a further aspect, there is provided a computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, comprising the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer.
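The offset step described at the start of the preceding paragraph might be sketched as follows. This is only an illustrative sketch under stated assumptions: documents are treated as whitespace-separated words, the function names `shared_prefix_length` and `beginning_offset` and the threshold `min_shared` are hypothetical, and the specification does not prescribe how the comparison is performed.

```python
def shared_prefix_length(docs):
    """Count the leading words that every document in `docs` has in common."""
    word_lists = [d.split() for d in docs]
    count = 0
    # zip stops at the shortest document, so no index can run past its end
    for column in zip(*word_lists):
        if len(set(column)) != 1:
            break
        count += 1
    return count


def beginning_offset(docs, min_shared=3):
    """Return the offset (in words) for the beginning position.

    Assumption: the offset is applied only when the common initial portion
    is greater than a prescribed size (`min_shared`, a hypothetical default).
    """
    shared = shared_prefix_length(docs)
    return shared if shared > min_shared else 0


pages = [
    "Acme Corp home page site news today",
    "Acme Corp home page site contact us",
]
offset = beginning_offset(pages)  # five shared leading words
```

Under this sketch, proximity for each of these pages would be measured from the sixth word rather than from the first.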
According to the invention in a further aspect, there is provided a system for searching stored information representing searchable documents on one or more computers, comprising: means for generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; means for receiving, from a remote computer, a search query relating to the stored information; means for using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; means for using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and means for sending the list of search results to the remote computer.
The present invention therefore provides an improved method enabling the indexing and searching of information stored on a computer or network of computers, such as the Internet. In one form, this can be employed to minimise the load on the computational resources of the search engine by using page sectors as more efficient indicia of the relevance of 'words' on each discrete page of stored information. This allows faster searches to be conducted. In one form of the present invention, the method can be employed to avoid the automatic generation of an exhaustive list of every occurrence of a search term in its database. This can serve to minimise the search time and the load on the computational and memory resources of the search engine, as tiers of results are generated only as they are required, beginning with only the most relevant results and moving to decreasing levels of relevance if they are required.
According to the present invention in a further aspect, there is provided a computer search engine for searching information relating to a plurality of documents, configured to return, to a user submitting a search query comprising one or more search terms, a list of search results referencing a plurality of said documents, the list omitting references to more than one document if a number of such documents contain identical content.
According to the present invention in a further aspect, there is provided a method for searching stored information representing searchable documents on one or more computers, comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value; and sending the list of search results to the remote computer.
Said hashing algorithm may be a 16-bit cyclic redundancy checking algorithm. Preferably, said step of adding each hash value to a hash table is carried out before said search query is received. In a preferred form, said stored information includes a respective document content abstract for each entry, each document content abstract being converted into the corresponding hash value. According to the invention in a further aspect, there is provided a computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, the method comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value; and sending the list of search results to the remote computer.
According to the invention in a further aspect, there is provided a system for searching stored information representing searchable documents on one or more computers, comprising: means for compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; means for using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; means for adding each hash value to a hash table with a pointer to its corresponding entry in said database; means for receiving, from a remote computer, a search query relating to the stored information; means for searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; means for identifying, from said set of entries, entries in the hash table which have an identical hash value; means for generating a list of search results corresponding with said set of entries, means for omitting all but one of the entries corresponding to said duplicated hash value; and means for sending the list of search results to the remote computer. The invention therefore provides the considerable practical advantage that documents with identical content, retrieved by the operation of the search engine, although indexed by the spider application, are not returned to the user. This avoids unnecessary duplication of entries in the list of results returned from a search query.
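The duplicate-suppression steps set out above can be illustrated with a short sketch. It is offered as one possible reading rather than the implementation of the invention: the specification says only that the hashing algorithm may be a 16-bit cyclic redundancy check, so the use of Python's `binascii.crc_hqx` (the CRC-CCITT variant) is an assumption, as are the function names, the use of URLs as entry identifiers, and the choice to key the table by entry so that the hash of a matching entry can be looked up directly.

```python
import binascii


def abstract_hash(abstract):
    """Convert a document abstract into a fixed-length 16-bit hash value.

    Assumption: CRC-CCITT via binascii.crc_hqx stands in for the unspecified
    16-bit cyclic redundancy checking algorithm.
    """
    return binascii.crc_hqx(abstract.encode("utf-8"), 0)


def build_hash_table(entries):
    """Precompute hash values before any query arrives.

    `entries` maps an entry identifier (here a URL) to its abstract.
    Returns a dict from entry identifier to hash value.
    """
    return {entry_id: abstract_hash(text) for entry_id, text in entries.items()}


def suppress_duplicates(matching_ids, hash_table):
    """Keep the first matching entry for each hash value and omit the rest."""
    seen = set()
    results = []
    for entry_id in matching_ids:
        value = hash_table[entry_id]
        if value in seen:
            continue  # same hash value as an entry already listed
        seen.add(value)
        results.append(entry_id)
    return results


entries = {
    "http://a.example/page": "Guide to wombats of Victoria",
    "http://mirror.example/page": "Guide to wombats of Victoria",
    "http://b.example/other": "Train timetables",
}
table = build_hash_table(entries)
results = suppress_duplicates(list(entries), table)
```

A 16-bit value can take only 65,536 distinct values, so two different abstracts can share a hash value; this sketch does not attempt to handle that case.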
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an exemplary and non-limiting implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention. In the drawings:
Figure 1 is an illustration of a computer network for practicing methods and systems consistent with the present invention;
Figure 2 is a diagram illustrating the general process by which a search query is processed by a search engine consistent with the present invention;
Figure 3 is a diagram illustrating the generation of a word list and positional data from a page of stored information in accordance with the present invention;
Figure 4 is a diagram illustrating in detail one way in which a given page of stored information is divided into page segments for the purpose of recording positional data consistent with the present invention;
Figure 5 is a diagram illustrating one way in which the positional data for each indexed word is simplified using page segment information consistent with the present invention;
Figure 6 is a diagram illustrating one way in which the method can be used to generate multiple results lists in response to search queries consistent with the present invention;
Figure 7 is a diagram illustrating one way in which multiple tiers of results may be used to meet the requirements of a search query consistent with the present invention;
Figure 8 is a diagram depicting in detail a process by which the method of the invention meets a typical search query through the generation of multiple results lists consistent with the present invention;
Figure 9 is a diagram depicting a process by which the method of the invention meets a typical search query through the use of page segments consistent with the present invention; and
Figure 10 is a diagram illustrating a process by which a hash function transforms variable-length inputs in the form of document information into fixed-length outputs in the form of hash values consistent with the present invention.
Detailed description of the drawings
The following detailed description of the invention refers to the accompanying drawings. Although the description includes exemplary implementations, other implementations are possible, and changes may be made to the implementations described without departing from the spirit and scope of the invention. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. Wherever possible, the same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts.
One embodiment of the present invention is a method for processing information retrieved from computers connected to a communications network. The basic arrangement of a computer network for practicing methods and systems consistent with the present invention is shown in Figure 1. The computer network 100 includes a central computer 110, a remote computer 120, and a plurality of pages of information 140 which are to be searched, stored on computer system 130. All of the computers in Figure 1 are connected, either directly or indirectly, via a communications network 150. Other embodiments of the present invention may involve more than one central computer 110, a plurality of remote computers 120, and a plurality of computer systems 130 on which pages of information 140 are stored.
Typically, the communications network 150 is the Internet, a Transmission Control Protocol/Internet Protocol ("TCP/IP") based network, and the computers are connected to communication network 150 using technology in common use. However, as will be understood by the skilled reader, communications network 150 can be any device or arrangement that allows the computers to communicate with each other, including a wireless communication network. Further, communications network 150 might take a different form for different pairs of computers. For example, central computer 110 might communicate with a computer system 130 via the Internet, and computer system 130 might communicate with remote computer 120 via a local area network.
Figure 2 is a diagram illustrating the general process by which a search query is processed by a search engine consistent with the present invention. A remote computer 200 submits a search query 210 to a central computer 220 on which a search engine program 230 resides in memory. That search query may consist of a single search term, a plurality of search terms, or any number of search terms and search operators which establish the parameters within which the search is to be conducted. The central computer 220 then consults a database 240, which in turn contains a word list 250 and positional data 260 relating to each entry in word list 250. Central computer 220 extracts from database 240 the relevant data 270 required to process search query 210. Central computer 220 processes the retrieved data 270, ranks the results in order of relevance to search query 210, and sends the final search results 280 to remote computer 200. In one embodiment of the present invention, the data 270 retrieved from database 240 is already ranked, at least to some extent. Such pre-calculation of rankings speeds up the query process. In other embodiments, the central computer 220 performs all the ranking calculations as the raw data 270 is retrieved from database 240.
Although components of the system are described as being stored in memory, it will be appreciated that at least some of these may instead be stored on or read from other computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or CD-ROM; a carrier wave from a network, such as the Internet; or other forms of RAM or ROM either currently known or later developed. Additionally, although a number of the software components are described as being located on the same machine, one skilled in the art will appreciate that these components may be distributed over a plurality of machines.
In a manner generally known, word list 250 and positional data 260, of which database 240 is at least partly composed, are generated (automatically in most cases) typically by data-collection programs called "spiders."
In Figure 3, the generation of a word list and positional data from a page of stored information is diagrammatically illustrated. A given page of information 300 with docID 'Document 1' contains a string of four words 310, here represented as "a," "b," "c" and "d." A data-collection agent such as a conventional indexing spider reads the information comprising page 300 and extracts data such as the title of the page, an abstract describing the content of the page, and a list of words contained on the page, along with information such as the position at which each word occurs. In the example shown, from the string of four words contained on page 300, a typical spider would produce index entries 320 comprising a word list 330 of the four characters "a," "b," "c," "d," with positional data indicating the position of each occurrence of each of the unique entries in the word list 330. The positional data associated with each entry in word list 330 comprises a document list 340 and a word position list 350.
For each occurrence of each word on the page, the search engine will refer to word list 330, and if there is an entry for that word, the search engine will add to that entry a pointer to the location of this additional occurrence. If the search engine encounters a word which does not have an entry in word list 330, it will create an entry, and will add to that entry a pointer to the location at which the word occurred.
Document list 340 lists each document in which each item in word list 330 has occurred. For each entry in document list 340, there is a corresponding entry in word position list 350 which specifies the position(s) within that document at which the word in question occurred. There is a one-to-one correlation between the entries in document list 340 and word position list 350.
Data 320 is then sent to central computer 110 to be added into a larger index. Alternatively, the spider program may in fact be running on the central computer itself.
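The index-building steps described for Figure 3 might look like the following minimal sketch. The nested-dictionary layout, the function name `index_page`, the whitespace tokenisation and the 1-based word positions are all assumptions made for illustration; the specification describes a word list with an associated document list and word position list but does not fix a storage format.

```python
from collections import defaultdict


def index_page(index, doc_id, text):
    """Add every word occurrence on a page to the index.

    `index` maps word -> {doc_id -> [positions]}. The outer keys play the
    role of the word list, the inner keys the document list, and the inner
    values the word position list (positions are 1-based, an assumption).
    """
    for position, word in enumerate(text.split(), start=1):
        # A missing word or document entry is created on first use;
        # otherwise the new occurrence is appended to the existing entry.
        index[word][doc_id].append(position)


index = defaultdict(lambda: defaultdict(list))
index_page(index, "Document 1", "a b c d")
index_page(index, "Document 2", "c a c")
```

Because each document key maps directly to its own position list, the one-to-one correspondence between document list and word position list holds by construction in this layout.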
As mentioned above, a basic pre-ranking of the index entries can be carried out by the search engine independently of any search query, to speed up any subsequent search queries by performing part of the ranking in advance.
When a user subsequently submits a search query for the word "c," it is a trivial matter for the search engine to scan word list 330 for word "c" and, in this case, return a hit for document 1. According to the methodology of the present invention, the closer the word "c" appears to the beginning of the page, the higher that page will be ranked in the list of records returned to the user. If a user submits a search query comprising more than one word, such as the boolean query ("b" AND "c" AND "d"), the search engine scans word list 330 for each of the words "b," "c" and "d", then examines the positional data associated with each character to determine whether they occurred in the same document. If more than one instance is found where all of the words in the search query occur on the same page, those instances can be differentiated by looking at further criteria such as the position of those words on each page. Once again, in accordance with the methodology of the present invention, the closer the search terms are to the top of a document identified, the higher the rank (presumed pertinence) of that document. In addition, the number of occurrences of those words on each page can be applied to filter the results (the more occurrences, the higher the rank), as well as the proximity of those words to one another on each page (the closer they are, the higher the rank).
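A rough sketch of ranking by closeness to the top of the page, for a boolean AND query, is given below. The scoring rule (the sum of each term's earliest position, lower being better) is an assumption chosen for simplicity; the specification states only that terms nearer the beginning yield a higher rank, and the additional criteria it mentions (occurrence counts and mutual proximity) are not modelled here. The function name and sample index are hypothetical.

```python
def rank_by_first_position(index, terms):
    """Return documents containing all `terms`, best ranked first.

    `index` maps word -> {doc_id -> [1-based positions]}.
    """
    doc_sets = [set(index.get(term, {})) for term in terms]
    if not doc_sets:
        return []
    # Boolean AND: keep only documents in which every term occurs
    common = set.intersection(*doc_sets)

    def score(doc):
        # Assumed score: sum of the earliest position of each term
        return sum(min(index[term][doc]) for term in terms)

    # Ties are broken by document identifier to keep the order stable
    return sorted(common, key=lambda doc: (score(doc), doc))


sample_index = {
    "b": {"D1": [2], "D2": [10]},
    "c": {"D1": [3], "D2": [1], "D3": [1]},
}
ranked = rank_by_first_position(sample_index, ["b", "c"])
```

Here "D3" is excluded because it lacks the word "b", and "D1" outranks "D2" because its terms sit nearer the top of the page.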
Figure 4 is a diagram showing in detail a preferred way in which basic positional data is recorded for each unique indexable item on a given page of stored information. As described in more detail below, this approach involves dividing a page into page segments for the purpose of recording positional data. Although a somewhat coarse indicator of position - as the positional data is not a unique indicator of position - this approach has been found by the present inventors to provide a very practicable relevance indicator, at least in respect of certain categories of searchable stored pages of information.
As Figure 4 illustrates, the first word on the page may be regarded as segment 1, the first five words as segment 2, the first ten words as segment 3, the first twenty-five words as segment 4 (respectively referenced as 400, 410, 420, 430), and so forth. If there are, say, fifty words on a given page of information, then instead of requiring fifty discrete indicators of word position, the positional information can instead be represented by a much smaller number of separate page segment codes. In one form of the invention found to be very efficient in many applications, each document need only be nominally split into four such segments, allowing the information to be conveyed using just two bits of data (i.e. 00, 01, 10 and 11). While some accuracy may be lost in such a simplification process, the reduction in storage space and retrieval times as a result can be sufficiently significant (in a database with hundreds of millions of entries) to more than compensate.
In Figure 5, the use of the page segmented indexing approach to a search activity is illustrated by way of contrast with the use of specific positional data for each indexed term (word list 330 in Figure 3). When positional data in relation to each occurrence of each indexed term is first added to the index, it comprises two lists. For each indexed term, one list 520 specifies the documents in which that term occurs, whilst a second list 530 specifies the position(s) within each of those documents at which that term occurs, with a correspondence between entries in the two lists. By condensing this positional information into a single list 550, the retrieval of search results can be considerably accelerated, as the search engine is able to retrieve the required information for a given search term 510 with a single scan of the condensed list 550, rather than having to scan document list 520 and word position list 530 separately.
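By way of illustration only, the two-bit page segment code and the condensed list might be sketched as below. Several details are assumptions: the specification does not state where the boundaries fall when only four segments are used, so this sketch places them after words 1, 5 and 10 and assigns everything later to the fourth segment; it also encodes only the earliest occurrence of a term in each document. The names `SEGMENT_LIMITS`, `segment_code` and `condense` are hypothetical.

```python
from bisect import bisect_left

# Assumed upper bounds (1-based word positions) of the first three segments;
# any later position falls into the fourth and last segment.
SEGMENT_LIMITS = (1, 5, 10)


def segment_code(position):
    """Map a 1-based word position to a two-bit segment code (0 to 3)."""
    return bisect_left(SEGMENT_LIMITS, position)


def condense(doc_list, position_list):
    """Merge the parallel document and word position lists for one term
    into a single list of (doc_id, segment_code) pairs."""
    return [
        (doc, segment_code(min(positions)))
        for doc, positions in zip(doc_list, position_list)
    ]


condensed = condense(["Document 1", "Document 2"], [[3, 40], [12]])
bits = format(segment_code(50), "02b")  # two-bit rendering of the code
```

A single pass over `condensed` then yields both the document and its coarse position, which is the saving the single-list arrangement is intended to provide.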
The condensed list is implemented by simply appending to the document list a page segment code 540 (such as the two-bit code referred to above), effectively simplifying the word position list into page segments. The page segment encoding of word position data is preferably generated when the data is first added to the index, but alternatively this may be done at the time a search is conducted. It is to be understood that this approach of page segmented indexing is only one method of recording positional data associated with each word and separating pages of stored information into page segments, and other methodologies may be alternatively or additionally employed.
In a typical search engine, whenever a search query is submitted, the search engine performs an exhaustive search to determine every instance where entries matching the search terms occur within its database. Even if the user who submitted the search query only looks, for example, at the first ten results, in most cases the search engine will nonetheless have retrieved many additional results which are not viewed and which have therefore imposed an unnecessary burden on the available computational resources of the search engine. In some applications this can tend to waste computational resources, as only a very small fraction of those search results will actually be examined by the user. This inefficiency is magnified in the case of many modern search engines, to which thousands of search queries may be submitted every minute. In one aspect of the present invention, this situation may be ameliorated by applying a "just in time" approach, producing only the level of results which are actually required by the user.
This is accomplished by the ability to generate successive tiers of search results in response to a specific search query. The first tier is generally the smallest tier of results, as it contains only those results which most closely match the search query (for example only those instances where the search term was found in the first page segment - the first word on the page). Each subsequent tier includes results with a lower level of relevance than results in the preceding tiers. For example, the second tier may contain only those results where the search term was found in the second page segment - the first five words of the respective page, from which the results in the first tier are subtracted. The number of such parallel tiers of results generated depends on the nature of the search terms and the requirements of the user submitting the query. Rather than automatically generating an exhaustive list of every occurrence of a search term in its database, then, this approach can be used to minimise the search time and the load on the computational resources of the search engine, by generating tiers of results as they are required, beginning with only the most relevant results and moving to decreasing levels of relevance if they are required.
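The "just in time" generation of tiers could be sketched with a Python generator, which performs the work for a tier only when that tier is requested. This is a hedged illustration, not the patented implementation: the input format (a condensed list of `(doc_id, segment_code)` pairs for one search term), the function name `tiers` and the four-segment limit are assumptions.

```python
def tiers(condensed, max_code=3):
    """Yield successive tiers of results, most relevant first.

    Tier k holds the documents whose segment code is at most k, minus
    every document already delivered in an earlier tier.
    """
    delivered = set()
    for code in range(max_code + 1):
        tier = [
            doc for doc, doc_code in condensed
            if doc_code <= code and doc not in delivered
        ]
        delivered.update(tier)
        yield tier  # nothing deeper is computed until the caller asks


postings = [("A", 0), ("B", 1), ("C", 0), ("D", 3)]
tier_source = tiers(postings)
first_tier = next(tier_source)   # only the first sweep has run so far
second_tier = next(tier_source)  # runs only because it was requested
```

If the user never asks for more than the first tier, the deeper sweeps in this sketch are simply never executed.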
An implementation of this process is diagrammatically illustrated in Figure 6, in which search query 600 results initially in the generation of a first tier of results 610. Only on the specific request from the user (eg by clicking on a <NEXT RESULTS> command) is a second tier of results 620 generated. Subsequent tiers, with each one to a greater depth than the preceding tier, can then be selectively generated on further instruction received from remote computer 120. The first tier 610 of returned results may therefore be a very small number of documents, in which the search term occurs at the very beginning of those documents. The system may be configured only to generate further tiers of results if the user specifically requests them, or, alternatively, if there are no instances (or an insufficient number of instances) in which the search term occurs at the beginning of the respective pages.
This methodology - of generating a series of results to varying depths, as they are required - is illustrated in more detail in Figure 7. For example, a search for the word "Internet" first generates a tier of results 700, containing the instances in which the word "Internet" occurs as the first word on the page. The search engine only delves deeper into the index database if an insufficient number of instances is identified. If, for example, the user had instructed the search engine to display the top ten results 710, and there are only five instances 700 in which the word "Internet" occurred as the first word on the page, then, to fill the remaining five positions in the first tier list, the search engine moves to the next segment level and generates another tier of results 720, comprising a list of all instances where the word "Internet" was, for example, one of the first five words on the respective page. Of this second tier of results 720, those instances where "Internet" is the first word on the page are flagged as having already been retrieved as part of first tier 700, leaving only those instances where "Internet" is the second word on the page to be added to the results list 710. If, for example, there are seven instances in which "Internet" was one of the first five words on the page, five of those are flagged as having already been retrieved as part of first tier of results 700, leaving two new results to be displayed. To fill the remaining three positions available, the search engine repeats the process to a further level, generating another tier of results 730 in a similar manner.
If the user then chooses to look at the next ten results, the search engine displays the remaining four results from third tier 730, and then generates additional results from subsequent tiers. This process of generating tiers of responses can be performed in 'real time', as the results are required, as the process can be conducted entirely within the RAM (Random Access Memory) of the computer on which the search engine operates.
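The on-demand tier generation walked through for Figure 7 can be sketched roughly as follows. This is a minimal illustration and not the implementation disclosed here: the function names `tiers` and `next_page`, the in-memory dictionary of first positions, the document IDs and the depth thresholds (1, 5, 10, 25) are all assumptions chosen to mirror the numbers in the walk-through.

```python
from typing import Dict, Iterator, List, Sequence


def tiers(first_pos: Dict[str, int], depths: Sequence[int]) -> Iterator[List[str]]:
    """Lazily yield one tier of document IDs per depth.

    first_pos maps a document ID to the 1-based position of the first
    occurrence of the search term. Documents returned in an earlier tier
    are flagged in `seen` and left out of later tiers.
    """
    seen = set()
    for depth in depths:
        tier = sorted(
            (doc for doc, pos in first_pos.items() if pos <= depth and doc not in seen),
            key=lambda doc: (first_pos[doc], doc),
        )
        seen.update(tier)
        yield tier


def next_page(tier_iter: Iterator[List[str]], carry: List[str], size: int) -> List[str]:
    """Fill one page of results, generating a deeper tier only when needed."""
    page: List[str] = []
    while len(page) < size:
        if carry:
            page.append(carry.pop(0))
        else:
            tier = next(tier_iter, None)
            if tier is None:  # no deeper tiers left
                break
            carry.extend(tier)
    return page


# Invented data shaped like the Figure 7 walk-through (hypothetical IDs).
first_pos = {f"a{i}": 1 for i in range(5)}                # term is the first word
first_pos.update({"b0": 2, "b1": 4})                      # within the first five words
first_pos.update({f"c{i}": 6 + i % 5 for i in range(7)})  # within the first ten words
first_pos["d0"] = 20                                      # deeper in the page

tier_iter = tiers(first_pos, depths=(1, 5, 10, 25))
carry: List[str] = []
page1 = next_page(tier_iter, carry, 10)  # 5 + 2 + 3 results
page2 = next_page(tier_iter, carry, 10)  # the remaining 4, then the next tier
```

Because `tiers` is a generator, the sweep for a deeper tier only runs when a page cannot be filled from results already held, which is the behaviour the walk-through describes.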
Figure 8 depicts a detailed example of the above in respect of a search query 800 for a single term. Database 810 to be searched comprises a word list 820 and positional data 830. The positional data 830 is represented here in the format (docID, word position), and it can be seen from the data contained in database 810 that the search engine has indexed three documents X, Y and Z, comprising a total of 14 occurrences of five discrete words (words 1-5). When query 800 is submitted for the search term "word 3," the search engine scans index database 810, locates the entry for "word 3" in word list 820, and examines the positional data 830 associated with that word. The search engine first retrieves all instances in which "word 3" occurs as the first word on the respective page (documents X and Z), and both these occurrences therefore constitute a first tier of results 850. To differentiate between document X and document Z for ranking purposes, the search engine can then apply further intrinsic criteria, such as the number of times the search term appears on each page, along with extrinsic criteria such as the relative "popularity" of the page. Some search engines use indicia such as the number of links to the page to help determine its ranking, with each external link in effect counting as a "vote" for that page. In this case, "word 3" appears twice on document Z, but only once on document X, and so, within the first tier of results document Z would probably be ranked higher.
A second tier of results 860 can then be selectively generated, consisting of document Y, in which "word 3" occurred as the fourth word in the document, documents X and Z automatically being removed from the second tier, having already been retrieved in first tier 850. There may of course have been a sufficient number of results retrieved in the first tier of results such that subsequent tiers are unnecessary. Alternatively, there may be no instances where the search term has occurred near the beginning of the document, requiring the search engine to look to multiple tiers before any results are generated.
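The Figure 8 example, including the tie-break inside the first tier, might be reproduced with a sketch along these lines. The postings for "word 3" follow the description, except that the position of the second occurrence in document Z is not stated in the text and is assumed here; the function name `tiered_results` and the depth thresholds are likewise illustrative, and only the intrinsic criterion (occurrence count) is modelled.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Posting = Tuple[str, int]  # (docID, word position), the format used in Figure 8

index: Dict[str, List[Posting]] = {
    # ("Z", 3) is an assumed position for the second occurrence in document Z.
    "word 3": [("X", 1), ("Y", 4), ("Z", 1), ("Z", 3)],
}


def tiered_results(term: str, depths: Sequence[int] = (1, 5)) -> List[List[str]]:
    """Return one list of docIDs per tier, ranked within each tier by count."""
    postings = index.get(term, [])
    counts = Counter(doc for doc, _ in postings)

    # Earliest position of the term in each document.
    first: Dict[str, int] = {}
    for doc, pos in postings:
        first[doc] = min(pos, first.get(doc, pos))

    result: List[List[str]] = []
    seen = set()
    for depth in depths:
        tier = [doc for doc, pos in first.items() if pos <= depth and doc not in seen]
        # More occurrences rank higher; docID breaks any remaining tie.
        tier.sort(key=lambda doc: (-counts[doc], doc))
        seen.update(tier)
        result.append(tier)
    return result


first_tier, second_tier = tiered_results("word 3")
```

Extrinsic criteria such as link popularity would enter as further components of the sort key.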
If the user had specified that they required three results to be displayed at a time, an additional result would be required, so a further tier would be automatically generated. For example, the second tier may retrieve all instances where the search term occurred within the first five terms of the document. The third tier may then retrieve all instances where the search term occurred within the first ten terms of the document, and so on. This multiple tier approach, whereby each successive sweep of the index is made in accordance with different filtering parameters, can significantly increase the speed of return in the case of many search queries. However, if multiple sweeps are indeed required, then this approach can suffer in terms of performance, as a completely new sweep of the index needs to be carried out for each successive tier (whether automatically or on demand). An alternative approach is to effectively run the different filters in parallel, with only a single sweep of the index, and storing a temporary set of results in central computer 110. The records in this temporary set of results are then assembled in the order in which they would have been generated by successive sweeps. This technique has been found to result in a significant enhancement to performance.
Figure 9 depicts a process by which the invention meets a typical search query in the condensed index form, with the page segment codes appended to the document list to allow the search engine to retrieve all the necessary positional data with a single scan of a single list, through the use of page segments. In this example, the database 910 to be searched comprises a word list 920 and positional data 930 relating to each entry in word list 920. When a query 900 is submitted for the search term "word X," the search engine scans database 910, finds the entry for "word X" in word list 920, and examines positional data 930 associated with that word.
The search engine first retrieves all instances where "word X" occurs as the first word on the page (ie. in page segment 1), on the assumption that there is generally a direct correlation between the proximity of the search term to the beginning of the page and the relevance of that page to the search query. In this example, a page segment code 940 indicates that the search term occurs in the first segment of document 5, the third segment of document 1, the fourth segment of document 2 and the fifth segment of document 3. The search results are therefore displayed to the user in this order, based on the assumption that documents with the search term occurring closer to the beginning of the page are likely to be more relevant. In other embodiments of the present invention, the proximity of the search term to the beginning of the page may be combined with or moderated by other indicia of relevance to determine the final ranking of search results.
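A single scan over the condensed document list, with results assembled in the order that successive sweeps would have produced, could look something like the sketch below. The (document, page segment code) pairs are the ones given for Figure 9; the storage order of the list, the dictionary layout and the name `single_sweep` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (document ID, page segment code) pairs for each indexed word.
# The values are those described for Figure 9; storing them in ascending
# document order is an assumption.
condensed_index: Dict[str, List[Tuple[int, int]]] = {
    "word X": [(1, 3), (2, 4), (3, 5), (5, 1)],
}


def single_sweep(term: str) -> List[int]:
    """Scan the document list once, bucketing documents by segment code."""
    buckets: Dict[int, List[int]] = defaultdict(list)  # temporary result set
    for doc_id, segment in condensed_index.get(term, []):
        buckets[segment].append(doc_id)
    # Emit buckets from the earliest segment to the latest, which is the
    # order that one sweep per tier would have generated.
    return [doc for segment in sorted(buckets) for doc in buckets[segment]]


ranking = single_sweep("word X")
```

Each bucket corresponds to one tier, so a caller that only needs the first few results can stop reading after the earliest buckets.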
The page segmented approach described above effectively includes the results associated with the first segment with those associated with the second, and so on (Figure 4). This can present a disadvantage from the point of view of index memory. However, when multiple search terms are included in a search query, and those terms are widely separated in a particular document, this approach helps address the potential problem that this type of relevance algorithm might otherwise produce. An alternative mechanism for indexing the pages according to a page segmented approach is to exclude the terms in one portion from those in the others. By way of example, the first word on a page may be regarded as segment 1, the second to fifth words as segment 2, the sixth to tenth words as segment 3, the eleventh to twenty-fifth words as segment 4, and so forth. This reduces the memory required for the index, but can produce disadvantages when multiple search terms are included in a search query. If this approach is taken, then the index database can be processed to simply replicate the entries of all indexed terms associated with a particular segment as entries of those terms but associated with all subsequent segments, which has the effect of copying all indexed terms from that particular segment into all subsequent segments. Again, the memory load associated with this technique can be offset by the practical advantages to which it gives rise.
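The exclusive segment boundaries given above, and the effect of replicating entries into all subsequent segments, can be illustrated with a short sketch. The boundaries 1, 5, 10 and 25 come from the example; the fifth catch-all segment, the function names, and the idea that a two-term query is matched by looking for a segment containing both terms are assumptions, since the text does not spell out how multi-term matching is performed.

```python
from bisect import bisect_left
from typing import Optional, Set

# Last word position covered by segments 1 to 4, from the example above.
SEGMENT_UPPER_BOUNDS = (1, 5, 10, 25)
LAST_SEGMENT = len(SEGMENT_UPPER_BOUNDS) + 1  # assumed catch-all for later words


def exclusive_segment(position: int) -> int:
    """Map a 1-based word position to its exclusive page segment number."""
    return bisect_left(SEGMENT_UPPER_BOUNDS, position) + 1


def replicate_forward(segment: int) -> Set[int]:
    """Segments holding an entry once it is copied into all later segments."""
    return set(range(segment, LAST_SEGMENT + 1))


def earliest_shared_segment(pos_a: int, pos_b: int, replicated: bool) -> Optional[int]:
    """Earliest segment containing both terms, or None if they never meet."""
    seg_a, seg_b = exclusive_segment(pos_a), exclusive_segment(pos_b)
    a = replicate_forward(seg_a) if replicated else {seg_a}
    b = replicate_forward(seg_b) if replicated else {seg_b}
    shared = a & b
    return min(shared) if shared else None
```

Under these assumptions, terms at word positions 1 and 7 share no segment in the purely exclusive index, but after replication both appear in segment 3, which illustrates the multi-term disadvantage and its remedy at the cost of extra index entries.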
The embodiments described above rely on a relevance algorithm based on the assumption that the importance of a particular term in a document - such as a webpage - is related to its proximity to the beginning of that document. This is a particularly effective assumption in the case of documents containing journalistic content, such as online news stories and features, as such articles tend - due to the inherent nature of the writing style - to include the most germane terms in the initial text portions (title, opening paragraphs). In some types of documents this assumption may be less appropriate, and additional filtering approaches may then be needed to more effectively discriminate between returns.
In the case of particular categories of stored information, a problem can arise if a plurality of pages open with a repeated component, such as a 'boilerplate portion', or other common text element. Many websites now use template formats from within content management systems, which can result in the first portion of every webpage at that site being identical. This situation clearly has the potential to interfere with the relevance algorithm, by effectively offsetting the unique content further down the respective page, and thus out of proximity with the actual beginning of that page. The present invention addresses this potential problem by employing a 'virtual' page beginning for operation of the relevance algorithm in respect of documents of this type, thus effectively ignoring such common portions. By way of example, in indexing a particular site, all the pages of that site are first sorted into alphabetical order of the abstracts. Each successive page is then compared with the preceding page to determine at what point the content begins to differ, and if a common portion is identified this is cut from each such document and pasted at the end of the document. Those modified documents can then be indexed as described above, the positional data included in the index representing a far stronger measure of relevance than would otherwise be the case.
Figure 10 sets out a method used by the present invention to increase the relevance of the results, by eliminating duplicate records from the search results.
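The 'virtual' page beginning described above (sort the pages of a site, compare each page with its predecessor, and move any shared opening to the end) might be sketched as follows. This is a simplified illustration: it works on whitespace-separated words, sorts on the page text itself rather than on a separate abstract, and uses a hypothetical `min_words` threshold to stand in for a prescribed minimum size of the common portion.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading words that two pages have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def move_boilerplate(pages: List[str], min_words: int = 3) -> List[str]:
    """Return the sorted pages with any shared opening moved to the end."""
    docs = sorted(page.split() for page in pages)
    offsets = [0] * len(docs)
    for i in range(1, len(docs)):
        n = common_prefix_len(docs[i - 1], docs[i])
        if n >= min_words:
            offsets[i] = n
            if offsets[i - 1] == 0:  # predecessor had no earlier match
                offsets[i - 1] = n
    # Cut the common portion from the front and paste it at the end.
    return [" ".join(words[k:] + words[:k]) for words, k in zip(docs, offsets)]


site_pages = [
    "Acme News Home Internet growth slows",
    "Acme News Home Budget passes",
    "Acme News Home Wombat spotted",
]
reordered = move_boilerplate(site_pages)
```

After this step, word positions counted from the start of each modified page reflect the distance from the first unique content rather than from the template header.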
Each entry (or at least a selected field of that entry, generally the content abstract) in the search engine's list of documents 1000 is treated as if it were one large binary number. A hash function 1010 is applied to that value to produce a list of corresponding hash values 1020. The list of hash values 1020, each with a pointer to the corresponding entry in the database, is stored in RAM in the computer system on which the search engine operates. Hash function 1010 is an algorithm that turns variable-sized inputs 1000 into fixed-sized outputs 1020. The list of hash values 1020 therefore comprises fixed-length numerical representations of each document indexed by the search engine. In the list of hash values 1020, it is intended that each unique document will have a unique hash value. It is then a simple matter to effectively eliminate documents with identical content by identifying those which have the same hash values. In practice, this is done at runtime, by checking the hash table, and, for each hash value, only including in the list of search results sent to the user one entry (generally the first entry encountered) corresponding to that hash value. Clearly the hashing algorithm can be applied at runtime, but such an approach is not a preferred option because it will have the effect of slowing search result returns.
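The duplicate elimination of Figure 10 can be sketched in two steps: hashing each content abstract when the index is built, and keeping only the first entry per hash value when results are assembled. The sketch below is illustrative only. It uses Python's `zlib.crc32` as a stand-in hash function, stores the hash values in a list indexed by entry number rather than in a table of pointers, and the names `index_hashes` and `drop_duplicates` are invented for the example.

```python
import zlib
from typing import Callable, List


def index_hashes(abstracts: List[str],
                 hash_fn: Callable[[bytes], int] = zlib.crc32) -> List[int]:
    """Index time: one fixed-length hash value per database entry."""
    return [hash_fn(text.encode("utf-8")) for text in abstracts]


def drop_duplicates(matches: List[int], hashes: List[int]) -> List[int]:
    """Query time: keep only the first matching entry for each hash value."""
    seen = set()
    kept: List[int] = []
    for entry_id in matches:
        value = hashes[entry_id]
        if value not in seen:
            seen.add(value)
            kept.append(entry_id)
    return kept


abstracts = [
    "Wombats are marsupials.",
    "Search engines index pages.",
    "Wombats are marsupials.",  # same content reachable under another address
]
hashes = index_hashes(abstracts)               # computed before any query arrives
results = drop_duplicates([0, 1, 2], hashes)   # entry 2 is omitted
```

Since any fixed-length hash can map two different documents to the same value, a shorter hash makes it more likely that a distinct document is dropped as a presumed duplicate.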
Hash function 1010 may be the CRC-16 (16-bit Cyclic Redundancy Checking) algorithm, which is most commonly used to check for errors in data transmission. In this case, the CRC-16 algorithm essentially treats each indexed document as a large binary number, which it divides by the polynomial x^16 + x^15 + x^2 + 1, taking the remainder as a hash value of fixed length. Other embodiments of the present invention use different hash functions which also produce unique fixed-length numerical representations of each unique document that is indexed by the search engine.
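A bitwise CRC-16 using the polynomial x^16 + x^15 + x^2 + 1 might be written as below. The description does not state the initial register value or the bit ordering, so this sketch assumes the common variant known as CRC-16/ARC (initial value zero, least significant bit first, where the polynomial appears in reversed form as 0xA001); the function name `crc16` is illustrative.

```python
def crc16(data: bytes) -> int:
    """CRC-16 for x^16 + x^15 + x^2 + 1, assumed CRC-16/ARC parameters."""
    crc = 0x0000  # assumed initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; subtract (XOR) the polynomial when it was set.
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


abstract_hash = crc16("Wombats are marsupials.".encode("utf-8"))
```

The result always fits in 16 bits, so at most 65,536 distinct hash values are available across the whole index.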
The foregoing description of an implementation of the invention has been presented for purposes of illustration and description only. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practicing of the invention.
It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.

Claims
1. A method for searching stored information representing searchable documents on one or more computers, comprising the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer.
2. The method of claim 1, wherein said stored information is stored as a plurality of webpages on a network of computers, each document being one webpage.
3. The method of claim 2, wherein the network of computers is the Internet.
4. The method of any preceding claim, wherein the search query comprises a plurality of search terms and at least one operator, including the step of parsing the search query into search terms and operator(s) relating to the search terms.
5. The method of any preceding claim, wherein the search query also includes one or more search operators which alter the parameters of the search to be conducted by the central computer.
6. A method according to any preceding claim, wherein said list of search results is arranged in order of relevance, a closer proximity of the at least one term from a beginning position of each respective document associated with a higher relevance for that document.
7. The method of any preceding claim, wherein the database of indexed information contains index entries each associated with a preliminary pertinence attribute.
8. A method according to any preceding claim, wherein said positional data representing the position of respective terms in respective documents provides, in each case, a non-unique indicator of the proximity of the at least one term from a beginning position of each respective document.
9. A method according to claim 8, including allocating a first proximity code in respect of terms within a predetermined distance A from said beginning position, and a second proximity code in respect of terms within a distance B from said beginning position, wherein B>A, and wherein said list of search results is arranged in order of relevance, said first proximity code associated with a higher relevance than said second proximity code.
10. A method according to any preceding claim, wherein the step of generating a list of search results includes generating a first tier of results, said first tier of results being sent to the remote computer, and wherein, in response to a predetermined instruction from the remote computer, an additional tier of search results is generated, extending to a lower level of relevance to the search query than the preceding tier, said level of relevance being determined in accordance with the proximity of the at least one term from the beginning position of each respective document, wherein the additional tier of search results is then sent to the remote computer.
11. A method according to claim 10, wherein a plurality of additional tiers are successively generated in response to subsequent predetermined instructions, each to a lower level of relevance to the search query than the preceding tier.
12. A method according to any preceding claim, wherein said beginning position of a document, relative to which said proximity is measured, is offset from the actual beginning of that document.
13. A method according to claim 12, including the step of comparing a plurality of documents to determine whether they share a common initial content portion greater than a prescribed size, and using said common initial content portion in order to determine said offset.
14. A method according to claim 13, wherein the documents associated with a particular stored information provider are sorted according to an attribute of the content, the content of each document is then compared with that of the preceding document to determine said common initial content portion and hence said offset, and for each document containing said common portion, that portion is moved to the end of that document before said database of indexed information relating to said stored information is generated.
15. A method for searching stored information representing searchable documents on one or more computers, comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value; and sending the list of search results to the remote computer.
16. The method of claim 15, wherein the hashing algorithm is a 16-bit cyclic redundancy checking algorithm.
17. The method of claim 15 or 16, wherein said step of adding each hash value to a hash table is carried out before said search query is received.
18. The method of any one of claims 15 to 17, wherein said stored information includes a respective document content abstract for each entry, each document content abstract being converted into the corresponding hash value.
19. A computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, comprising the steps of: generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; receiving, from a remote computer, a search query relating to the stored information; using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and sending the list of search results to the remote computer.
20. A system for searching stored information in a network, comprising: means for generating a database of indexed information relating to said stored information, the database containing indexed terms and positional data representing the position of respective terms in respective documents; means for receiving, from a remote computer, a search query relating to the stored information; means for using a central computer to search the database of indexed information for at least one search term comprising the search query, and identifying documents in which said at least one search term occurs; means for using said positional data to generate a list of search results in accordance with the proximity of the at least one term from a beginning position of each respective document; and means for sending the list of search results to the remote computer.
21. A computer-readable medium containing instructions for performing a method for searching stored information representing searchable documents on one or more computers, the method comprising the steps of: compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; adding each hash value to a hash table with a pointer to its corresponding entry in said database; receiving, from a remote computer, a search query relating to the stored information; searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; identifying, from said set of entries, entries in the hash table which have an identical hash value; generating a list of search results corresponding with said set of entries, omitting all but one of the entries corresponding to said duplicated hash value; and sending the list of search results to the remote computer.
22. A system for searching stored information representing searchable documents on one or more computers, comprising: means for compiling a database of indexed information relating to said stored information, each entry in the database relating to the content of a respective searchable document; means for using a hashing algorithm to convert at least a portion of each entry in said database into a corresponding hash value of fixed length; means for adding each hash value to a hash table with a pointer to its corresponding entry in said database; means for receiving, from a remote computer, a search query relating to the stored information; means for searching the database of indexed information for at least one search term comprising the search query, and identifying a set of entries in which at least one search term occurs; means for identifying, from said set of entries, entries in the hash table which have an identical hash value; means for generating a list of search results corresponding with said set of entries, means for omitting all but one of the entries corresponding to said duplicated hash value; and means for sending the list of search results to the remote computer.
23. A computer search engine for searching information relating to a plurality of documents, configured to return, to a user submitting a search query comprising one or more search terms, a list of search results referencing a plurality of said documents, the list ordered in dependence on the proximity of at least one of said search terms to a beginning position of the respective documents.
24. A computer search engine for searching information relating to a plurality of documents, configured to return, to a user submitting a search query comprising one or more search terms, a list of search results referencing a plurality of said documents, the list omitting references to more than one document if a number of such documents contain identical content.
PCT/AU2001/001111 2000-09-04 2001-09-04 Method and system for searching stored information on one or more computers WO2002021325A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001287351A AU2001287351A1 (en) 2000-09-04 2001-09-04 Method and system for searching stored information on one or more computers

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AUPQ9868A AUPQ986800A0 (en) 2000-09-04 2000-09-04 Method and apparatus for searching stored information on the internet
AUPQ9868 2000-09-04
AUPR6308 2001-07-11
AUPR6308A AUPR630801A0 (en) 2001-07-11 2001-07-11 Method and system for searching stored information on one or more computers

Publications (1)

Publication Number Publication Date
WO2002021325A1 (en) 2002-03-14

Family

ID=25646429

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2001/001111 WO2002021325A1 (en) 2000-09-04 2001-09-04 Method and system for searching stored information on one or more computers

Country Status (1)

Country Link
WO (1) WO2002021325A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1443426A1 (en) * 2003-01-29 2004-08-04 Hewlett-Packard Company (a Delaware corporation) Process for searching a repository
EP1585033A1 (en) * 2004-04-08 2005-10-12 Deutsche Thomson-Brandt Gmbh Method and device for preparing an index of a database and for retrieving data from the database
EP2725746A1 (en) * 2012-10-29 2014-04-30 Bouygues Telecom Method of indexing digital contents stored in a device connected to an Internet access box
WO2015065859A3 (en) * 2013-10-29 2015-06-25 Microsoft Technology Licensing, Llc Text sample entry group formulation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5913208A (en) * 1996-07-09 1999-06-15 International Business Machines Corporation Identifying duplicate documents from search results without comparing document content
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6269364B1 (en) * 1998-09-25 2001-07-31 Intel Corporation Method and apparatus to automatically test and modify a searchable knowledge base

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1443426A1 (en) * 2003-01-29 2004-08-04 Hewlett-Packard Company (a Delaware corporation) Process for searching a repository
US7263519B2 (en) 2003-01-29 2007-08-28 Hewlett-Packard Development Company, L.P. Process for searching a repository of resources
EP1585033A1 (en) * 2004-04-08 2005-10-12 Deutsche Thomson-Brandt Gmbh Method and device for preparing an index of a database and for retrieving data from the database
EP2725746A1 (en) * 2012-10-29 2014-04-30 Bouygues Telecom Method of indexing digital contents stored in a device connected to an Internet access box
FR2997595A1 (en) * 2012-10-29 2014-05-02 Bouygues Telecom Sa METHOD FOR INDEXING THE CONTENTS OF A DEVICE FOR STORING DIGITAL CONTENTS CONNECTED TO AN INTERNET ACCESS BOX
WO2015065859A3 (en) * 2013-10-29 2015-06-25 Microsoft Technology Licensing, Llc Text sample entry group formulation
CN105683958A (en) * 2013-10-29 2016-06-15 微软技术许可有限责任公司 Text sample entry group formulation
US9535983B2 (en) 2013-10-29 2017-01-03 Microsoft Technology Licensing, Llc Text sample entry group formulation

Similar Documents

Publication Publication Date Title
US7860853B2 (en) Document matching engine using asymmetric signature generation
US6795820B2 (en) Metasearch technique that ranks documents obtained from multiple collections
US9619565B1 (en) Generating content snippets using a tokenspace repository
US6615209B1 (en) Detecting query-specific duplicate documents
US8332422B2 (en) Using text search engine for parametric search
US6321220B1 (en) Method and apparatus for preventing topic drift in queries in hyperlinked environments
US7308643B1 (en) Anchor tag indexing in a web crawler system
EP1779273B1 (en) Multi-stage query processing system and method for use with tokenspace repository
US8027974B2 (en) Method and system for URL autocompletion using ranked results
US5963954A (en) Method for mapping an index of a database into an array of files
US5765150A (en) Method for statistically projecting the ranking of information
US5864863A (en) Method for parsing, indexing and searching world-wide-web pages
US6081804A (en) Method and apparatus for performing rapid and multi-dimensional word searches
US20090119289A1 (en) Method and System for Autocompletion Using Ranked Results
US20060253438A1 (en) Matching engine with signature generation
US20050004943A1 (en) Search engine and method with improved relevancy, scope, and timeliness
US20080313178A1 (en) Determining searchable criteria of network resources based on commonality of content
WO1998007105A1 (en) Real-time document collection search engine with phrase indexing
WO2008097856A2 (en) Search result delivery engine
WO2001016807A1 (en) An internet search system for tracking and ranking selected records from a previous search
EP1938214A1 (en) Search using changes in prevalence of content items on the web
Gog et al. Efficient and effective query auto-completion
Franklin How internet search engines work
WO2006122086A2 (en) Matching engine with signature generation and relevance detection
WO2002021325A1 (en) Method and system for searching stored information on one or more computers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP