This invention is generally concerned with software and systems for searching. More particularly it relates to systems for searching and cataloguing documents on networks such as the World Wide Web and to new interfaces to such systems.
The World Wide Web is expanding more quickly than the capacity of search engines to catalogue it and search engines are increasingly falling behind ('Accessibility of information on the web' by Steve Lawrence and C. Lee Giles, Nature, 400, 107, July 1999).
As of January 2000 the World Wide Web comprised more than one billion unique documents (http://www.inktomi.com/new/press/billion.html), and indexing new or modified web pages could take several months or longer.
A method for organising information is known from WO 99/06924 in which the search activity of a user is monitored and used to organise articles in a subsequent search by the same or another user who enters a similar search query.
US 5,748,954 refers to determining the popularity of a file according to how often a file is referenced by a computer other than the computer on which the file is stored. US 5,974,455 uses a hash table and a sequential disk file to construct a search database. US 5,983,218 describes a distributed (multimedia) database using a web server to select and co-ordinate information flow between database sites and user sites. US 6,006,217 describes a method for providing enhanced search results in which a server retrieves a document from its home server and highlights matches to search criteria. US 6,038,668 describes a networked catalogue search system in which a search engine forwards retrieved pages to an object oriented database distributed across a network of computers.
A local portal retrieves pages through a web crawler. US 6,078,924 uses collection agents to retrieve specific information without user intervention. WO99/42935 describes a search system in which characteristic information for a search database is stored across a computer network. An information collector comprises a plurality of collecting modules and user access to the system is via an interface server.
EP-A-0 982 672 describes an information retrieval system including a search assisting server having list data constructed using a list of identifiers for accessing information servers. In response to designation of a requested item the identifier corresponding to the requested item is searched for from the list data. JP 11015856A describes a server for integrating databases including multimedia materials comprising a meta-server including a meta-database, a search agent for searching an objective database site by indexing, and an improving module for observing a response pattern from a database site corresponding to a user's enquiry and improving a calculation of a future site relation.
A distributed indexing/searching workshop held by the World Wide Web Consortium in May 1996, Massachusetts, USA (www.w3.org/search/9605-indexing-workshop) provides background information on web spidering. The web site www.webbuildermag.com/upload/free/features/webbuilder/1999/udell/1999-07-20.asp purports to disclose an article in Web Builder Magazine of July 20, 1999 by Jon Udell which briefly refers to a distributed spidering process in which a number of software agents collect data for a search database. The article invites comment on the idea of "pushing the work of spidering (but not indexing) out to ISPs and other hosts that serve large numbers of pages".
Since the web is expanding more rapidly than the capacity of current search engines to catalogue it, a system and method is required in which inter alia the cataloguing of the web is performed more quickly than has hitherto been the case.
There is also a demand for a search engine with a more comprehensive database than those of current search engines which will enable more complete results to be returned in response to a user's search query.
The present invention addresses these needs.
According to the present invention there is therefore provided a server system for searching a network, the system comprising: a search data store storing: a plurality of addresses of locations of objects accessible using the network; and search data including data relating to information content of at least some of the objects; a program store storing processor implementable instructions; a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- receive a search request from a user terminal; retrieve search result data from the search data store comprising one or more search result addresses for objects having an information content relevant to the search request; transmit the search result data to the user terminal; receive from the user terminal information relating to an object located at an address provided to the user terminal by the server system; and update the stored search data using the object-related information received from the user terminal.
The address provided to the user terminal by the server system may comprise one of the search result addresses or a search tax address (described below) or an address for spidering as a background task. Preferably, however, a plurality of addresses for a plurality of objects is provided to the user terminal. The information relating to the object or objects at the address or addresses provided to the user terminal may comprise object content characterizing data such as a last modified date and/or checksum for a web page, or it may comprise object information content data such as indexed content data. Alternatively, but less preferably, raw object data may be received from the user terminal, such as raw (i.e. unprocessed) web page data.
Updating the stored search data using information received from a user terminal relating to an object located at an address provided to the user terminal by the server system relieves the server system of much of the search and indexing work it would otherwise have to perform. The reception of information from the user terminal is linked to use of the server system to process search requests from the user terminal which allows better use of network bandwidth and processing bandwidth as well as facilitating simplification of overall system design. As the skilled person will appreciate, the search data store may reside on a single machine or may comprise a distributed data store.
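By way of illustration only, the request-and-update cycle described above may be sketched as follows; all names and data structures here are hypothetical and do not form part of the invention:

```python
# Illustrative sketch (hypothetical names) of the server-side cycle:
# answer a search request from the store, then fold a user terminal's
# report on an object back into the stored search data.

class SearchServer:
    def __init__(self):
        # search term -> list of result addresses, most relevant first
        self.index = {}

    def handle_search(self, term):
        """Return search result addresses for a search request."""
        return list(self.index.get(term, []))

    def handle_report(self, term, address):
        """Update the stored search data using object-related
        information received back from a user terminal."""
        addresses = self.index.setdefault(term, [])
        if address not in addresses:
            addresses.append(address)

server = SearchServer()
server.handle_report("spider", "http://example.com/a")
results = server.handle_search("spider")
```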
In a preferred embodiment the instructions further comprise instructions for retrieving at least one search tax address from the search data store, transmitting this to the user terminal, and receiving back information relating to an object at the search tax address. The search tax address is an address provided to the user terminal for the user terminal to process, but in general does not comprise one of the search result addresses. Thus, in effect, this additional address is a tax on the user terminal (or user) for allowing the terminal access to the search data store. The search tax address or addresses may comprise an address or addresses which are to be processed by the user terminal in an on-going background spidering process or the search tax address or addresses may be provided to the user terminal in response to receipt of a search request from the user terminal on a per-search basis. In a preferred embodiment both background and per-search tax addresses are sent to the user terminal for spidering.
The search tax addresses are preferably selected according to a logical proximity of an object at the tax address to the user terminal. Such a logical proximity may be based upon the user terminal's IP address, or upon a proximity measure such as ping time or a count of a number of hops between the user terminal and the object at the tax address. Search tax addresses may also be selected dependent upon the network access bandwidth of the user terminal.
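A proximity-based selection of this kind may be sketched as follows; the hop counts and function names are purely illustrative assumptions:

```python
# Hypothetical sketch: pick the search tax address whose object is
# logically closest to the user terminal, here measured as hop count
# (ping time or an IP-address-derived measure could be used instead).

def select_tax_address(candidates, hop_count):
    """candidates: list of addresses; hop_count: address -> measured hops."""
    return min(candidates, key=hop_count)

hops = {"http://near.example": 3, "http://far.example": 14}
chosen = select_tax_address(list(hops), lambda a: hops[a])
```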
The object information content data preferably comprises a list of words in the object and word rating data indicating the likely significance of the words to the object. The
server system may also be configured to receive user object preference data such as bookmark data indicating objects a user has bookmarked for access on later occasions.
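A minimal sketch of how a user terminal might derive such a word list with significance ratings is given below; the use of relative term frequency as the rating is an illustrative assumption, not a requirement of the invention:

```python
# Illustrative only: derive a word list with crude significance ratings
# (relative term frequency) from an object's text, as a user terminal
# might report back to the server system.

from collections import Counter

def rate_words(text):
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    # rating = relative frequency of the word within the object
    return {w: c / total for w, c in counts.items()}

ratings = rate_words("spider web spider")
```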
To restrict the likelihood of fraud, in preferred embodiments two or more user terminals are sent the same object's address and the search data store is only updated once the result from a first user terminal has been checked against the data received from the second or further user terminals. The system may also monitor users' IP addresses and/or users' traffic to detect fraud.
The invention also provides a search data store for the server system wherein an item of the object information content data, such as a keyword, is associated with a plurality of item location addresses for objects having an information content relevant to the item of object information content data; and wherein the item location addresses have an order corresponding to the relevance of the objects at the addresses to the item of object information content data.
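The ordered association between an item and its location addresses may be sketched as follows; the structure and names are illustrative assumptions only:

```python
# Sketch of the data-store aspect: each item (e.g. a keyword) maps to
# item location addresses kept in order of relevance, most relevant
# first. Names and the relevance scale are illustrative, not from the
# source document.

def add_rated_address(store, keyword, address, relevance):
    entries = store.setdefault(keyword, [])
    entries.append((relevance, address))
    # keep the most relevant addresses first
    entries.sort(key=lambda e: e[0], reverse=True)

store = {}
add_rated_address(store, "web", "http://b.example", 0.2)
add_rated_address(store, "web", "http://a.example", 0.9)
ordered = [addr for _, addr in store["web"]]
```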
In a complementary aspect the invention provides a user terminal for searching a network, the user terminal comprising: a data store operable to store data to be processed; a program store storing processor implementable instructions; and a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- input a search request from a user; transmit the search request to a server system; receive search result data from the server system, the search result data comprising one or more search result addresses for objects having an information content relevant to the search request; retrieve from at least one address received from the server system object data for an object located at the received address; and transmit to the server system information relating to the object located at the received address derived from the retrieved object data.
The address received from the server system may be a search result address, a search tax address provided in response to a search request or a background search tax address, as
described above with reference to the server system. The search request itself may either be issued in a conventional manner using an internet or web browser, or the search request may originate from dedicated searching code running on the user terminal. The object data retrieved by the terminal may comprise bibliographic data such as a last-modified date or more complete object data; the information transmitted to the server system may comprise the retrieved object data itself, for example where only bibliographic data is retrieved, or it may comprise the results of an object analysis procedure which has been executed on the user terminal. The server system with which the user terminal communicates may comprise a single server or a set of interrelated servers.
The processor implementable instructions of one or both these systems may be provided on a data carrier or storage medium such as a hard or floppy disk, ROM or CD-ROM, or on an optical or electrical signal carrier. The processor implementable instructions of the user terminal may be stored in the data store of a network server such as a web server, for example as part of a page of internet data such as a web page.
The invention also provides a corresponding method for searching a network using a client system, the method comprising: inputting a search request from a user; transmitting the search request to a server system; receiving search result data from the server system, the search result data comprising one or more search result addresses for objects having an information content relevant to the search request; retrieving from at least one address received from the server system object data for an object located at the received address; and transmitting to the server system information relating to the object located at the received address derived from the retrieved object data.
In another aspect the invention provides a search system for a network comprising: a server coupled to the network; a plurality of user network-access means, couplable to the server via the network for providing a plurality of users with access to the network; a search database coupled to the server; an information collecting program accessible to each said user network-access means for running by said users; wherein said
information collecting program is configured to, when running on a said user network-access means, collect information relating to data stored at locations within the network and to pass at least a portion of the collected information to the search database; and wherein said locations are provided to the collecting program from the database in response to a search request sent by the collecting program to the server for search data from the database.
The search system may be part of a system providing a user's search service. The network may be an Internet protocol network such as an Internet or an intranet and in what follows references to "web pages" are intended to include pages of information in internets and intranets other than the World Wide Web. Typically the user network-access means will be a personal computer, but network access can also be by means of a mobile telephone, Internet enabled TV and other similar net-compliant devices. In one embodiment the information collecting program is integrated into a web browser, for example, comprising part of an executable file of the browser. The search database comprises generally data and a software interface thereto and may include associated data manipulation, processing and communication functionality.
In an Internet, data locations are usually identified by URLs (Uniform Resource Locators), and in a preferred embodiment these are provided to the information collecting program from the database. However, URLs for collecting information could be obtained from another source. Information from the collecting program for the database could comprise a downloaded web page and/or a compressed or encrypted version thereof, or the web page after partial or full analysis for, for example, keywords and/or phrases, by the information collecting program. In an Internet, the Internet data collected may include (but is not limited to) HTML data, XML data, DHTML data, SGML data, web page information, and audio, video, multi-media, web TV, game, file, financial and other information types.
Often the information collecting program will require some sort of "signature" to show that it can be trusted to read and/or write to a local user's hard disk and to access
information on other servers. This is not, however, an essential aspect of the invention but depends, in part, on how the network is set up and the context (for example the browser type) within which the information collecting program operates.
In a further aspect the invention provides a method of updating a search system for a network, the system comprising: a server; a plurality of user network-access means, couplable to the server via the network, each for providing a user with network access; and a search database couplable to the server; the method comprising: running an information collecting program by a plurality of said users; collecting information relating to data stored within the network using the program; passing at least a portion of the information collected by said plurality of users to the search database; and updating the database using the collected information.
In one embodiment the user's access to or vote of approval for information provided by the search results is logged or registered in the database. Votes can then, for example, be counted so that the results of future searches can be presented or ranked in order of relevance as determined by users of the system. Preferably there is also provision for bookmarking, that is, in the context of an Internet search page, the marking of user-preferred pages in order that these can be returned to at a later stage. More generally bookmarking involves the storage of a location identifier, normally with some information concerning the site, page or data it locates, for example a title or description. Normally a user's bookmarks are specific to an individual user, but bookmarks can also be shared between users or within groups of users. In a preferred embodiment, when a site or web page or other network location is bookmarked this is registered as user approval for later ranking of search results, and where an access or vote counting system is implemented, additional weight can be given to bookmarked sites.
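The vote counting and bookmark weighting just described may be sketched as follows; the weight of three for a bookmark is an invented illustrative value:

```python
# Hedged sketch: count user votes per page, with a bookmark counting for
# extra weight, then rank results by total score. The bookmark weight is
# an illustrative assumption, not specified by the source document.

def rank_pages(events, bookmark_weight=3):
    """events: list of (url, kind) where kind is 'visit' or 'bookmark'."""
    scores = {}
    for url, kind in events:
        scores[url] = scores.get(url, 0) + (
            bookmark_weight if kind == "bookmark" else 1)
    return sorted(scores, key=lambda u: scores[u], reverse=True)

ranking = rank_pages([("a", "visit"), ("b", "visit"), ("b", "bookmark")])
```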
The invention also provides a program to, when running on a network: provide a user interface for searching the network; accept a user search request; pass a request to a search database, responsive to the user request; receive a search result having network data location information from the database; access, or request another program to
access, the data location; and pass information from the data location back to the database.
The invention further provides a web browser application program to, when running, receive a URL from a server, at least partly download a web page at the URL, extract a portion of information from the web page, and send the information to a web searching database on the web.
In another aspect the invention provides a web data collection system comprising a plurality of individual users each connected to the web and running a program to collect information on the contents of web pages and to report the information to a common database.
In another aspect the invention provides a database for a network searching system comprising: a list of network resource locators; a list of search terms or term identifiers; and a list of ratings, each linked to at least one resource locator and one term or term identifier, a value of each rating being dependent upon access to or approval of a corresponding located resource by users of the searching system.
In another aspect the invention provides a method of bookmarking resource locations in a network searching system, the system comprising a server coupled to a search database and means for remote access to the database by a plurality of users, the method comprising: providing to a user in response to a search request, search results from the database, the results being associated with corresponding resource locators; receiving from the user a request to bookmark a resource associated with a said result; storing, in the database, a corresponding resource locator coupled with user access control information for the user; whereby the resource is locatable by the user after bookmarking.
In another aspect the invention provides a method of ranking results for a network search system, comprising: determining a first user's interest in a network resource by
detecting whether the user stores the resource location for later access; and ranking a plurality of network resource locations provided as results for a search performed by another user, partly responsive to the first user's determined interest.
In another aspect the invention provides a method of providing a web user with a preview of a web page, comprising: locally caching at least part of the web page information; rewriting at least one link in the cached page to point to locally cached data; and displaying at least a part of the cached page.
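The link-rewriting step of this preview aspect may be sketched as follows; the cache directory layout and filename scheme are illustrative assumptions:

```python
# Illustrative sketch of the preview step: rewrite links in a cached
# page so that they point at locally cached copies. The cache layout
# and filename scheme are hypothetical.

import re

def rewrite_links(html, cache_dir="cache"):
    # replace href="http://..." with href="cache/<escaped-url>.html"
    def to_local(match):
        url = match.group(1)
        safe = re.sub(r"[^A-Za-z0-9]", "_", url)
        return 'href="%s/%s.html"' % (cache_dir, safe)
    return re.sub(r'href="(http[^"]+)"', to_local, html)

page = '<a href="http://example.com/x">x</a>'
local = rewrite_links(page)
```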
In another aspect the invention provides a user interface for a network browser or search system, comprising means to automatically download a plurality of documents or web pages, or parts thereof, indicated by displayable results provided to a user, by starting a corresponding plurality of processing tasks to be executed in parallel.
In another aspect the invention provides a network search system comprising: means to store a search request input to the system by a user on a first occasion; and means to repeat the user's stored request automatically and to display the results of the request when the user accesses the system on a second, subsequent, occasion.
In another aspect the invention provides a network search system comprising: a server coupled to a search database; a remote network access means including input means for a user to input a search request; means to provide an instruction from the database to the remote network access means to access and analyse information relating to a resource on the network and to report to the search database; and means to provide search results to the network access means in response to the search request, conditional upon the database receiving the report.
In another aspect the invention provides a method for quality control of a database of search data for a network, comprising: instructing a plurality of client programs to gather information for the database from locations provided to the programs by the database; double checking a proportion of the gathered information by issuing identical
or equivalent locations to two different client programs; determining whether the gathered information from the two client programs agrees to within a tolerance margin; and adjusting said proportion based on the results of said step of determining.
In another aspect the invention provides a stand-alone distributed web crawler to, when run, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
In yet further aspects the invention provides a system and method in which a signed Java applet performs a web-crawling function analysing web pages and posting the results of its web crawling operations to a system partly comprising a database.
Such a database-system may be built from scratch expressly for the purpose of serving the signed Java applet or it may comprise an existing system, with the potential addition of new schemas, tables, relations or other data structures which facilitate serving the applet.
In the case of a pre-existing database, it may be necessary to incorporate a method of translating data being sent from the applet to the database-system into a form comprehensible to the database-system and a method of translating data being sent from the database-system to the applet into a form comprehensible to the applet. In either of these latter cases translation software may be incorporated into the applet, or incorporated into the database-system, or both.
The utilisation of a signed Java applet for web crawling also confers other advantages upon the search process.
Generally speaking, described herein is an Internet based search engine which is installed on a server but operates in a distributed way in that it makes use of users' local PCs to update the search engine database.
The user accesses the search engine database from a local PC by means of a Java applet which may be downloaded from the search engine server. This applet is run when a search is carried out and returns a list of web page URLs in a conventional manner. However, when a user accesses one of the URLs identified by the search, the Java applet fetches the web page identified by the selected URL and checks the time stamp on the web page against the date of an entry for that URL in the search engine database. If the check shows that the web page fetched by the user is newer than the search engine database entry the Java applet takes further action. It either sends or forwards a copy, preferably in a compressed form, of the web page data to the search engine or it strips out key words from the web page and forwards these to the search engine. In this way the search engine database is updated as users use the search engine. Effectively, the web crawler software is distributed across a large number of local users' PCs.
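The freshness test the applet performs may be sketched as follows; the dates used are illustrative only:

```python
# Hedged sketch of the freshness test described above: compare the
# fetched page's last-modified time stamp with the date of the search
# engine database entry for that URL, and only report back when the
# fetched page is newer. Dates here are illustrative.

from datetime import datetime

def needs_update(page_last_modified, db_entry_date):
    """True if the fetched page is newer than the database entry."""
    return page_last_modified > db_entry_date

stale = needs_update(datetime(2000, 3, 1), datetime(2000, 1, 15))
```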
Preferably the Java applet is "signed", in other words, provided with a digital signature or certificate. An applet which is signed in this way is "trusted" and is permitted access to other servers. This is useful as it facilitates the Java applet forwarding web page data from these other servers to the database search engine server. It is also preferable that the signature gives access to the local hard disc of the user's PC to, among other things, allow web pages downloaded from the other servers to be cached on the local hard disc for faster retrieval, previewing and viewing. Typically, when the system is first activated the user will be asked "do you trust the search engine provider?" before access to the local hard disc/other servers is confirmed. Such digital signature/certification systems are provided by Verisign or other certificate authorities and use an RSA or other public key cryptography algorithm. In some cases an additional signature capability is necessary for access to controlled parts of the web browser system and separate signatures may be required for NETSCAPE (Registered trade mark) and/or Microsoft Internet Explorer (Registered trade mark).
We will also describe a means for registering statistics on users' approval or use of a given site presented by the search engine in response to a search request. When a user looks at a URL voting statistics are generated, which relate to the search engine query.
In a refinement, when a user bookmarks a particular site extra weight is given to the user's vote for that site. Search results can thus be presented ranked by their relevance to the system's users.
Other features include the provision of a scrolling list of search URL results (which is made possible by use of a Java applet) and a web page preview feature in which a reduced size version or reduced content version of a web page is displayed in a window when the user's cursor is momentarily held in position over a URL hyperlink.
Thus the system effectively provides a distributed web crawler or web spider which uses a signed Java applet for network access. Advantageously the system can be run on some workstations and/or other hardware, and in one embodiment the applet occupies less than 100K bytes with approximately a further 1 Megabyte allocated to local disc caching of downloaded web pages.
In a still further aspect the invention provides a web crawling system or applet to, when running, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
The purpose of such a web crawling Java applet is to crawl or spider the world wide web. That is to say, the purpose of this applet is to contact a web page and then analyse its contents. Such a web page will not generally be hosted on the server from which the applet originates.
Ordinarily, a Java applet is not permitted to access any server other than the server from which it originates. If the applet is signed however, that is to say, if it has been granted a digital certificate, it is permitted to access servers other than the server from which it originates.
The applet contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user. The applet then downloads and proceeds to analyse the contents of that page.
When the applet has performed its analysis it uploads its findings, for storage and later access, to a server hosting a database system.
The findings may be uploaded in an encrypted form, or a compressed form, or an encrypted and compressed form.
An advantage of compressing the data prior to uploading it to the database system is that the time required to upload the data in a compressed form will be generally less than that required to upload the same data in an uncompressed form. Accordingly the applet's connection will be less busy and therefore the applet will have more bandwidth available for spidering.
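A minimal sketch of this compression step, using the standard zlib library purely by way of example, is as follows:

```python
# Illustrative: compress the findings before upload; the smaller payload
# transfers faster, leaving more of the connection's bandwidth available
# for spidering. zlib is used here only as an example codec.

import zlib

findings = b"keyword list and ratings for http://example.com/page" * 20
compressed = zlib.compress(findings)
restored = zlib.decompress(compressed)
```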
Preferably the system has a graphical user interface (GUI). Typically in this system a search term is submitted to the database system via the applet and the database system accordingly returns its findings to the applet which the applet then displays. The GUI permits user interaction with the central database of a search engine.
Preferably the Java Applet Graphical user interface accepts from the user a word or phrase which the user wishes to submit to the search engine (Search Term Acceptance). Typically, but not necessarily, this will comprise a text box into which the user can type a search term or a voice recognition system into which the user can announce a search term.
The applet then submits that search term to a database system which has been specially constructed or adapted for this purpose. The search term may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form. After consulting its store of information relating to the search term, the database system returns its findings,
or results, to the applet. The findings of the database system may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form. The applet decrypts or decompresses or decrypts and decompresses the data as appropriate and then presents the results to the user.
Typically, but not necessarily, the database system may also download to the applet one or more URLs of web pages which it would like updated with a request that the applet contact the page represented by the URL, analyse the page, and upload its findings as described earlier.
The code which comprises the "web crawler" is not necessarily written in Java and therefore does not necessarily comprise an Applet. Moreover it is not necessary for the software to "crawl" in the sense of copying itself from computer to computer. The method and system for crawling the web is preferably directly integrated into the code for the web browser, that is to say, the code for the web browser and the code for the crawler are in the same executable file.
A stand-alone distributed web crawler may comprise an executable file which when run may have only the very simplest interface consisting of a 'stop' button or other means of halting the execution of the program.
More typically, but not necessarily, the executable file which comprises a browser incorporates a system and method which calls the executable file which comprises the stand-alone distributed web crawler.
The purpose of the stand-alone distributed web crawler is to crawl or spider the world wide web. That is to say, in the context of this description, the purpose of this crawler is to contact a web page and then analyse its contents then upload the analysis to a database-system. The code which comprises the web crawler is preferably, but not necessarily, written in Java (Regd. T.M.) and does not necessarily comprise an Applet.
The stand-alone distributed web crawler contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user. The stand-alone distributed web crawler then downloads and proceeds to analyse the contents of that page. When the stand-alone distributed web crawler has performed its analysis it uploads its findings, for storage and later access, to a server hosting a database system.
There is also envisaged a system and method comprising software which accepts data from an applet as described previously and translates that data into a form, type, language or schema compatible with the form, structure or language of a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos, Euroseek (Registered Trade Marks)).
Typically, the data being sent from the signed Java (Regd. T.M.) applet will comprise either queries or the web page-analysis findings of the signed Java applet for inclusion in the database.
There is further envisaged a system and method comprising software which accepts data from a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos or Euroseek (Registered Trade Marks)) and translates that data into a form, type, structure, style, language or schema compatible with an applet as described previously.
Typically, the data sent or retrieved from the existing database to be processed by the system or method will comprise search results returned in response to queries submitted to the existing database via a signed Java applet.
In both cases the data will typically, but not necessarily, be sent in a compressed and/or encrypted form.
In a further aspect the invention provides a database security system and method.
In the above described systems it is desirable to determine whether the data which is uploaded onto the database-system is sent by a bona fide applet of the type described earlier, and that the data the applet uploads is therefore genuine data and not data uploaded maliciously by an algorithm masquerading as a bona fide applet.
To assist in ensuring that the data which is uploaded onto the database-system is genuine, the uploaded data may be put in a holding data-structure or database in the database-system or may be placed in the database-system proper with a flag to indicate that the data has not yet been confirmed as valid.
For data not confirmed as valid, that is to say, for data which purports to represent the findings of an applet's "spidering" of a particular web page, confirmation can be obtained by re-spidering that web page one or more times with applets known to be at a location different from the location from which the initial spidering findings were received (spidering of a page, here, means simply accessing information on the page).
If the re-spidering of that same web page is undertaken by an applet at a location other than the location of the initial spidering and the findings of the re-spidering are identical or similar to the findings of the initial spidering then this will provide a degree of confirmation that the data represents the content of that web page.
Conversely, if the re-spidering of that same web page is undertaken by an applet at a location other than the location of the initial spidering and the findings of the re- spidering differ, or significantly differ, from the findings of the initial spidering then this will provide a degree of confirmation that the data does not represent the content of that web page.
This re-spidering can be repeated in the manner described and with each confirmation that the data is valid the degree of confidence that the data represents the content of the page at that URL increases, such that after a small number of re-spiderings the probability of the data being invalid is significantly reduced.
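The confirmation scheme above may be sketched as follows. The use of page checksums as the compared "findings", and the threshold of two agreeing re-spiderings before the validity flag is cleared, are illustrative assumptions; the description does not prescribe either.

```java
import java.util.List;

// Sketch of validating initial spidering data by re-spidering the same page
// from other locations (all names and the threshold are illustrative).
public class SpiderValidator {
    // Returns true when enough re-spiderings from other locations agree with
    // the initially reported checksum to treat the data as confirmed valid.
    static boolean confirmed(long initialChecksum, List<Long> respiderChecksums) {
        int agreeing = 0;
        for (long c : respiderChecksums)
            if (c == initialChecksum) agreeing++;
        return agreeing >= 2;   // assumed confidence threshold
    }

    public static void main(String[] args) {
        System.out.println(confirmed(42L, List.of(42L, 42L)));  // agreement: flag cleared
        System.out.println(confirmed(42L, List.of(7L, 42L)));   // disagreement: still held
    }
}
```

A disagreeing re-spidering could equally trigger further re-spiderings rather than immediate rejection, since the page itself may have changed between visits.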
It is desirable that a search engine is able to determine which web pages are of the greatest interest to the user and is therefore able to return results ranked according to some criterion for relevance.
Thus in a still further aspect the invention provides a search system for returning results ranked according to relevance as determined by, for example, search term density and/or likelihood of user interest. There is thus also provided a method to determine the ratio of the number of appearances of a search term in a particular page to the size of that page.
Since a page with only a small number of references to a search term is likely to be of less interest to the user than a page of the same size with a larger number of references to the same search term, a search term ratio or search term density of a web page can be defined as the ratio of the number of occurrences of the search term on a page to the size of that page. Other things being equal, it is preferable for a search engine to return results having a high density of references rather than a low density of references. That is, other things being equal, the greater the value for the search term density, the greater the likelihood of that page being of interest to the user with respect to that reference.
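The search term density defined above can be sketched as follows. Measuring page size in characters is an assumption for illustration; the embodiment could equally use the page file size in bytes.

```java
// Minimal sketch of the search term density calculation: occurrences of the
// term divided by the size of the page. Names are illustrative, not from the
// description.
public class TermDensity {
    static double density(String pageText, String term) {
        if (pageText.isEmpty()) return 0.0;
        int count = 0, from = 0;
        // Count non-overlapping occurrences of the term in the page text.
        while ((from = pageText.indexOf(term, from)) != -1) {
            count++;
            from += term.length();
        }
        return (double) count / pageText.length();
    }

    public static void main(String[] args) {
        String page = "java applets spider the web; applets index pages";
        System.out.println(density(page, "applets"));
    }
}
```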
Typically, the database system will comprise one or more tables or relations or other data structures in which each search term will be associated with the URL of each web page which contains that search term, and the search term density of that search term in that page.
A user who consults the world wide web will often have a particular question in mind. Questions tend to fall into two categories: those that require a simple 'yes' or 'no' answer and those that require a fuller answer. This latter category of questions often commences with 'What', 'Why', 'When', 'Where', 'How' or 'Who'.
It is often the case that when a particular question appears on a web page that web page then discusses possible answers to that question. Accordingly a means is also provided of determining on which page(s) a particular question appears, which will assist the user in obtaining an answer to that question.
One embodiment considers sentences beginning with 'What', 'Why', 'When', 'Where', 'How' or 'Who' and terminating with '?'. By compiling a directory of questions of this form associated with the URLs of the pages on which they appear, a directory of likely pages where the corresponding answers can be found is obtained.
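The question-directory embodiment above can be sketched as follows. The regular expression and the directory structure (a map from URL to the questions found on that page) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: collect sentences beginning with an interrogative word and
// terminating with '?', keyed by the URL of the page on which they appear.
public class QuestionDirectory {
    static final Pattern QUESTION = Pattern.compile(
        "\\b(What|Why|When|Where|How|Who)\\b[^.?!]*\\?");

    static List<String> extractQuestions(String pageText) {
        List<String> questions = new ArrayList<>();
        Matcher m = QUESTION.matcher(pageText);
        while (m.find()) questions.add(m.group());
        return questions;
    }

    public static void main(String[] args) {
        Map<String, List<String>> directory = new HashMap<>();
        directory.put("http://example.com/faq",   // hypothetical URL
            extractQuestions("What is a web crawler? It visits pages. How does it scale?"));
        System.out.println(directory);
    }
}
```

A user's question could then be matched against this directory to return the pages most likely to contain the corresponding answer.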
These and other aspects of the present invention will now be further described, by way of example only, with reference to the accompanying figures in which:-
Figures 1a and 1b show a block diagram of an Internet search system according to an embodiment of an aspect of the invention;
Figure 2 shows a block diagram of a user's computer in an embodiment of the invention;
Figures 3a to 3c show a flow diagram of a user registration and background spidering process;
Figure 4 shows a flow diagram of a process for downloading a web page and applet from a web server;
Figure 5 shows a flow diagram of a server process for the user registration and background spidering process of Figure 3;
Figure 6 shows a flow diagram of a search and spidering process on a user's computer;
Figure 7 shows a flow diagram of a server process for the search and spidering process of Figure 6;
Figure 8 shows a flow diagram of a graphical user interface thread for a search process for a user's computer;
Figure 9 shows dataflows in search and spidering processes according to an embodiment of an aspect of the present invention;
Figure 10 shows an exemplary graphical user interface for a search system according to an embodiment of the present invention; and
Figure 11 shows an exemplary plurality of concurrently running program threads of the search and spidering process of Figure 6.
Referring first to Figures 1a and 1b, these together show a block diagram of a search system 100 according to an embodiment of the present invention.
In Figure 1a a user terminal 102 is connected to the Internet 114. Further user terminals 104, 106, and 108 are also connected to Internet 114, via LAN (local area network) 110 and Internet gateway 112. Connected to the Internet 114 are a plurality of sources of information, represented in Figure 1a by web servers 116a to e. Data for user searching and for system spidering is stored on web servers 116a to e. The world wide web represents objects in HTML (hypertext markup language) format and transfers data via HTTP (hypertext transfer protocol). However, the skilled person will be aware that the Internet also provides access to data via other protocols such as, for example, FTP (file transfer protocol) and Gopher. In the description of the embodiment of the invention which follows, for simplicity, reference is made to searching data on web servers, although in practice the invention is not restricted to data available via this format.
A search and spidering system web server 118 is coupled to the Internet 114, via a firewall 117 for security. The system web server 118 provides a search system (home) web page including a search applet, that is, a Java (registered trade mark) program for execution within a supporting web browser. The system web server 118 is coupled to web page and applet code storage 120 within which the applet is stored as a signed jar (Java archive).
A digital signature authenticates the Java applet as originating from the search system service provider. When the Java applet is downloaded to a user terminal a window is displayed together with the name of the service provider and a certification authority and the user is asked whether or not to trust content from the service provider. The digital signature authenticates the origin of the Java applet as the service provider and the user is thus provided with sufficient information to enable the applet to be trusted. Once the applet has been marked as trusted it is given extended permissions by the web browser which allow it to perform the functions described below, such as reporting indexed content data to the service provider.
Web browsers such as Microsoft Internet Explorer (registered trade mark) and Netscape Navigator (registered trade mark) automatically recognize a signed Java applet and implement such security procedures. Providing a web page including a signed Java applet is the preferred implementation of the system, but in other embodiments other security arrangements may be employed.
The search system home web page is a static web page 122 comprising graphics and an HTML tag 124 including a URL (uniform resource locator) pointing to the Java applet in code storage 120.
Referring now to Figure 1b, this also shows the web server 118 and code storage 120 of Figure 1a, together with a data collection server 122 and a query servicing server 124. Each of servers 118, 122, and 124 has a separate URL. The URL of web server 118 is
accessed by a user's web browser to download the system home page; the URLs of servers 122 and 124 are accessed by the Java applet code running on the user's machine. The data collection server 122 includes data collection code storage 122a and is coupled to a system data store 126. The query servicing server 124 includes query serving code storage 124a and is coupled to a user data store 128, as well as to the system data store 126 for returning search results. Some or all of the stored code and/or data may be stored on a removable storage medium, illustratively shown by disk 130.
Broadly speaking, the data collection server 122 manages data collection or spidering functions for the system and query servicing server 124 handles user queries. In a preferred embodiment of the system, search results are provided to a user together with a so-called "URL tax" of sites which the user's computer is to spider. For this reason query servicing server 124 is coupled to data collection server 122.
The system web server 118, data collection server 122, and query servicing server 124 may comprise computer programs implemented on dedicated machines or, as will be understood by the skilled person, two or more of these servers may be implemented on the same machine.
The system data store 126 preferably includes a list of all known URLs, although in practice at any one time the database will include URLs which are no longer in existence and will not include some new URLs. The basis of such a list is obtainable from the authorities who are responsible for overseeing registration of domain names, such as Network Solutions Inc., although it may be necessary to combine lists of URLs obtained from two or more such authorities. Over time the list may be enhanced by server and user-based spidering as described later. Embodiments of the system may include a subset of known URLs, for example to provide a language-based search facility, rather than attempt to include all known URLs. Associated with each URL is status data including a time stamp indicating when the status data was last updated, a "date last modified" date, normally provided on web pages to indicate when the page was last modified, a checksum based on the web page data, and a web page file size.
The database also includes indexed content data for the web pages (also referred to as URL spidering data) as described in more detail below, and page rating data to provide one or more ratings of, for example, popularity, utility, and the like. The system data store 126 may comprise, in one embodiment, of the order of 10^10 URLs and associated data stored in of the order of 1 TB of RAID (redundant array of inexpensive disks) storage.
The data store 126 may comprise a relational or object-orientated database, such as an Oracle or DB2 database, or it may comprise a proprietary database as described below. Data within the database is accessed by a user's search keyword although popular combinations of keywords may have their own entries. Taking into account the possibility of searching in a variety of languages, and searching for proper names and acronyms, provision for up to 10^7 keywords may be necessary.
In an exemplary proprietary format each keyword has its own file comprising a list of URLs referencing that keyword. This URL list is preferably ordered by default criteria so that retrieved search results are automatically provided in order of relevance. The ordering of results where keywords are combined in a search term and the same URL appears under two (or more) keywords may, for example, be based upon the relative position of the URLs concerned in the two lists. Thus when the database is updated new indexed content is preferably inserted at an appropriate place within the relevant ordered list or lists. With this proprietary format images of the files of popular keywords may be held in RAM for speed.
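The combination of two ordered keyword lists described above can be sketched as follows. The concrete rule used here, ranking a URL that appears under both keywords by the sum of its positions in the two lists, is one plausible reading of "based upon the relative position of the URLs concerned in the two lists"; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of merging two per-keyword ordered URL lists for a two-keyword
// search term: only URLs present in both lists are returned, ordered by the
// sum of their positions (an assumed combination rule).
public class KeywordCombiner {
    static List<String> combine(List<String> listA, List<String> listB) {
        Map<String, Integer> score = new HashMap<>();
        for (int i = 0; i < listA.size(); i++) score.put(listA.get(i), i);
        List<String> combined = new ArrayList<>();
        for (int i = 0; i < listB.size(); i++) {
            String url = listB.get(i);
            if (score.containsKey(url)) {
                score.put(url, score.get(url) + i);   // combined positional score
                combined.add(url);
            }
        }
        combined.sort(Comparator.comparingInt(score::get));   // best (lowest) first
        return combined;
    }

    public static void main(String[] args) {
        List<String> keywordOne = List.of("u1", "u2", "u3");
        List<String> keywordTwo = List.of("u3", "u1");
        System.out.println(combine(keywordOne, keywordTwo)); // u1 (0+1) before u3 (2+0)
    }
}
```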
The data collection server 122 provides URLs for spidering to the Java applet running on a computer operated by a user of the search system and receives indexed content data back from the applet for storage in system data store 126. In one embodiment of the system aspects of this process and of system data store 126 are optimized by the system, preferably automatically. This self-optimization may be performed by the data collection code by, for example, making a small modification to a parameter and
measuring any resulting change in system performance to determine whether the performance is improved or detrimentally affected by the modification.
Global parameters which may be modified by such a procedure include the number of keyword combinations having their own separate entry in system data store 126, the number of keyword files cached, and the length of time an unaccessed file is retained in a cache. User ("client")-specific parameters include the number of URLs in each batch sent to the client for spidering, and the URLs selected for spidering, in particular their proximity to the user's URL - the user's "catchment area" - as described further below. The client-specific parameters are preferably optimized separately during each session a user is logged-on, for example, to optimize use of available bandwidth to the user's (client's) computer.
User data store 128 stores data relating to specific users or clients of the search system. Thus in a preferred embodiment user data store 128 comprises user identification data such as a user number, a user name and password for accessing the system, a user e-mail address for marketing purposes and user (search) term data as described in more detail later with reference to the USER TERM table. Optionally the user data store 128 may also store a user internet address (which may be temporary or the address of a gateway). The user term data includes a history of search terms frequently used by a user which can be employed, for example, to generate a news or update service and to alert a user to new websites in which they may have an interest.
The user data store 128 may also include a user rating, for example, a "blacklist" flag which can be used to exclude unwanted users from the system. Preferably the data store also holds each user's normal IP address (this could be the IP address of a company gateway such as gateway 112 of Figure 1a), for catchment area-related searching as described later.
Other data stored in user data store 128 preferably includes BOOKMARK and FOLDER tables (described later) to store and organize a user's bookmarks. The database may also
store user settings data for storing users' preferences. In one embodiment the user settings data defines the number of results returned by a search, an age cut-off for search result web pages, whether or not the user wishes to take advantage of the user search term storage facility, and if the user does request this facility, the frequency of news updates and an option for e-mail notification of updates.
Referring now to Figure 2, this shows an example of a user's computer which, as illustrated, comprises a conventional, general purpose personal computer 200 suitably programmed.
Personal computer 200 comprises a pointing device 206, such as a mouse, a keyboard 208, and a display 210, all for providing a user interface. An Internet interface 204 is provided for connecting the computer to Internet 114; this may comprise any conventional communications interface such as a modem or a local area network interface (which provides an indirect interface to the Internet). The computer includes a processor 212 which loads and implements program code stored in permanent program memory 218, such as a hard disk drive. Data for use by program code running on the processor is stored in permanent data memory 216 (which again may comprise a hard disk drive) and a working memory 214 is provided for use by processor 212 during its operation. The program code and data in memories 214, 216, and 218 may be stored on a removable storage medium, as illustrated by floppy disk 220. All the components of computer 200 are linked by computer bus 202.
Processor 212 loads and implements a web browser 212a such as Internet Explorer (registered trade mark) or Netscape Navigator (registered trade mark) and, optionally, an e-mail application (not shown). When computer 200 accesses the search system's home web page a signed Java applet 212b also runs in computer 200. This is either downloaded from system web server 118 or loaded from permanent program memory 218 (when the applet code has been cached by web browser 212a following an earlier access to the search system web page).
In use the applet code is also stored in working memory 214, together with a list of URLs for spidering, HTML files for web pages retrieved by the user's computer (either for indexing or, equivalently, as search results), indexed content data, and a list of search result URLs. The list of search result URLs may also be stored in permanent data memory 216 together with, optionally, a list of the user's "favourite" bookmarked URL references. The user's bookmarks are also stored in user data store 128 and the list of bookmarks and search results list are only updated if the user chooses to save this data locally.
Web browser 212a includes cryptography code to recognize the Java applet's digital signature and to display a certificate, together with a company name, offering the user a choice of whether or not to trust the service provider. If the "trust" option is accepted web browser 212a gives signed Java applet 212b extended permissions, for spidering web pages and reporting indexed content data to the service system provider. Permanent data memory 216 may store data indicating that applet code from the search system service provider is always to be trusted.
Referring now to Figures 3a to 3c, these together show a flow diagram of a user registration and background spidering process. The flow chart illustrates steps performed by search/spidering applet code running on a user's personal computer 200. In particular, the flow chart shows a background spidering process which runs continuously on computer 200, according to the available processing and communications bandwidth, when the user is not performing a search. Preferably the process continues to run in the background during a search, although bandwidth limitations may cause the process to run slowly. As described in more detail below, in a preferred embodiment the process is a multi-threaded process; the flow chart shows steps in both a master (or control) thread and a spidering thread.
At step S300 the search system home page 122 and signed Java applet are downloaded from system web server 118 to a user terminal such as personal computer 200. As explained with reference to Figure 1a, web page 122 includes a URL to the Java applet
code, which is downloaded separately from the web page text and graphics. If the user has previously accessed the search system home page the applet code and, in some instances the web page text and graphics, may be locally cached on the user's machine. The search system may force an update of such locally stored applet code by, for example, changing the applet's file name.
At step S302 the user's web browser 212a runs the downloaded applet 212b which, at step S304, establishes a socket connection with data collection server 122. The socket comprises a bi-directional virtual connection between the applet and the data collection server. Once the socket is established, at step S306 the applet sends initialization data to the data collection server 122 comprising, for example, an applet version number. The applet then, at step S308, receives a list of URLs for spidering from the data collection server. Associated with each URL is a date retrieved from system data store 126 indicating the last date (and/or time) when the data in data store 126 associated with that URL was verified and/or updated. Also associated with each URL is a checksum, again retrieved from data store 126, calculated from the web page data pointed to by the URL. The checksum is, in one embodiment, calculated using the entirety of the web page data including HTML tags, although in other embodiments data within HTML tags may be ignored.
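The page checksum above can be sketched as follows. The description does not name a checksum algorithm, so CRC32 over the full page bytes (HTML tags included, as in the described embodiment) is used here purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative page checksum: CRC32 over the entire page data, HTML tags
// included. The algorithm choice is an assumption, not taken from the text.
public class PageChecksum {
    static long checksum(String pageHtml) {
        CRC32 crc = new CRC32();
        crc.update(pageHtml.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        String page = "<html><body>Hello</body></html>";
        // An unchanged page yields the same checksum, so re-indexing can be skipped.
        System.out.println(checksum(page) == checksum(page));
    }
}
```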
The applet may process each URL sequentially, downloading content from a first URL, indexing this and reporting back to the data collection server, and then processing the next URL. However, it is more efficient if the applet processes a plurality of URLs in parallel, for example, using a separate thread for each. Web pages from some URLs will download more quickly than web pages from others and a multi-threaded process facilitates making use of this. Thus, at step S310, the applet selects a first batch of URLs to be processed from the list of URLs received, for example the first ten URLs in the list, and starts a new thread for spidering each one. The process illustrated up to step S310 is the master or control thread; step S312 is the first step of one of the new URL spidering threads created at step S310. At step S310 the control thread halts and waits.
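The batching and threading at step S310 can be sketched as follows. The worker body is a stub standing in for steps S312 onwards, and the use of an executor and latch (rather than raw threads) is an idiomatic assumption; names are illustrative.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the master thread starting one spidering thread per URL in the
// batch, then halting and waiting (step S310) until all workers complete.
public class SpiderBatch {
    static int spiderBatch(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        CountDownLatch done = new CountDownLatch(urls.size());
        AtomicInteger processed = new AtomicInteger();
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    // Steps S312 onwards would go here: request the URL header,
                    // compare dates and checksums, download and index the page.
                    processed.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();   // the control thread halts and waits, as at step S310
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return processed.get();
    }

    public static void main(String[] args) {
        int n = spiderBatch(List.of("http://example.com/a", "http://example.com/b"));
        System.out.println(n + " URLs spidered");
    }
}
```

Because slow pages do not block fast ones, this parallel arrangement makes better use of the available bandwidth than sequential processing, as the text notes.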
At step S312 the URL spidering thread of the applet sends a URL header request to the URL it is processing, requesting header data from that URL. The header data includes a "date last modified" - i.e. the date at which the web page was last updated, and web page summary data.
The applet receives URL header data from the URL to be processed and, at step S314, checks whether or not the header data includes a date-last-modified for the web page. If there is no date-last-modified the applet proceeds to step S318 in Figure 3b, otherwise the applet checks, at step S316, whether the date-last-modified is later than the URL date received from data collection server 122.
If the date the web page was last modified is later than the URL date the thread again proceeds to step S318; otherwise the thread proceeds to step S338 of Figure 3c. At step S338 the main control thread checks whether or not all the URLs received at step S308 have been processed. If they have not the existing thread, which has just finished processing its last URL - that is the spidering thread of step S312 et seq, is reassigned to a new URL to be processed (step S340) and the process then loops back to step S312. Otherwise, if all the URLs received from the data collection server have been processed, the applet requests a new list of URLs for processing from the data collection server at step S342. The main control thread then again reassigns the completed thread to a new URL and, again, the process then loops back to step S312.
The Java code handles signalling between the master/control thread and the URL spidering threads, enabling the control thread to detect when a spidering thread completes.
Referring to step S318 of Figure 3b, if the date the web page was last modified is later than the corresponding date in system data store 126, at step S318 the applet URL spidering thread requests the full web page data from the URL, excluding any data such as graphics and included pages indicated by links within the page. Then, at step S320, the applet caches the downloaded web page in case the user should wish to preview the
web page contents, as described later. This caching function is provided by the applet 212b rather than the web browser 212a.
At step S322 the applet calculates a checksum for the downloaded web page and, at step S324, checks whether the calculated checksum is equal to the checksum associated with the URL received at step S308 from the data collection server. If the checksums are the same the process continues at step S336 where the applet sends the URL (or a URL identifier) and the results of the date and checksum checks back to data collection server 122. The date is returned because the web page date-last-modified may have been updated without any change in the web page content. The process then continues at step S338, as described above.
If, at step S324, the system checksum and the checksum calculated from the web page differ the applet then proceeds to analyse the web page contents and report back to the data collection server, which stores the results of the analysis in system data store 126. More particularly, the process continues at step S326 at which the applet stores links to other pages and sub-pages (frames) in the downloaded web page in working memory 214 for return to the data collection server with compressed indexed content data, as described later.
Following this, at step S328, the applet compiles a list of all words on the web page except for HTML tags. Preferably such "words" are not restricted to dictionary words but include acronyms and, more generally, alphanumeric character strings. This is useful when searching for product numbers, specifications, invented names and the like.
Once the list of words has been compiled the applet, at step S330, discards unwanted words from the list. A list of these unwanted words is stored within the applet itself and comprises common English (and other language) words such as "the", "and", "&", and certain obscene and/or offensive words.
For each word remaining in the list, at step S332 the applet determines a word rating. The word rating may be determined from one or more of word frequency, the relative font size of the word as compared with other text on the page, and the word's location, for example, whether it appears in a heading, a URL, a hypertext link, an HTML tag, or in some other location. Other conventional word rating methods may also be employed.
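Steps S328 to S332 can be sketched together as follows. Only word frequency is used as the rating here; font size and location weighting would be layered on in the same way. The tokenization rule (alphanumeric strings with internal hyphens kept, so product numbers like "XJ-900" survive) and the stub stop-word list are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of compiling the word list, discarding unwanted words held in the
// applet, and rating each remaining word by frequency (steps S328-S332).
public class WordRater {
    // Stub stop-word list; the applet's real list also covers other languages
    // and certain obscene and/or offensive words.
    static final Set<String> UNWANTED = Set.of("the", "and", "&");

    static Map<String, Integer> rateWords(String pageText) {
        Map<String, Integer> rating = new HashMap<>();
        // Split on anything that is not a letter, digit or hyphen, so
        // acronyms and product numbers are kept as single "words".
        for (String token : pageText.toLowerCase().split("[^a-z0-9-]+")) {
            if (token.isEmpty() || UNWANTED.contains(token)) continue;
            rating.merge(token, 1, Integer::sum);   // frequency-based rating
        }
        return rating;
    }

    public static void main(String[] args) {
        System.out.println(rateWords("The spider and the web: spider XJ-900"));
    }
}
```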
Once the rating for each word has been determined the applet, at step S334, compiles compressed URL spidering data comprising URL identifying data, a current date (either from the user's personal computer 200 or, preferably, as supplied by the search system), a page checksum, the word list and word rating data for each word, a list of links from the page as stored by the applet at step S326, and a page file size. The indexed content data in system data store 126 is drawn from this URL spidering data. At step S334 the applet compresses this URL spidering data and sends it to data collection server 122 for updating system data store 126. The URL spidering thread then halts while, at step S338, the control thread checks whether or not all URLs have been processed and, if they have not, the control thread reassigns the spidering thread to a new URL and the process begins again at step S312.
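The compression of the URL spidering data before upload can be sketched as follows. The description does not fix a compression scheme or record encoding; GZIP over a serialized record, and the field layout shown, are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of compressing the URL spidering data (step S334) for upload to the
// data collection server, and the server-side decompression.
public class SpiderReport {
    static byte[] compress(String urlSpideringData) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(urlSpideringData.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical record layout for the spidering data fields named above.
        String record = "url=http://example.com|date=2000-05-01|checksum=123|size=2048";
        System.out.println(decompress(compress(record)).equals(record)); // round trip
    }
}
```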
The function of the applet in downloading and indexing ("spidering") web page data has been described, but the applet is not restricted to downloading HTML data. For example, in a preferred embodiment the applet also spiders data in Adobe (Registered Trade Mark) portable document format (.pdf), as well as data in other formats. The applet may also index content contained within multimedia documents, data files or other objects.
Referring now to Figure 4, this shows a flow diagram of a process for downloading web page 122 and its associated applet from system web server 118.
At step S400 web server 118 receives a request for the search system home page from web browser 212a of user's computer 200. The web server then, at step S402, sends the text and graphics for the home web page to the user's browser. The web browser then determines, at step S404, whether or not the applet is cached in the computer's
permanent memory 218 and, if the applet is cached, the process ends at step S410. If the applet is not cached (or, equivalently, if the applet's file name has been changed) at step S406 the web server 118 receives a request for the applet from the user's web browser, the applet having its own specific URL. The web server then, at step S408, retrieves the applet from code storage 120 and sends the signed Java applet, as a signed JAR, to the web browser where the user is asked by the web browser whether or not to trust content (i.e. the applet) from the service provider. The process then ends, again at step S410.
Figure 5 shows a flow diagram for the background spidering process described with reference to Figure 3a, as implemented on data collection server 122. Thus, at step S500 (which corresponds to step S304 of Figure 3a) data collection server 122 is contacted by applet 212b and a socket connection is established between a data collection server communication process thread and a background spidering thread of an applet running on user's computer system 200. Each of the many user computer systems connected to the data collection server at any one time is allocated a separate socket connection and a separate process thread on the server.
At step S502 the data collection server receives initialization data from the applet including, for example, a version number of the applet which the data collection server can use to select a data communications protocol and/or data format for communicating with the applet. At step S506 the data collection server receives a request for a URL list from the applet for background spidering and, at step S508, the data collection server determines the next URLs which are to be updated. This determination may be made based upon recency, popularity, proximity, or on some other basis.
A determination based upon recency may, for example, select for updated spidering those URLs for which the greatest time has elapsed since they were last updated or, additionally or alternatively, may include new URLs which have not been spidered. A determination based upon popularity may be arranged to ensure that those URLs most frequently appearing in search results are most frequently checked and if necessary
updated. Selection of URLs by proximity is described in more detail below. In some embodiments a combination of two or more of these criteria may be employed in order to determine which URLs are next to be sent to a user's computer for spidering to update the URL's records.
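The recency criterion above can be sketched as follows: never-spidered URLs are selected first, then those longest unvisited. A combined recency/popularity/proximity score could replace the comparator; the record fields and names are assumptions.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of the data collection server selecting the next batch of URLs for
// updating, by recency (step S508). Fields and names are illustrative.
public class UrlScheduler {
    record UrlStatus(String url, Long lastUpdated) {}   // null = never spidered

    static List<String> nextBatch(List<UrlStatus> all, int batchSize) {
        return all.stream()
            .sorted(Comparator.comparing(
                s -> s.lastUpdated() == null ? Long.MIN_VALUE : s.lastUpdated()))
            .limit(batchSize)
            .map(UrlStatus::url)
            .toList();
    }

    public static void main(String[] args) {
        List<UrlStatus> urls = List.of(
            new UrlStatus("http://a.example", 100L),
            new UrlStatus("http://b.example", null),   // never spidered: first
            new UrlStatus("http://c.example", 50L));
        System.out.println(nextBatch(urls, 2));
    }
}
```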
Where an applet has already established a socket connection with the data collection server and requests further URLs for spidering (as in step S342 of Figure 3c) the process is entered at step S504 and, again, at step S506 the data collection server receives a request for a list of URLs to spider from the applet.
At step S510 a list of the selected URLs is sent to the applet and, at step S512, the selected URLs are marked in system data store 126 as "pending", for example by means of a flag. The "pending" flag indicates that a URL has been selected for updating but has not yet been updated and the selection (at step S508) preferably ensures that once a URL has been marked as pending it is not again selected for updating by a different user. Preferably the "pending" flag has a timed expiry so that if no spidering results relating to that URL are received from a user's computer after a predetermined interval the URL is again made available for selection for spidering by the same or another user. This ensures that those URLs which are dispatched for spidering but which are not in fact spidered, for example because computer 200 is switched off before they are processed, may be re-selected. The "pending" flag is also cancelled once updated spidering data relating to that URL is received from a user's computer.
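A minimal sketch of the "pending" flag with timed expiry might take the following form; the class structure, method names and the use of caller-supplied timestamps are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the "pending" flag with timed expiry: a URL marked
// pending is unavailable for re-selection until either updated spidering
// data arrives or the flag times out.
public class PendingTracker {
    private final Map<String, Long> pendingSince = new HashMap<>();
    private final long timeoutMillis;

    PendingTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Called when a URL list is dispatched to an applet (cf. step S512).
    void markPending(String url, long now) { pendingSince.put(url, now); }

    // A URL may be selected again if it was never dispatched, or if its
    // pending flag has expired without results being returned.
    boolean selectable(String url, long now) {
        Long since = pendingSince.get(url);
        return since == null || now - since > timeoutMillis;
    }

    // Called when updated spidering data for the URL is received.
    void clearPending(String url) { pendingSince.remove(url); }

    public static void main(String[] args) {
        PendingTracker t = new PendingTracker(60_000);  // one-minute expiry
        t.markPending("http://example.com/", 0);
        System.out.println(t.selectable("http://example.com/", 30_000));  // false: pending
        System.out.println(t.selectable("http://example.com/", 90_000));  // true: expired
        t.clearPending("http://example.com/");
        System.out.println(t.selectable("http://example.com/", 30_000));  // true: cleared
    }
}
```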
At step S514 the data collection server waits for spidering data to be received from an applet (corresponding to the data sent at steps S334 and S336 of Figure 3). The data collection server receives URL spidering data from one of the many applets running on the plurality of users' computers which may be connected at any one time to the search system, at step S516. A separate data reception process is started for each return from a system user so that in practice, at any one time, there will be a plurality of concurrent reception processes operating on data collection server 122. Such processes may be implemented in a conventional manner on data collection server 122 using Java.
At step S518 the data collection server checks whether the received URL spidering data comprises indexed content data (corresponding to the data sent by the applet at step S334 of Figure 3) or merely bibliographic data (such as that sent at step S336 of Figure 3). If the received URL spidering data does not contain indexed content data the data collection server, at step S520, updates the bibliographic data for the relevant URL in system data store 126 with information indicating when the URL was last checked. This information is received from the applet and comprises a time stamp and, if available, a date-last-modified for the web page. The system then loops back to step S514 to wait for further spidering data from the same or another applet. Alternatively, where step S516 is implemented as a plurality of concurrent processes, the process receiving data from the applet halts or waits, although the socket connection to the applet remains open (since one process is allocated to serve each user's computer).
If the received data was determined, at step S518, to include indexed content data the bibliographic data in system data store 126 is updated, at step S522, in a corresponding way to that described with reference to step S520. In addition, however, at step S524 (bibliographic) checksum data for the updated indexed content is also written into system data store 126. Also at step S524 the data collection server writes updated indexed content data into system data store based upon the word list and rating data received from the applet as described above with reference to step S334 of Figure 3. The process then again loops to step S514, waiting for further data from the user's applet. Preferably a user or applet identifier, such as a username, is also stored to indicate the origin of the new or updated indexed content data, to help detect and reduce the risk of fraud by, for example, unauthorised passing of data into the system data store.
Referring back to step S508 above, URLs for updating by a user's computer's applet may be selected partially or completely on the basis of whether or not they are within a URL "catchment area" defining URLs of a selected or predetermined proximity to a user's effective IP address.
In a preferred embodiment the system data store 126 stores a list of URLs for substantially every web page on the Internet to be covered by the search system. Some of these web pages are new and have never been spidered, and some may need checking for updates, for example, because they were last checked more than 24 hours previously. The data collection server prioritises the URLs to be spidered according to how recently they were last checked and, starting with the least recently checked pages, URLs are sent to instances of applet 212b residing on the computers of users who are currently connected to the search system.
The URLs to check may be selected substantially at random, for example, for security reasons, to reduce the risk of biased or erroneous data being submitted to the database. However in other embodiments of the system the spidering process can be made more efficient by selecting URLs a user's computer receives for spidering based upon the physical or logical connection of a user's computer to the Internet in relation to the physical or logical locations of the URLs to be spidered. More specifically, there are likely to be fewer bandwidth bottlenecks to locations on the Internet (or other network) which are close to the user's computer as compared with those which are more distant. For example, if a user connects to the Internet via Internet Service Provider A, that user's computer's applet is likely to be able to spider websites hosted by that Internet Service Provider more easily than websites hosted by another Internet Service Provider who is physically and logically more distant. This strategy is effectively "cyber green" since it tends to reduce the level of long-distance IP traffic.
An Internet address comprises four 8-bit octets normally written in decimal notation, for example, 193.243.1.5. A first portion of an Internet (IP) address defines a computer network and a second portion of the address defines a computer coupled to the network. Computer networks are identified by network numbers and IP routers generally store a table of such network numbers together with corresponding IP addresses for gateways into the networks. Thus, in the foregoing example, 193.243 may define the network number of a computer network. In many cases it is convenient for network operators to
assign sub-networks to different sets of host addresses within the network so that, for example, 193.243.1 defines a first sub-network and 193.243.2 defines a second sub-network. It can therefore be seen that an Internet address usually reflects the underlying physical structure of a computer network, at least to a degree. Domain name servers translate between domains and Internet addresses typically by working down a tree from a root/top-level domain name server. The allocation of domain names is overseen by ICANN who appoint country-based domain name registrars.
From the foregoing discussion it can be seen that one strategy to identify IP addresses which have a good chance of being close to the IP address of a user's computer is simply to truncate the IP address to identify a network number or sub-network address. Typically an Internet Service Provider allocates an IP address to a user's computer when the user logs onto that ISP, the address being selected, often at random, from a range of IP addresses assigned to that Internet Service Provider. Thus to identify or filter candidate URLs for spidering according to "proximity" to the IP address of a user's computer the system merely has to identify a subset of URLs for which a selected number n of the candidate URL's IP address most significant bits match the corresponding most significant bits of the user's IP address. For example, if the IP address of a user's computer is 193.243.5.10 a candidate URL with an IP address of 193.243.6.20 may be considered within the user's catchment area because the first portions of these two addresses match.
The value of n selected determines the catchment area of a user's computer. In a simple embodiment this could be fixed at, for example, 16 bits. In a more sophisticated example the number of bits may be selected according to the class of IP address. Class A Internet addresses are reserved for large networks and use only the first octet for the network number (addresses 1 to 126); class B addresses are for standard size networks and use the first two octets for the network number; class C addresses are for small networks and use the first three octets for network numbers. Thus for class A addresses n may be small, for example n = 8; for class B addresses n may be larger, for example n = 16; and for class C addresses n may be larger still, for example n = 24.
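The prefix-matching test, together with an illustrative classful choice of n, may be sketched as follows (the method names are assumptions; the classful ranges follow the conventional octet boundaries described above):

```java
// Illustrative sketch of the prefix-based "catchment area" test: a candidate
// URL's IP address is considered proximate when its n most significant bits
// match those of the user's IP address.
public class ProximityFilter {

    // Parse a dotted-decimal IPv4 address into a 32-bit integer.
    static int toInt(String ip) {
        String[] parts = ip.split("\\.");
        int v = 0;
        for (String p : parts) v = (v << 8) | Integer.parseInt(p);
        return v;
    }

    // Classful choice of prefix length: class A (first octet 1-126) -> 8,
    // class B (128-191) -> 16, class C (192-223) -> 24.
    static int prefixBits(String ip) {
        int firstOctet = Integer.parseInt(ip.split("\\.")[0]);
        if (firstOctet <= 126) return 8;
        if (firstOctet <= 191) return 16;
        return 24;
    }

    // True when the n most significant bits of the two addresses match.
    static boolean inCatchment(String userIp, String candidateIp, int n) {
        int mask = n == 0 ? 0 : -1 << (32 - n);
        return (toInt(userIp) & mask) == (toInt(candidateIp) & mask);
    }

    public static void main(String[] args) {
        // With n = 16, both addresses share the 193.243 network prefix.
        System.out.println(inCatchment("193.243.5.10", "193.243.6.20", 16));  // true
        System.out.println(inCatchment("193.243.5.10", "158.152.1.1", 16));   // false
    }
}
```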
More generally, a subset of IP addresses based upon "proximity" may be selected using a so-called subnet mask, that is, a 32 bit number with selected (most significant) bits set to one. Where subnets are contiguous within a network each subnet can access the other subnets without passing traffic through other networks. Additional/alternative proximity determinations may be based upon Classless Inter-Domain Routing (CIDR).
In a still more sophisticated system traceroute, a standard utility, may be employed to determine the route datagrams take between two hosts and the "proximity" can be determined accordingly, for example by counting the number of hops in the route. The hop count is a functionally significant measure of the distance between two computers connected to the Internet since a datagram may pass through a large number of different networks before reaching its destination, even when that destination is geographically close at hand.
In another embodiment the server gives each applet a list of URLs to ping and the applets report the ping times back to the server. The server maintains a list of URLs it wants spidered together with an average ping time for each URL determined from the average of all previous applet pings to that site. The server then selects URLs for a particular applet depending on how important it is to spider that particular URL and the applet's ping time to that URL. So when an applet has a particularly short ping for a URL compared to the average ping time it receives an instruction from the server to spider it.
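This ping-based selection may be sketched as follows; the running-average bookkeeping and the rule that an applet is assigned a URL when its ping beats the average are simplifying assumptions about the server's policy:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: the server keeps a running average ping time per URL
// and assigns a URL to an applet when that applet's reported ping is shorter
// than the all-applet average.
public class PingAssigner {
    private final Map<String, double[]> stats = new HashMap<>();  // url -> {sum, count}

    // An applet reports a measured ping time (in milliseconds) for a URL.
    void reportPing(String url, double millis) {
        double[] s = stats.computeIfAbsent(url, k -> new double[2]);
        s[0] += millis;
        s[1] += 1;
    }

    double averagePing(String url) {
        double[] s = stats.get(url);
        return s == null || s[1] == 0 ? Double.NaN : s[0] / s[1];
    }

    // Select, from the URLs awaiting spidering, those for which this applet's
    // reported ping beats the average (or no average exists yet).
    List<String> urlsFor(Map<String, Double> appletPings) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> e : appletPings.entrySet()) {
            double avg = averagePing(e.getKey());
            if (Double.isNaN(avg) || e.getValue() < avg) selected.add(e.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        PingAssigner a = new PingAssigner();
        a.reportPing("http://x.example/", 100);
        a.reportPing("http://x.example/", 200);  // average is now 150 ms
        Map<String, Double> appletPings = new HashMap<>();
        appletPings.put("http://x.example/", 120.0);
        System.out.println(a.urlsFor(appletPings));  // applet beats the average
    }
}
```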
Referring now to Figure 6, this shows a flow diagram of a search process implemented using an applet on a user's computer. This process operates in parallel with the URL download and spidering processes described with reference to Figure 3.
At step S600 the user enters a search term into the applet running on the user's computer or, alternatively, a search term is selected from a historical list of previously conducted
searches. The search term may comprise a single keyword or a combination of keywords linked by logical operators such as "KEYWORD1 and KEYWORD2".
At step S602 the applet sends the search request to query servicing server 124 and, at step S604, receives a list of search results and "tax" URLs back from the query servicing server. Each search result in the list comprises a URL, preferably together with additional information such as a title and/or an indication of the content of the web page pointed to by the URL. This additional information may be retrieved during the distributed spidering process and stored in system data store 126 in association with its corresponding URL. The search result URLs and the tax URLs each have an associated date and checksum, and optionally file size and language data. The search result URLs are flagged to differentiate them from the tax URLs.
As described above, the URLs which are stored in system data store 126 are organized in association with search term keywords and are ranked by their relevance. The list of search results received by the applet is ordered by relevance and thus when the applet displays the list of search results, at step S606, these are simply displayed in the same order in which they have been provided to the applet. The user may, optionally, re-order the displayed search results according to other criteria such as, for example, date.
At step S608 the applet identifies a first batch of URLs to begin spidering. The spidering process is preferably carried out by a plurality of concurrently running URL spidering threads in a broadly similar manner to that described with reference to Figure 3. Thus the steps of Figure 6 from step S600 to step S608 are preferably performed by a master or control thread of the applet which, in a preferred embodiment, is a GUI thread which also manages the interface provided for a user by computer system 200.
In a preferred embodiment some of the URL spidering threads are allocated to spidering search results and others of the threads are allocated to spidering URLs comprising the URL tax. For example where the applet creates ten URL spidering thread instances,
five of these may be assigned to spidering search result URLs and five to spidering tax URLs. Thus, at step S610 the applet starts a new thread for each URL to be spidered. Preferably the search result URLs to be spidered, although selected initially by the applet are, indirectly, amendable by the user. In such an embodiment the applet detects which search results the user is viewing, for example by detecting result list scroll events, list re-ordering, and list item deletion, and controls the URL spidering threads accordingly to spider, for example, URLs being viewed, URLs in the order that they are being viewed, and to cancel spidering of deleted items.
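The division of spidering threads between search result URLs and tax URLs may be sketched using Java's standard thread pools; the pool sizes, the simulated spidering work and the method names are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of step S610's thread allocation: a fixed number of
// spidering threads is assigned to search result URLs and another fixed
// number to tax URLs, so that taxation proceeds without starving the
// visible results. The spidering work itself is simulated by a counter.
public class SpiderThreads {
    static int spiderAll(List<String> resultUrls, List<String> taxUrls,
                         int threadsPerKind) {
        ExecutorService resultPool = Executors.newFixedThreadPool(threadsPerKind);
        ExecutorService taxPool = Executors.newFixedThreadPool(threadsPerKind);
        AtomicInteger spidered = new AtomicInteger();

        for (String u : resultUrls)
            resultPool.execute(() -> spidered.incrementAndGet());  // fetch/index u here
        for (String u : taxUrls)
            taxPool.execute(() -> spidered.incrementAndGet());     // fetch/index u here

        resultPool.shutdown();
        taxPool.shutdown();
        try {
            resultPool.awaitTermination(5, TimeUnit.SECONDS);
            taxPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return spidered.get();
    }

    public static void main(String[] args) {
        int done = spiderAll(Arrays.asList("r1", "r2", "r3"),
                             Arrays.asList("t1", "t2"), 5);
        System.out.println(done + " URLs spidered");
    }
}
```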
As illustrated in Figure 6 the master or GUI thread effectively halts at step S610, waiting for events from the user such as the scroll events described above and, at step S612, spidering of a plurality of URLs commences (the process steps for only one of these spidering threads are illustrated in Figure 6).
At step S612 an applet URL spidering thread requests a complete web page from the URL assigned to it for processing. The spidering thread retrieves both header and text data on the web page but does not retrieve objects embedded within the page accessed via further URLs, such as sub-frames. At steps S614 and S616 the applet URL spidering thread receives a data stream from the URL until reception is complete, when the process continues to step S618. During data reception the spidering thread sends reception status data to the master GUI thread to indicate events such as, "waiting for a response from the URL", "page downloading" and "time out and halt". This information may be used, for example by the applet, to optimize the balance between the number of threads assigned to search result URLs and the number assigned to tax URLs.
Once reception of the web page is complete the URL spidering thread caches the downloaded web page for the GUI thread to display on request, and sends "download complete" status data to the GUI thread (step S618). The GUI thread preferably displays an indication of the status of the web page download from each URL, for example as a traffic light to indicate "waiting", "downloading" and "ready". In the case of a URL
spidering thread which is spidering a tax URL preferably no status data is provided to the GUI thread since, in general, the tax URLs are not displayed.
In a preferred embodiment of the applet threads spidering search result URLs are given priority over threads spidering tax URLs so that, in effect, the taxation operates as a background process and has only a small impact upon the user's available bandwidth. This prioritisation may be implemented straightforwardly using the Java Virtual Machine.
Steps S620 to S628 correspond to steps S326 to S334 of Figure 3. At step S620 the applet stores links on the retrieved web page pointing to other objects such as program code, graphics, other web pages, sub-frames and the like. Then, at step S622, the thread compiles a list of all "words" on the page except for HTML tags and, at step S624, discards unwanted words from the list. Then, at step S626, the thread determines a rating for each listed word and, at step S628, sends compressed spidering data to data collection server 122 in a corresponding way to step S334 of Figure 3. At step S630 the spidering thread, which by then has processed the URL assigned to it, is reassigned to a new URL which may either be a search result URL or a tax URL. The process then continues again at step S612. If the user has modified the search result list, for example as described above, event data is received from the GUI thread and one or more existing URL spidering threads are reassigned to spider new (search result) URLs, whether or not they have completed processing of the URLs initially assigned to them.
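Steps S622 to S626 may be sketched as follows; the tag-stripping expression, the stop-word list and the frequency-based rating are illustrative assumptions about one possible rating scheme:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of steps S622-S626: compile a word list from the page
// with HTML tags removed, discard common "stop" words, and rate each
// remaining word by its frequency on the page.
public class PageIndexer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "or", "of", "to"));

    // Strip HTML tags, lower-case, and split the remainder into words.
    static List<String> words(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase();
        List<String> out = new ArrayList<>();
        for (String w : text.split("[^a-z0-9]+"))
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) out.add(w);
        return out;
    }

    // Rate each retained word by how often it occurs on the page.
    static Map<String, Integer> ratings(String html) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(html)) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Search systems</h1>"
                    + "<p>Distributed search of the web.</p></body></html>";
        // e.g. {search=2, systems=1, distributed=1, web=1}
        System.out.println(ratings(page));
    }
}
```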
Referring now to Figure 7, this shows a flow diagram of a computer program running on query servicing server 124 for providing search results to an applet running on a user's computer. At step S700 the query servicing server 124 receives a search request, including a search term or keyword, from an applet on a user's computer. At step S702 the query servicing server retrieves search result URLs from system data store 126, already ranked in the order in which they will be presented to the user. This is because, as has been described above, when indexed content data from URL spidering processes is written to system data store 126, it is written in order of relevance to an associated
keyword. Where a search term comprises two or more keywords search result URLs are retrieved in the manner which has already been described in connection with Figure 1.
At step S704 the query servicing server 124 then requests a list of tax URLs from data collection server 122. These tax URLs are preferably determined according to the same criteria as the background spidering URLs, as described with reference to Figure 5. In other embodiments tax URLs may be selected according to additional or different criteria from those used to select URLs for background spidering, for example to preferentially update the system data store with information relating to websites of businesses having a relationship with the search system service provider. At step S706 the query servicing server then sends both the search results and the tax URLs back to the applet, for display to the user, and for spidering. In some embodiments the search results and tax URLs are locally cached on the query servicing server 124 and transmitted to the user's applet in batches, to facilitate the applet's data handling and to make it easier for the search system to keep track of which URLs should be being spidered.
Once the search results and tax URLs have been sent to the applet, the process ends at step S708.
Referring now to Figure 8, this shows elements of a graphical user interface (GUI) thread suitable for use with the process of Figure 6. At step S800 the GUI thread displays a list of search results in the order they are received from query servicing server 124. Then, at step S802, the GUI thread awaits an event, for example initiated by a user. Such events may include, for example, a modify result display event (such as a scroll event, re-order list event, or delete item event as mentioned above), a page preview event, a select item event, a bookmark item event and (not initiated by the user) a web page download status update event.
On receipt of a status update event from a URL spidering thread (step S804) the GUI thread, at step S806, displays updated status information for the relevant URL. On
receipt of a modify result display event from a user (for example, by operation of a scroll bar) at step S808 the GUI thread displays a modified list of search results and then, at step S810, sends data relating to the modify result display event to one or more URL spidering threads as necessary, for example to reassign spidering threads to process new URLs.
On receipt of a preview event (for example, by the user clicking on a preview region such as a URL title) at step S812 the GUI thread displays a simplified rendering of the downloaded web page, for example a text-only display in a supplementary window. If a hypertext link is selected (for example, by a user clicking on the link) the GUI thread, at step S814, opens a new browser window for the selected URL for displaying data from the selected URL. If the data at that URL has previously been cached by the applet, so that a cached version of the data is available, this cached version is displayed. After each event the GUI thread returns to step S802, to await the next event. As the skilled person will appreciate, preferably the GUI thread is able to process more than one event in parallel.
Figure 9 shows exemplary dataflows 900 for a user search process and for a background spidering process. Once the search system home web page is downloaded to user terminal 102 the terminal's web browser makes a URL request 902 to search system web server 118 and applet data 904 is downloaded to user terminal 102. The user then enters a search term into the graphical user interface provided by the applet and a query 906 comprising this search term is sent to query servicing server 124. The query servicing server then returns a URL list 908 comprising search results and a URL tax to user terminal 102. URL requests 910 are then issued to web servers 116a-e comprising web servers storing web pages indicated by the search results and web servers to be spidered in accordance with the URL tax. The web page data 912 is then returned from these web servers to user terminal 102, where it is processed by the applet. The compressed URL spidering data 914 resulting from the web page processing (comprising indexed content data) is then sent to the data collection server 122 for storing in the system data store 126. Generally compressed spidering data from a plurality of web pages is
reported to the data collection server, data from each page being reported by a separate thread running within the applet.
As a background process the data collection server 122 also provides a URL spidering list 916 to the user terminal 102. This process first sends URL page header requests 918 to web servers 116a-e (which are merely exemplary of all the web servers connected to the Internet) and web page headers 920 are consequently returned to the user terminal. Then, where necessary, the background spidering process issues URL requests 922 for full web page data and these web pages 924 are then returned for processing. Compressed URL spidering data 926 is then reported to data collection server 122 in the same way as with the search and tax URL spidering process.
Figure 10 shows an exemplary graphical user interface 1000 for presentation to a user on personal computer 200. The user interface comprises a conventional browser window 1002 within which a secondary window 1004 is provided by the applet's graphical user interface (GUI) thread. This secondary window comprises a field 1006 for entering a search term and an adjacent (search) button 1008. A window 1010 displays a list of search results including a field 1012 displaying a title and URL for each result. A second field 1014 indicates whether or not a web page for the search result is locally cached on the user's computer and, if it is cached, on what date it was cached. The field 1014 also includes an indication 1016 of the download status of a web page to indicate, for example, that the associated web page is not active, that no response has yet been received from the web server, that the page has been accessed but is not yet fully downloaded, and that the page has been fully downloaded. A bookmark and relevance field indicates the likely relevance of a web page to the requested search term, and indicates whether or not the page has been bookmarked. A scroll bar 1020 is provided to allow a user to scroll up and down the list of search results in a conventional manner. A preview window 1022 displays a scrollable preview of the text within the web page.
Aspects of a second embodiment of the system, broadly similar to that described above, will now be described, starting with the applet.
When the user visits the system's home page a signed Java applet of about 100k is downloaded onto the user's machine. This figure is currently based on what is deemed an acceptable download for the majority of web users on modem dialups. Once the applet is downloaded the first time, it should remain cached as long as the user does not clear the browser's cache. Updates to the applet can be released so as to force a re-download. The bulk of the applet code can be saved to a special area on the disc which remains even if the user clears his browser's cache, if desired.
Once the applet has been downloaded, the user must then authorise the applet to have full access to his machine's resources by clicking a Grant button on a window that appears. The user is then invited to enter a query in the form of one or more keywords. The applet contacts the system server which returns a list of URLs (Uniform Resource Locators i.e. web page addresses) related to the query, which the applet displays in a table. The table shows the Date the page was last modified (according to the system database), the document Title, the URL (or, in other embodiments, just the domain), and a rating for the site.
When the applet receives a URL, it attempts to contact the site and download the page or document there, updating its status and dates columns as it does so. Here document is used generally to include video, audio, text and multimedia files, games and other similar types of information. When a page has been downloaded (several are downloaded simultaneously by utilising Java's inbuilt thread (multi-tasking) support) it can be previewed in a preview pane of rich text (i.e. colours, fonts, sizes, bold and italic but, in one version, with no images or audio) or alternatively an HTML frame, by hovering the mouse pointer over its entry in the list.
Hence as pages are being downloaded, the user is able to see which ones have been contacted and cached, which ones are still pending and which ones have moved or been
deleted. Once cached, the user can see the size of the page, and by moving the mouse over it is able to preview it and get an idea of what the page consists of. At this stage the user may bookmark the site for later reference, or he may choose to view the actual page.
Viewing is performed by clicking on the entry in the list (or on the URL, if shown). This brings up another browser window, with its address bar disabled to prevent confusing the user because the page is displayed from a local cache (i.e. a file on the hard disc) rather than a web location. Whilst viewing the full page, the user is free to follow links as in a normal browsing session, although any links followed take the user to actual web pages rather than local cache files. Having viewed the page, the user may then decide to bookmark it. Bookmarking is the "marking" of a location so that a user can return to it, for example by storing a reference to the location in a folder.
Bookmarking is preferably performed by clicking on the checkbox next to the entry in the list. Note that if the user has previously bookmarked this exact URL (either for this query or a different one, whether in this session or a previous session) then the entry in the list will already have its checkbox checked.
When the user bookmarks a particular site the act of bookmarking advantageously casts a vote or recommendation for that page. There may be two forms of bookmarking, the stronger marking a persistent interest in a site, and the weaker marking an article to be read later. Alternatively, this distinction can be automatically deduced by the applet observing when the user subsequently views a page via the bookmark from a Bookmark Viewer.
The system uses these votes to assign a two-fold recommendation-rating or score for the URL. The first is with regard to the search terms used in the query and the second is a general vote for the page as being of high quality. Thus when a user performs a query, the results he receives are ranked by other users' bookmark-recommendations. This process happens in real-time, so a popular new site can be highly ranked very quickly.
The order of the pages returned for a particular query is by votes with regard to this or similar queries, but the "general quality" vote is also displayed alongside each page. A fuller discussion of how ranking works is provided below in the context of Query Servicing.
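The two-fold voting and real-time ranking described above may be sketched as follows; the data structures and method names are illustrative assumptions rather than the system's actual schema:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: each bookmark casts a vote for the page against the
// query's term and a general quality vote. Results for a query are ranked
// by per-term votes; the general vote travels alongside for display.
public class VoteRanker {
    private final Map<String, Map<String, Integer>> termVotes = new HashMap<>();
    private final Map<String, Integer> qualityVotes = new HashMap<>();

    // Record a bookmark made from the results of a query for a given term.
    void bookmark(String term, String url) {
        termVotes.computeIfAbsent(term, k -> new HashMap<>()).merge(url, 1, Integer::sum);
        qualityVotes.merge(url, 1, Integer::sum);
    }

    // Results for a term, most-voted first; updated in real time as votes
    // arrive, so a popular new site rises quickly.
    List<String> rankedResults(String term) {
        Map<String, Integer> votes = termVotes.getOrDefault(term, Collections.emptyMap());
        List<String> urls = new ArrayList<>(votes.keySet());
        urls.sort((a, b) -> votes.get(b) - votes.get(a));
        return urls;
    }

    // The "general quality" vote displayed alongside each result.
    int qualityVote(String url) { return qualityVotes.getOrDefault(url, 0); }
}
```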
Apart from the standard searching features outlined above, the applet may have a number of other features, including, a Bookmark Viewer allowing users to reorganise their bookmarks, check for out of date ones, and view the bookmarked pages, which preferably registers a vote for the page with the system, and also indicates to the applet that this bookmark is to be more highly rated amongst the user's list. The applet may also offer the user 'today's favourite query' and 'today's favourite site'.
Turning now to the system server of this second embodiment, this comprises a central repository of data which indexes the web. Its function is to collate information collected by its clients and to service requests from clients to access this data in an ordered fashion. Preferably, it does not perform the collection or crawling itself.
This server consists of three main subsystems: a database, a data collection subsystem and a query request servicing subsystem.
The database stores data relating to what pages exist on the web, what keywords (or more generally, terms i.e. words or phrases) are associated with these pages, and how each page is ranked according to each term. It also contains user data including their bookmarks. This database may be implemented as a standard relational database, or as a custom data structure.
The data collection subsystem or process is the recipient of processed and compressed data prepared by the applets relating to new and updated pages. This data is incorporated into the database, replacing anything which is out of date. An important part of this process is that it is done in real time, i.e. the database is constantly kept fully up to date. This subsystem is also responsible for accepting user votes for sites in terms
of bookmarking. The bookmarks are inserted into the user's database entry and a vote is registered linking the site with a search term.
The query request servicing subsystem answers queries from users, essentially of the form "most relevant pages for <keyword(s)>", which are submitted via the applet. The response consists of a list of URLs, ordered with the most relevant first, which is fed back into the applet for display to the user. Associated with this response is also one or many URL tax items (see below), which the applet must check and report back to the system. This process preferably has a very high performance as volumes of requests are typically measured in millions per day. Achieving this performance is helped by the fact that entire web pages do not need to be served for each request; instead a compact list of URLs which the applet can display in a user-friendly fashion will suffice.
Referring now to the features of the applet in more detail, the applet is signed, which means that a digital certificate or signature has been applied to the binary file which comprises the applet such that an end user may be confident of where the applet originates from. This instructs the browser to provide the user with the option of marking the applet as trusted.
To understand the implications of this procedure, it is helpful to outline the security model which Java (Regd. T.M.) applies to applets. To prevent the execution of malicious code when a user browses a website containing Java (Regd. T.M.) applets, most browsers' Java Virtual Machines (JVM) have certain restrictions which determine what an applet can and cannot do. These restrict access to the local file system and to the network at large. Normally, an applet is only permitted (by, e.g., a browser) to make a network connection to the server from which it was downloaded. This could prevent the functionality in the system applet of going to different websites and downloading their pages. However, by signing the applet, and prompting the user to mark the applet as trusted, these restrictions are lifted.
The two most popular web browsers, Internet Explorer (Regd. T.M.) and Netscape (Regd. T.M.) currently employ different signing mechanisms. Therefore, to provide a signed applet which both browsers can recognise as signed it is preferable to sign it twice, once using Internet Explorer's (Regd. T.M.) scheme and once using Netscape's (Regd. T.M.).
One way for the applet to communicate with the server is to use Remote Method Invocation (RMI), which is implemented on top of sockets; where sockets are available, raw sockets provide an adequate alternative. However, where users are behind corporate or ISP firewalls, the use of sockets is generally prevented. An alternative uses HTTP requests, although this imposes a performance overhead, as HTTP is a relatively heavyweight and stateless protocol.
Preferably both socket-based and HTTP-based versions of the system protocol are implemented so that users who are able to make use of the more efficient sockets version can do so.
Java (Regd. T.M.) has inbuilt and efficient support for threads, that is, the ability to set multiple processes running concurrently within the same program (i.e. multi-tasking). This provides a number of advantages. Using threads to download, say, ten web pages simultaneously, there is no need to manage switching between pages as packets arrive in arbitrary order across the Internet. Since the JVM automatically allocates CPU resources to each of the threads, if one is held up waiting for the network to provide data, then the CPU is freed to work on another task. In the case of the system applet, whilst one thread is waiting for a response from the website it is attempting to download a page from, another thread can be processing a previously downloaded page, thus optimising usage of available resources. Moreover, from the user's point of view threads keep the graphical user interface (GUI) responsive, in that whilst heavy processing is going on in the background, or a page is taking time to download, the user interface still responds interactively.
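The thread-per-page pattern described above can be sketched as follows (in Python for brevity; the described system uses Java threads, and the simulated latency stands in for real network waits):

```python
import threading
import queue
import time

def worker(jobs, results):
    # Each thread downloads (here: simulates downloading) one page
    # at a time; while it "waits on the network", the CPU is free
    # to run the other threads.
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return
        time.sleep(0.01)          # stands in for network latency
        results.put((url, "<html>...</html>"))

jobs, results = queue.Queue(), queue.Queue()
for u in [f"http://example.com/page{i}" for i in range(10)]:
    jobs.put(u)

# Around 10 threads is suggested in the text as a typical optimum
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With four workers the ten simulated downloads complete in roughly three latency periods rather than ten, which is the advantage the text describes.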
Preferably, the GUI gets one thread to itself, which controls all the others, preventing the controls hanging (becoming non-responsive) when something is happening in the background.
Each page being downloaded preferably also gets a thread to optimise the trade off between network and CPU. This includes URL tax pages (described below).
Advantageously, the thread that downloads the page will then, having updated the GUI, begin processing the page if necessary. This thread then returns either a notification that the page has not been modified since last checked, or a fresh analysis of the page, as required. In this way information is sent to update the system on a page-by-page basis as and when it is available. This is advantageous as the applet might be terminated at any stage by the user closing his browser or moving to another page.
When the applet receives a list of URLs, it displays them on the GUI in the order received which is ranked in descending order of relevance as determined by the system using its relevance and voting data. Additionally or alternatively, the results can be ranked using data from the local user only. The applet then begins to contact each of the sites in the list and to download from them. It starts at the top of the list, downloading the pages most likely to be what the user is looking for.
Referring now to Figure 11, this illustrates a plurality of concurrently running threads of the search, spidering and user interface aspects of the applet, showing different stages of page downloading and processing. A single thread is responsible for downloading one web page and, if necessary, processing the web page data and transmitting the result back to the system. A GUI thread 1100 is also shown. In Figure 11 a spidering thread first waits 1110 for a response from a web server then downloads 1120 the web page and finally processes and transmits 1130 indexed content data back to the search system server. In the illustration an exemplary slow thread 1102 and fast thread 1104 are shown together with a thread 1106 which does not receive a response from the web server and times out.
The exact number of threads which results in optimal performance can be determined by empirical means but it is typically of the order of 10. The applet may measure its own performance to optimise the number of threads it creates. This is useful as many factors affect the optimal number, e.g. network bandwidth availability, CPU availability, JVM use and physical memory availability.
A further preferred feature that improves the performance from the user's point of view is that the applet monitors the user scrolling through the list of URLs to ensure that it concentrates on items currently visible in the scrolling window. For example, say ten items are visible in the list without scrolling, then if the user quickly scans the first ten items and determines, perhaps by looking at the titles, that he is not interested without actually waiting for the preview to appear and scrolls onto the next ten items, then the applet operates to focus on getting those items downloaded. The applet preferably continues to download the previous items in case the user decides to return to them, but places a higher priority on the currently visible items. In this way the applet is seen to keep in touch with the user and is able to present previews and cached copies of the pages with a minimal delay.
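The scroll-following prioritisation might be implemented along the following lines (a minimal sketch; the simple promote-to-front policy is an assumption, the text only requiring that visible items get higher priority while earlier items continue downloading):

```python
def download_order(all_items, visible_range):
    """Order pending downloads so that currently visible list
    entries come first; scrolled-past items are kept in the
    queue but demoted, in case the user returns to them."""
    lo, hi = visible_range
    visible = all_items[lo:hi]
    rest = all_items[:lo] + all_items[hi:]
    return visible + rest

items = [f"url{i}" for i in range(30)]
# The user scrolls so that items 10-19 are visible in the window
order = download_order(items, (10, 20))
```

A real implementation would recompute this ordering each time the scroll position changes, re-prioritising whatever downloads have not yet started.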
As the applet downloads pages, it stores them in a special directory on the local hard disc set aside for this purpose. This is facilitated because the applet is signed and trusted and therefore has access to system resources (such as the file system) that untrusted applets do not have (see above section on signing). Once the HTML files are on disc, the applet can cause the browser to view these files as if they were real web pages, thus allowing the pages to be viewed almost instantaneously once selected.
One potential difficulty with this approach is that often web pages contain a lot of images which can be large files compared with the actual HTML file itself and therefore can cause a delay to the display of the page. Therefore it is desirable for the applet to automatically download any such associated files into the same cache directory to allow very rapid display of pages even with many large images (provided there was sufficient
time for all the files to download). This also requires the applet to rewrite the master HTML page such that URLs for all associated files point towards the local cached versions. In order to avoid problems with broken images if not all images have been cached before the page is viewed, only those URLs for which a local cache file has been obtained are rewritten, the rest continue to point to the original source. This means the browser uses its normal iterative display algorithm which displays all loaded elements and then fills in other elements such as images and frame contents as and when they complete downloading.
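The selective rewriting described above, where only URLs with a local cached copy are redirected, might look as follows (an illustrative sketch; the `src` attribute handling and cache-map structure are assumptions, and a full implementation would also cover `href`, frames, and relative URLs):

```python
import re

def rewrite_cached_links(html, cache_map):
    """Point resource URLs at local cache files where available.
    Un-cached URLs are left pointing at the original source, so
    the browser's normal iterative display fills them in as they
    finish downloading. cache_map maps original URL -> local path."""
    def swap(match):
        url = match.group(2)
        local = cache_map.get(url)
        return f'{match.group(1)}"{local or url}"'
    return re.sub(r'(src=)"([^"]+)"', swap, html)

html = '<img src="http://a/1.png"><img src="http://a/2.png">'
# Only the first image has been cached so far
out = rewrite_cached_links(html, {"http://a/1.png": "cache/1.png"})
```

Note how the second image keeps its original URL, avoiding the broken-image problem the text describes.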
Note that the cached HTML files on disc can also be used for the preview functionality (see below) to enable a full rendering of the page (utilising the browser HTML renderer) on a small scrollable pane rather than the simplified rendering performed within the applet itself. The applet may provide this as a user option depending on which the user finds most helpful; this will typically be dependent on the network and CPU resources available to the given user.
In order to preview a page, the user simply hovers his mouse pointer over the list entry of interest. If utilising the browser's rendering engine, then the applet issues a command specifying the file to display, and whether to display it in a separate window or a particular pane. Preferably there is a pane as part of the system's home page that is set apart for this purpose; if a separate window is used then it preferably has its address bar and toolbar suppressed to save space (and to prevent confusing the user with the presence in the address bar of a local filename instead of the actual URL).
If rendering the pages within the applet then a Java (Regd. T.M.) pane is created within the screen area that belongs to the applet, and this is populated with a much-simplified representation of the HTML, which is purely text but which has some of the characteristics of the full HTML such as colours and font size. Again this pane may be placed in a separate window - the advantage of doing this is that the user has complete freedom in how he organises the layout. The interface may provide a detach button
which allows the fixed pane to become a separate window, with the option of re-attaching it.
When ready to view a full page, the user clicks on the page title in the list, which highlights as the mouse rolls over it indicating that it can be clicked. This preferably uses the same mechanism described above to display the full page, again from local cache if available; this is advantageously shown in a new browser window. The user may then choose to follow links within the site, in which case un-rewritten URLs are followed taking the user to areas within the actual site and browsing may continue as usual. On closing the window, the user is presented with the previous browser window containing the system applet, ready to continue looking for his page of interest.
When the applet has downloaded the text for a particular URL it begins processing it, but only if the system marked this URL as wanted for update, and if the page has a date modified later than the date the system has on record. If the page is wanted for update but has the same date modified as the system has on record, then this information is returned to the system so it knows that it does not have to check this page again for another period of time specified for updating.
Those pages which have changed since last checked and which the system therefore requires a new analysis of, are processed to determine the keywords present and to obtain a relevance ranking for the keywords based on relative frequency and significance within the page (e.g. large headings carry more weight). This processing is performed in a thread for each page, preferably the same thread that originally downloaded it.
The applet begins by building the word list for the page, that is it lists all the words that appear on the page, and assigns each word a rating based on its importance. This is determined by that word's frequency and whether it appears in the title, headings, links etc. The applet has built into it (preferably in such a way that it may be dynamically updated without downloading the whole applet again) a list of words that are too
common to be useful for searching except when searching by exact phrase. Preferably, there is also a list of words that are deemed inappropriate for allowing searching on. These words will also be discarded. Thus a list is obtained containing, for each non-trivial, non-offensive unique word, a ranking, for example on a scale of 1-255 (1 byte) or 1-65535 (2 bytes). This provides a very compact analysis of the page as there is no duplication and all HTML tags and common words have been removed. In addition to the word list, a list of contained URLs is also returned to the system; in this way new pages are discovered and analysed.
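The word-list construction above can be sketched as follows. The stop-word set, the title-boost factor and the tokenisation are illustrative assumptions; the text specifies only that ratings reflect frequency and prominence, that common and inappropriate words are discarded, and that ratings fit in one or two bytes:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "to"}   # illustrative subset

def build_word_list(text, title=""):
    """Produce a compact {word: rating} analysis of a page:
    frequency-based, boosted for words appearing in the title,
    with common words discarded and ratings clamped to 1-255."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    ratings = {}
    for w, c in counts.items():
        score = c * (3 if w in title_words else 1)  # boost factor is an assumption
        ratings[w] = max(1, min(255, score))        # clamp to one byte
    return ratings

wl = build_word_list("the quick brown fox and the lazy dog", title="Quick Fox")
```

A real analyser would additionally weight headings and link text, as the passage describes, and extract the page's outbound URLs for spidering.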
Once a page has been analysed, the resulting data is transmitted to a server for central organisation and storage. This is done as the next stage in the sequence handled by the thread that downloaded the page (see above section on threads). One embodiment of a format for this data is shown below:
CHANGED = 1
This is for a fresh analysis. For a page that has been checked and found to have not been modified, the following might be sent:
URL_ID CHANGED = 0
This simple message instructs the system that this page has been checked as required. The system notes this and then checks again in the period of time specified for updating.
The URL_ID is supplied by a server when it sends the list of URLs, and the system uses its own local clock to determine the DATE_CHECKED field for its database entry.
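A sketch of how the applet might construct these two kinds of report follows; the dictionary representation and the WORDS/LINKS field names are assumptions for illustration, only URL_ID and CHANGED being named in the format above:

```python
def make_update_message(url_id, changed, word_list=None, links=None):
    """Build the page-check report the applet returns to the system:
    either a bare 'unchanged' notice (CHANGED = 0) or a fresh
    analysis (CHANGED = 1) carrying the word list and contained URLs."""
    if not changed:
        return {"URL_ID": url_id, "CHANGED": 0}
    return {"URL_ID": url_id, "CHANGED": 1,
            "WORDS": word_list or {}, "LINKS": links or []}

unchanged = make_update_message(42, changed=False)
fresh = make_update_message(43, changed=True,
                            word_list={"fox": 7},
                            links=["http://example.com/next"])
```

The unchanged message is deliberately tiny, since it only tells the system to reset its update timer for that page.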
When the user clicks the checkbox next to a search result, that page is bookmarked within the applet. This means the user can return to it at later time for previewing or viewing. It also sends a vote to the system to rate this site with regard to the search terms the user was using. The user's bookmarks are sent to the user database on the system server so that they can be retrieved by the user at a later date.
The user is identified by means of an HTTP cookie, a small piece of data which the web server sends to the browser and which the browser keeps a copy of between sessions. The browser then automatically sends this back to the server whenever it visits that site again, thus enabling the server to identify the user and retrieve personal information for them. Thus, when the user returns to the system site he is automatically and transparently logged in without having to take any action. There is also an option for regular users to register with the site. With this they choose a username and password and can then logon to the system from any computer connected to the Internet and retrieve their personal bookmarks. For both types of user there is also the option of saving their bookmarks to local disc and of exporting them in a format (such as HTML)
such that they can be imported into popular browsers such as Internet Explorer (Regd. T.M.) and Netscape (Regd. T.M.). Similarly, users can import bookmarks from their browsers into their system accounts.
As bookmarks stored on the system server occupy disc space, it is preferable for accounts which have not been accessed for a given period of time (say three months) to be deleted. The user is sent a warning email a week before this is going to happen, and if the user accesses the system within that week then the account becomes active again. The email contains the user's username and password in case they have forgotten it and a link to the system that automatically logs them on, making it as simple as possible for the user to begin using the system again. The email also contains an attached file, in portable HTML format, containing all the user's bookmarks so that even if the user's account is deleted, they still have a record of their bookmarks which can be used directly from the email, saved to local disc, imported into a browser or, if the user creates a new system account at some point in the future, imported back into the system.
In order to organise a user's often large number of bookmarks, there is preferably provided a special Bookmark Viewer or Bookmark Manager activatable by the user. This may be a pane on the side of the normal system applet view which may display contracted and expanded folder contents view formats. Bookmarks are preferably organised into Folders, which are hierarchical and can be expanded or contracted to show the next level down in the hierarchy, in the standard Windows (Regd. T.M.) tree-diagram paradigm. By default, when a user bookmarks a site, it is placed in a folder based on the query that the user performed. This may be automatically made hierarchical based on multiple query terms. Users may leave the bookmarks in the folders the applet assigns for them, or they are free to move them around at will, change folder names and create new folders. If changing a folder name, the system prompts the user to ask if he wants bookmarks for the original query terms to still go into that folder or not (if not, another folder will be created based on the query terms again).
At any stage, the user can hover his mouse over his bookmarks, causing them to be previewed in the normal preview window. Clicking on them brings up a new browser window containing the bookmarked page. It is quite likely that often the user will click on a bookmark before the applet has had a chance to cache it (the only warning the applet having received being the time the user had his mouse pointer hovering over the link), in which case the browser window will open with the actual URL, not the name of a locally cached file.
Extra features in the form of a personal results functionality are available on subsequent visits for users who have already visited the system and performed one or more queries (see below). With this optional feature selected, on returning to the system site, before entering new search terms, the applet automatically selects a number of queries from the user's history, up to, for example, ten (this number is preferably user-configurable). A search is then performed for each of the queries, returning only pages which are highly rated by other users and which are relatively new, e.g. less than 1 year, 1 month, 1 week or 1 day old. If there are no such sites for a particular query then nothing is shown. These cut-off criteria are preferably user-configurable (for example, in terms of votes cast or recency of modification). These results are then displayed in the same manner as normal search results or in a different format and in the same part of the window with the document title/URL, short extract and ranking information.
Preferably, a priority value is maintained for each query; this may be incremented whenever that query is run by the user. A small header indicating which of the user's queries each set of pages are in response to may also be provided. Again, by hovering his mouse over the entry, a preview is shown to the user and by clicking on it the full page comes up in a separate window as normal. This effectively provides a "web magazine".
The queries chosen for such a Personal Results or web magazine page, if there are more than the specified number to choose from, are preferably selected based on the importance of each query to the user. This may be determined by any or all of: how recently the user performed the query, how often the user performs it, and how many bookmarks he has which relate to it. It is also possible to allow the user to promote a query to a higher significance, thereby guaranteeing its inclusion in the Personal Results search.
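The selection criteria above can be combined into a simple score, as in the following sketch; the particular weights and the linear formula are assumptions, the text naming only the three inputs and the promotion override:

```python
import time

def select_queries(history, limit=10, now=None):
    """Rank a user's stored queries by recency, frequency and
    related bookmark count; promoted queries always make the cut.
    The weighting below is illustrative, not from the text."""
    now = now or time.time()

    def score(q):
        days_old = (now - q["last_run"]) / 86400
        return q["runs"] + q["bookmarks"] * 2 - days_old * 0.1

    promoted = [q for q in history if q.get("promoted")]
    rest = sorted((q for q in history if not q.get("promoted")),
                  key=score, reverse=True)
    return (promoted + rest)[:limit]

now = time.time()
history = [
    {"term": "jazz", "runs": 5, "bookmarks": 1, "last_run": now - 86400},
    {"term": "chess", "runs": 1, "bookmarks": 0, "last_run": now - 864000},
    {"term": "maps", "runs": 1, "bookmarks": 0, "last_run": now, "promoted": True},
]
chosen = select_queries(history, limit=2, now=now)
```

Here the promoted "maps" query is guaranteed a slot, with the remaining slot going to the highest-scoring ordinary query.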
If the user proceeds to perform a new query, then the new search results replace the Personal Results. However, these may be returned to at any stage (including on the user's first visit to the system) by clicking on a Personal Results button.
The queries that generate the Personal Results page preferably implement the usual spidering functionality of any other query, including the URL tax (see below). This is advantageous as people may use the system as their home page or often view their Personal Results without actually doing fresh queries, in which cases the system benefits from their spidering input as soon as they logon.
To provide a comprehensive and up to date index of the web, the system relies on the processing capability and bandwidth of its users, in particular, the applets running on their browsers. When users perform queries against the central system database a list of matching URLs is returned. If the system needs an update for any of these pages, it informs the applet, which returns to the system a fresh analysis of the page containing the word list and the set of URLs that the page links to.
However, there are potentially many pages that should preferably be checked every day but that will not necessarily appear high enough in a search results list to be checked frequently enough. Therefore, for every set of URLs that users are interested in, preferably a number of URLs are returned purely for the purpose of spidering, that is, in order to get the applet to check, and if necessary, analyse them. This may be referred to as a "URL Tax". The ratio of spidering URLs to search result URLs will be termed a URL tax percentage.
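Mixing tax URLs into a response might be done as follows (a sketch under the stated tax-percentage definition; the minimum of one tax URL per response is an assumption):

```python
def apply_url_tax(search_results, spider_queue, tax_percent=10):
    """Attach 'URL tax' items to a set of search results: roughly
    tax_percent spidering URLs per 100 result URLs, drawn from the
    system's queue of pages wanting an update."""
    n_tax = max(1, len(search_results) * tax_percent // 100)
    tax = spider_queue[:n_tax]
    return search_results, tax

results = [f"http://hit/{i}" for i in range(20)]
pending = [f"http://stale/{i}" for i in range(5)]
hits, tax = apply_url_tax(results, pending, tax_percent=10)
```

The tax percentage is a tuning knob: the text contemplates anything from 1% up to 50% depending on the system's spidering needs and the user's bandwidth.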
If the applet completes downloading all pages the user is interested in (or at least the next ten in the list say) it then begins checking more URLs above and beyond the normal tax percentage. This allows efficient use of high-bandwidth users who leave the system open in their browser after finishing using it.
An exemplary database schema for the system will now be described, giving the main features and data items contained. The actual schema used need not follow this exact structure, as this will largely depend upon the data structure used to implement the database, i.e. a custom architecture or a standard relational database management system.
URL_ID URL LAST_UPDATE LAST_CHECKED RATING TODAYS_VOTES
In the URL list, each URL is assigned a unique compact ID of preferably five or more bytes. 5 bytes allows for a trillion unique URLs, but depending on the exact implementation 8 bytes may be simpler to store.
A field TODAYS_VOTES counts the number of times this page has been bookmarked today, so it is set to zero every 24 hours. A RATING field is a score based on the number of bookmarks and other votes (i.e. people viewing the page from their bookmark viewer) which accumulates with time, although when pages are freshly analysed this number is reduced, for example by halving its value.
Here, a term is used to mean a "word" (preferably with capitals and punctuation suppressed) or a set of words which are grouped together as a phrase, although preferably the order is not significant. In this context, "word" may include words in more than one language, proper nouns, and combinations of letters, numbers and other characters such as the AMD "3DNow!" trademark. TERM_ID is a 4-byte integer allowing for 4 billion unique search terms. TEXT is a word or phrase as a character string, e.g. "clinton". RATING counts the accumulated number of times this term is used for querying and TODAY is the RATING count for the last 24 hours.
A TERM table contains all unique words to be indexed against, and in addition it contains the most popular phrases that searches are performed against. The number of phrases stored is selected depending on the system's resources.
A RATING table provides cross-references between particular pages and search terms that they are relevant to. Data in a RATING field is preferably a combination of a static rating of a page with respect to a particular term, i.e. how often a particular word appears on a particular page and a dynamic rating i.e. based on bookmarked pages that users associate with a search term. Depending on the exact implementation (in particular the request servicing algorithm) the static and dynamic ratings may be stored in separate fields.
There are some tables which are not related to the core dataset, rather they hold information related to users i.e. bookmarks and frequently asked queries, as described below. In the embodiment described with reference to Figure 1 these tables are held in user data store 128.
A USER table holds important user-related information. An entry for each unique user is created the first time they visit the system site. As session tracking is performed using
HTTP cookies without the need for the user to register and sign in, initially the USERNAME, PASSWORD and EMAIL fields are blank, and only a USER_ID is stored. This is what the cookie stored on the user's browser contains in order to identify the user on a return visit.
If and when a user decides to register with the system, he must enter a unique username as well as a password and email address.
A USER TERM table stores queries that a particular user performs on a regular basis. Each user will have a number of entries in this table, preferably with an upper limit to prevent the table growing too large (the primary key for this table is therefore USER_ID + TERM). If the phrase or word appears in the master term table described above, then a TERM_ID is referenced, thus saving space in the database; otherwise the term appears in full as a text string. The priority field indicates the importance of this query; preferably it is automatically incremented whenever the user performs that query. In addition, the value may be manually edited by the user to indicate when they are particularly interested in a query. This value is used when generating the automatic Personal Results page.
A BOOKMARK table stores the users' bookmarks. The primary key for this table is USER_ID + URL_ID; this ensures that each user may only bookmark a given page once. TERM indicates for which query a given page was bookmarked, and this information is used for ranking pages with respect to search terms. A FOLDER_ID is used by the Bookmark Viewer/Manager to organise the bookmarks hierarchically. URL_ID signifies the web page address efficiently by providing a cross-reference to the URL table.
USER_ID FOLDER_ID FOLDER_NAME PARENT_ID
A FOLDER table enables bookmarks to be organised hierarchically into folders. By default, when the user bookmarks a page, a folder is created with a name formed from the search queries used, if one does not already exist for this user. The user can rename folders and create subfolders. If a folder is a subfolder of another folder then its PARENT_ID points to that folder; otherwise PARENT_ID is null, indicating that it is a top-level folder.
The database may be implemented on an RDBMS (Relational Database Management System) or on a proprietary or other data structure. The dataset is also very large and simply structured. A custom-designed data structure is one efficient and cost-effective solution.
In one implementation, each table consists of a file on disc. There is a portion of each table held in physical memory (i.e. cached) at all times. Only specific operations are allowed on the database; these are the ones that allow the data to be updated as information is received from the applets, and that allow queries to be performed against the database by the applets. Optimised C++ routines perform these operations on the cached portions of the tables and also keep the full disc versions up to date.
The interface to the system database comprises a Java (Regd. T.M.) servlet, optimised for network operations, thus enabling a large number of applets to be simultaneously connected to the system. Integration of the front and back ends of the database is achieved by implementing the C++ methods as Java Native Interface (JNI) methods, that is, so that they comply with a standard interface allowing the servlet JVM to make direct method
calls on them. There are also servlets (conveniently, mostly written in C++) that continually sort data and ensure that the database is self-consistent.
In a preferred embodiment, virtually all the data in the system database is contributed by the client-side Java (Regd. T.M.) applets, apart from initial spidering information. It is thus desirable to ensure that the process of data collection from clients is efficient, with minimal overhead for both the client and server, and that the information incorporated into the system is accurate.
One important consideration is security, in particular the potential for hackers to create, by reverse engineering, malicious variants of the system applet. These could, for example, attempt to feed inaccurate information back to the database, or attempt to disrupt the normal operation of the system by means of a Denial Of Service (DOS) attack, where large quantities of queries or data are directed at the server in an attempt to overload it.
Denial Of Service attacks are a common problem for all Internet-based services, systems and networks: hackers use the normal process of sending queries, but in such quantities that the server potentially cannot cope. One way to address this potential vulnerability is to provide means to monitor the network traffic and the system server(s) themselves to check that no such attacks are in progress and that the system is running smoothly and coping with the number of interactions it is receiving. It is possible to provide means to track the trends in traffic on daily and weekly cycles, and to keep ahead of demand for the service as the system's popularity grows. Thus, if a DOS attack is launched, a sudden increase in activity beyond the normal cycles can be detected and defensive measures taken. These may involve blocking traffic from certain IP domains or addresses and, if necessary, closing down the system until the source of the attack has been identified and shut down.
Similarly, means to monitor activity on the system site and data flowing into and out of the database can identify discrepancies that signify hacker attack, at which point appropriate measures can be taken.
Another potential risk is that of inaccurate information being systematically fed into the system database. For example, a person might consider the artificial boosting of page rankings to artificially boost the traffic to their website by sending in large numbers of fictitious votes, or possibly to reduce the traffic going to a competitor's site by sending wrong analyses of pages.
As described above, a so-called dynamic page rating is obtained by counting votes that are cast whenever a user bookmarks a site. However, as enforced by the database schema, each unique user may only bookmark a page once. It would be theoretically possible to obtain many virtual USER_IDs by visiting the system site many times, each time clearing the cookie(s) from the browser, so that the system would consider the user new every time. A solution to this problem is to provide means to monitor the IP address from which new users are originating. Many users, for example more than 10 (especially over a short space of time, for example, 1 week, 1 day or 1 hour), originating from the same IP address can be detected and indicate a potential problem (especially if these users are indeed casting many votes for the same web page). Measures can then be taken to block this activity, for example by issuing all users from the same or a particular suspect IP address with the same USER_ID.
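The per-IP monitoring described above reduces to counting new USER_IDs per address within a window, as in this sketch (the threshold of 10 comes from the text; the single-window simplification is an assumption):

```python
from collections import defaultdict

def detect_suspect_ips(signups, threshold=10):
    """Flag IP addresses that have produced an unusual number of
    new USER_IDs within one monitoring window. signups is a list
    of (ip_address, user_id) pairs observed in that window."""
    per_ip = defaultdict(int)
    for ip, _user_id in signups:
        per_ip[ip] += 1
    return {ip for ip, n in per_ip.items() if n >= threshold}

signups = [("10.0.0.1", i) for i in range(12)] + [("10.0.0.2", 99)]
suspects = detect_suspect_ips(signups, threshold=10)
```

Flagged addresses could then be handled as the text suggests, for example by collapsing all further sign-ups from a suspect address onto a single USER_ID so that repeated votes no longer accumulate.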
To detect malicious activity where the applet is modified to transmit inaccurate analyses of pages, it is preferred that the applet cannot request which pages it analyses; rather, it is sent commands from the central server. Therefore, as long as the server includes means to keep track of which applets have been sent which spider-request URLs, this restricts the possibility of a malicious third party sending incorrect information for any web page at will.
A hacker might also or instead simply want to reduce the quality of information in the system as a whole. This risk can be alleviated by providing means to ensure that each web page is analysed by at least two different clients before being incorporated into the database. If these two clients do not agree on the analysis, then a third (or further) client is enlisted, and the two (or more) clients that agree have their data incorporated. This technique can be termed multiple-spidering. This technique introduces a potential overhead into the system that could reduce the spidering power of the system. However, there are normally so many clients performing spidering on behalf of the system that this is unlikely to be a significant issue, and indeed provides a further guarantee of the quality of the data. It is nevertheless possible to restrict the impact of this overhead, by close monitoring of aspects of the system.
Rather than multiple-spidering every single page that needs analysing, a small sample (say 1%) of pages may be multiple-spidered by default. If a significant number of these pages are rejected then this can be taken to indicate a problem of this kind. At this point (potentially automatically), the number of pages being multiple-spidered is gradually increased, up to the maximum of 100% so that the quality of the system's data is once more assured. At this point the malicious third party is defeated, no more contentious page analyses will be detected and the system can once again reduce the percentage of multiple-spidering back to the original low background level.
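The adaptive sampling policy above might be expressed as follows; the trigger threshold, step size and linear adjustment are all illustrative assumptions, the text specifying only a low background rate (say 1%) that rises towards 100% while disagreements persist and falls back afterwards:

```python
def adjust_sample_rate(rate, rejected, checked,
                       trigger=0.05, step=5.0, floor=1.0, cap=100.0):
    """Adapt the multiple-spidering percentage: raise it when too
    large a fraction of sampled page analyses disagree (suggesting
    tampering), and relax it back towards the floor once agreement
    returns. All numeric thresholds here are assumptions."""
    if checked and rejected / checked > trigger:
        return min(cap, rate + step)
    return max(floor, rate - step)

rate = 1.0
rate = adjust_sample_rate(rate, rejected=8, checked=100)        # disagreement spike
rate_after = adjust_sample_rate(rate, rejected=0, checked=100)  # all clear again
```

In steady state the overhead thus stays near the 1% floor, only growing while an attack is actually being detected.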
For single word or recognised-phrase queries, a list of URLs associated with that word or phrase is returned, ordered by their recommendation scores. Multiple word queries which appear in the phrase database will be treated in a similar fashion.
For multiple word queries which don't appear in the single word or phrase table, the product of the recommendation-rating for each word in the query is used to rank the URLs.
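The product-based ranking for multi-word queries not found in the phrase table can be sketched as below. The dictionary-based index structure and the rule that a URL must match every query term are illustrative assumptions consistent with taking a product of per-word recommendation ratings.

```python
def rank_urls(query_terms, term_index):
    """Rank URLs for a multi-word query absent from the phrase table.

    `term_index` maps each word to a dict of {url: recommendation_rating}.
    A URL's score is the product of its ratings for every query term, so
    only URLs matching all terms receive a score.  Names and structure
    are illustrative, not the patent's own data model.
    """
    scores = {}
    first, *rest = query_terms
    for url, rating in term_index.get(first, {}).items():
        score = rating
        for term in rest:
            r = term_index.get(term, {}).get(url)
            if r is None:
                score = None        # URL lacks this term; drop it
                break
            score *= r
        if score is not None:
            scores[url] = score
    # Highest combined recommendation rating first.
    return sorted(scores, key=scores.get, reverse=True)
```

Note that multiplying ratings rewards URLs that are moderately recommended for every term over URLs that are highly recommended for only one.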
In one embodiment, two copies of the term lists are maintained - one ordered by rank and the other ordered by URL ID. In an alternative embodiment, a word list associated with each URL is maintained. The best strategy for a given application can be readily determined by experiment.
The lists, and in particular their ordering, are advantageously updated lazily, whenever a query invokes them. A flag on each list indicates whether its ordering is up to date.
To ensure that all pages are kept up to date, each user-query also results in one or more 'processing-tax' URLs being sent to the user, unrelated to the user's query, but which the system wants updated. A list of URLs may be sent to have their modified dates checked, and those that have changed are processed as described above. The processing tax may have to be pitched as high as 33% or even 50% in order to achieve the level of spidering required but lower levels such as 15%, 10%, 5%, 2%, 1% or less can also be used. This can be tuned according to the needs of the system at a particular stage in its development as a web authority. The tax rate can also be dependent upon a user's bandwidth, for example to levy a higher tax on "rich" users, i.e. those people with high speed connections.
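Attaching processing-tax URLs to a query response could be sketched as follows. The function name, the rounding policy and the random draw from a pending list are illustrative assumptions; the tax rate itself is tunable exactly as the text describes.

```python
import random

def attach_processing_tax(result_urls, pending_urls, tax_rate=0.10, rng=random):
    """Append 'processing-tax' URLs to a query result.

    For a result of n URLs, roughly tax_rate * n additional URLs that
    the system wants re-checked are drawn from `pending_urls` (pages
    awaiting a modified-date check) and sent along with the results.
    The rate could be raised for users with high-speed connections.
    """
    n_tax = max(1, round(len(result_urls) * tax_rate)) if pending_urls else 0
    n_tax = min(n_tax, len(pending_urls))
    taxed = rng.sample(pending_urls, n_tax)
    return result_urls + taxed
```

The client checks the taxed URLs' modified dates alongside handling its own results, and re-spiders any that have changed.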
Any pages which have been non-contactable for a certain number of days are preferably deleted from the list. Less highly-rated pages will thus have a shorter time-to-live than more important pages. In other words, the time-to-live for a particular page is preferably determined by a function of that page's importance and the continuous amount of time it has been non-contactable.
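One possible form for the time-to-live function above is a linear one. The text specifies only that time-to-live grows with a page's importance; the linear form and the constants below are invented for illustration and are not taken from the specification.

```python
def should_delete(importance, days_uncontactable,
                  base_ttl_days=7, importance_weight=30):
    """Decide whether a non-contactable page should be dropped.

    `importance` is a 0..1 recommendation-based score.  The page's
    time-to-live grows with its importance, so highly rated pages
    survive longer outages than unimportant ones.  Constants are
    illustrative assumptions.
    """
    ttl = base_ttl_days + importance_weight * importance
    return days_uncontactable > ttl
```

Under these example constants an unimportant page is dropped after a week of failed contacts, while a maximally important one survives over a month.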
Preferably the data collection and processing (web page spidering) procedures described above as being performed by the applet are also implemented in corresponding code on the data collection server (although there may be no need to establish a socket connection). Following start-up the data collection server may then begin to populate the database using its own spidering processes. Initially, there will be relatively few users, so the system will have time to do its own searches. Gradually, however, it will be able to afford to spend less time searching and will have to spend more time interacting with clients. Concomitantly, as time goes on, the central data collection server will need
to do less and less spidering itself, instead relying on the resources of the growing numbers of users. There is thus an elegant trade-off between available CPU and bandwidth and the number of users. Furthermore as the size of the web expands exponentially in terms of number of documents so the number of potential system users spidering the web also grows exponentially.
Initial spidering is thus preferably done by the server itself, utilising the same optimised distributed mechanism that is used to manage the many client applets once the system site is up and running with a large set of regular users. In a preferred embodiment a special version of the applet is therefore provided which has no or a very limited GUI, which has the same interface to the server as the normal applet, and which spends substantially all its time requesting long lists of pages that need checking and returning the results in batches. This process is run on the server machine and on any other available machines with permanent Internet connections. One way to build up the initial list of URLs is to query DNS servers for the complete set of registered domain names (for example, for .co.uk domains); alternatively this information may be purchased (for example from Network Solutions for top-level .com domains).
The foregoing embodiments of the system have been described with reference to the Internet, but the present invention is also applicable to other networks such as intranets, extranets, local and wide area networks, WAIS (Wide Area Information Servers) -based networks and wireless networks. Moreover although it is preferable to employ Internet and web-based technology this is not essential and the invention may be adapted for use with other systems in which applications are shared between machines which communicate with each other, for example over a network. Thus the invention is also applicable to mobile phone-accessed networks such as networks accessed by means of i-mode or WAP (Wireless Application Protocol).
No doubt many other effective alternative arrangements will occur to the skilled person and it should be understood that the invention is not limited to the described
embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.