WO2001004783A2 - Method and apparatus for providing localized searching

Publication number
WO2001004783A2
Authority
WO
WIPO (PCT)
Prior art keywords
search
computer
web site
subscriber
report
Prior art date
Application number
PCT/US2000/018826
Other languages
French (fr)
Other versions
WO2001004783A3 (en)
Inventor
Miles B. Kehoe
Mark L. Bennett
Wolf L. Logan
Eric C. Williams
Aahz
Original Assignee
Searchbutton.Com
Priority date
Filing date
Publication date
Application filed by Searchbutton.Com
Priority to AU60848/00A (AU6084800A)
Publication of WO2001004783A2
Publication of WO2001004783A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • Web masters looking to provide a search function for their sites have had limited options.
  • One approach is to attempt to use a generic search engine such as Altavista™ or Hotbot™, along with search arguments that attempt to limit the search to a particular web site. For example, if the web master of "www.example.com" wanted to provide searching with Altavista™, they could develop a search that limits the results to "+url:www.example.com". Because the generic search engines will not spider the web master's site as frequently as she/he might update it, the results can become out of date quickly. Additionally, the search results from the generic search engines are not presented in a format that is consistent with a given site's format, but rather in the format of the search engine's other pages.
  • a subscriber can sign up her/his web site for indexing with a remote search.
  • the remote search provides a small search code to the subscriber.
  • the search code is an HTML link to a search form for the web site.
  • the subscriber then includes the search code on her/his web site.
  • the search form page and the search results page can be customized by the subscriber to look like the web site.
  • Some embodiments of the invention include a system comprising a web site computer, a visitor's computer, and a remote search computer.
  • the web site computer includes code for requesting a search form for the web site from the remote search computer.
  • the code is typically a link embedded in a web page on the web site.
  • the visitor's computer can access the code from the web site and then access the search form on the remote search computer.
  • the visitor's computer can then supply one or more search terms on the search form.
  • the remote search computer then performs a search of the web site using the one or more search terms.
  • the results are returned to the visitor computer.
  • the remote search computer is distinct from the web site computer.
  • Fig. 1 illustrates a system for providing localized search capabilities according to some embodiments of the invention.
  • Fig. 2 is a process flow diagram for subscribing a web site to the local search systems.
  • Fig. 7 is a process flow diagram for searching a web site according to some embodiments of the invention.
  • Figs. 8-9 illustrate an example of local search.
  • Fig. 10 illustrates a back end system used by some embodiments of the invention.
  • the network 106 is a network such as the Internet and/or combinations of other networks.
  • the network 106 includes a private intranet coupled via a firewall to the Internet.
  • the subscriber 100, the service provider 102, and the visitors 104A-B would be local to the intranet while the remote search 108 could be located outside the intranet and coupled in communication with the intranet.
  • the subscriber 100 is anyone with authority to request a search feature for a given web site (e.g. the web site stored at the service provider 102 as the web pages 112A-B).
  • the subscriber 100 accesses network 106 with a computer.
  • the subscriber 100 is a web master for a particular site, e.g. the intranet administrator, an individual for their personal home pages, a site maintainer, a content manager, a support manager, etc.
  • the web pages 112A-B can be standard hypertext markup language (HTML), extensible markup language (XML), portable document format (PDF), Microsoft™ Office™ documents and/or other types of web pages.
  • Although the web site is described as hosted on a service provider 102 that the subscriber does not control, the subscriber could also be in control of the web site. This might arise when a company has a web site hosted on a computer they control, but they prefer to use the remote search 108 to avoid the need to deploy customized search software.
  • the only authority the subscriber 100 has at the service provider 102 is the ability to update files within her/his web site directory.
  • the visitors 104A-B are visitors using computers to access the web site over the network 106. Visitors can use standard web browsers such as Netscape™ Navigator™, from Netscape Communications, Mountain View, California, to access the web site. Using the web browser, the visitors 104A-B can view web pages (e.g. the web pages 112A-B) of the web site and follow links on the web pages.
  • the subscriber UI 114 can include options to allow subscribers (e.g. the subscriber 100) to customize the appearance of the search form page and the search results page for their web site generated by the search UI 116 for visitors (e.g. the visitors 104A-B). This allows the search form page and search results page generated by the search UI 116 to look more like the web site itself, e.g. colors, logos, fonts, and/or other elements.
  • the search UI 116 provides an interface to visitors to the search function of web sites subscribed to the remote search 108.
  • the search UI 116 provides a search form page for visitors (e.g. the visitors 104A-B) to enter search terms and a search results page for showing visitors the search results.
  • the search system 118 comprises the back end components of the remote search 108.
  • the search system 118 includes indices, databases, site lists, subscriber user interface data, spider processes, and/or database engines. Spider processes are processes for working with portions of web sites, e.g. pages. Spiders are also sometimes called crawlers.
  • the term spiders refers to the various processes used by the search system 118 to retrieve, index, and/or process web sites.
  • the search system 118 is described more fully in connection with Figures 10 and 11.
  • multiple levels of service are offered by the remote search 108.
  • a free advertising based level of service and a subscription level of service are offered.
  • subscribers such as the subscriber 100 pay no fees, but their search form page and/or search results page may include advertising.
  • subscribers such as the subscriber 100 pay a fee, e.g. $300/year, to receive the search feature and no advertising is shown.
  • the system can automatically revert, or degrade, to the advertising subscription by inserting advertising rather than disconnecting the search feature.
  • This degradation can also be used in the provision of other types of services over the Internet with multiple levels of services. For example, this could be extended to Internet chat services, bulletin board services, web provided services, and/or other services provided over the Internet.
  • the subscriber will be able to verify that her/his web site has been fully indexed.
  • the subscriber is not required to create the search form manually, but rather simply inserts the search code in web pages on the web site to enable the remote search 108.
  • Figure 2 is a process flow diagram for subscribing a web site to the local search according to some embodiments of the invention. This could be used by the remote search 108 to allow subscribers (e.g. the subscriber 100) to request the search service for their web sites.
  • Figures 3-6 are used to illustrate the subscriber sign up process according to the process of Figure 2.
  • the subscriber 100 signs up for the search service using the subscriber UI 114.
  • the subscriber UI 114 presents a fill out HTML form over the World Wide Web to the subscriber 100.
  • Figure 3 shows the main page of the web site of the subscriber 100, the web page 112A.
  • the web site in this example is a homeowners association web site at <http://www.ventanadelmar.org/>.
  • Figure 4 shows the web site entrance to the subscriber UI 114.
  • Existing subscribers can enter by using their subscriber information in area 400 and new subscribers such as the subscriber 100 can enter through the sign up link 402. The features and functions available to existing subscribers are discussed in greater detail below.
  • the subscriber clicks on the sign up link 402 and is presented with a fill out HTML form shown in Figure 5 to subscribe to the search function.
  • the form 500 includes a number of questions that provide the remote search 108 the information to sign up the subscriber and identify the web site.
  • the subscriber 100 is asked to provide her/his electronic mail address in form area 502, select a password in form area 504, and identify her/his web site in form area 506.
  • the subscriber would provide the address "vdm@ventanadelmar.org" in form area 502, a password in form area 504, and the uniform resource indicator (URI) for the web site in form area 506 (e.g. "http://www.ventanadelmar.org/").
  • the subscriber is offered a selection of service levels.
  • the form area 508 allows the subscriber 100 to select between a free service and a paid service. If the subscriber 100 selects the paid service, she/he can be prompted to provide additional payment information on a separate fill out form.
  • two additional questions are asked.
  • One question concerns whether or not the web site includes adult content. This allows the remote search 108 to ensure that adult related advertising is not provided to non-adult sites.
  • Another question asked by some embodiments of the invention is whether or not the subscriber 100 has the authority to request the search function for the web site. This is asked to assure that the person subscribing the web site has the authority to grant permission to index the site for intellectual property reasons, e.g.
  • the web site may be categorized by the user. This could be used to distinguish between commercial, non-profit, and private sites as well as identify the topic of the site, e.g. "Finances". Advertising preferences may be available, e.g. to allow the user to select different types or categories of ads. Other marketing and demographic questions might also be asked. All of these questions serve several purposes. One purpose is to help the provider of the remote search better understand the subscribers. Another purpose is to help the subscribers and the remote search 108 select the best advertising for the site when the free service is used.
  • Returning to Figure 2, the process can operate in parallel. The remote search 108 will begin to index the web site at step 204. This is discussed in greater detail in conjunction with Figure 11.
  • Table 1 includes representative examples of HTML versions of the search code for inclusion on the web site of the subscriber 100. This makes adding search capabilities to a web site as simple as adding a link in HTML.
  • the search code is provided as part of an electronic mail message to the subscriber 100 with instructions for adding the search code to a web page.
  • a revised version of the home page, e.g. the web page 112A, is provided to the subscriber by electronic mail with the search code included.
  • Other embodiments use other techniques for communicating the link (e.g. posting it on the remote search 108).
  • the electronic mail message with the search code may contain hypertext links to instructions for including the search code on a web page (e.g. the web pages 112A-B). Once the search code is included on the web page (e.g. the web page
  • the web site is search enabled.
  • Figure 6 shows the web site of the subscriber 100 after it is search enabled, such as by a search button 600 that links to a search form page.
  • the subscriber 100 can modify the appearance of the search form page and the search results page to better match the style and look of her/his web site. This can also be directly accessed through the subscriber UI 114 when a subscriber (e.g. the subscriber 100) provides her/his information in area 400.
  • Typical options for customizing the appearance of the search form page and search results page include: specifying a title, options for providing the URI of a banner image, options for providing the URI of a logo image, options for selecting colors for page elements, options for providing the URI of a background image, and/or other options. These options allow the subscriber 100 to blend the appearance of the pages provided by the search UI 116 to visitors to match the appearance of the web site. In some embodiments, colors for the search form page and search results page are automatically selected based on color selections in the home page of the web site of the subscriber 100. For example, if the subscriber has a black background with yellow text on her/his home page, then the remote search could automatically provide those colors as a default option for the subscriber 100.
  • the subscriber 100 can select from several different layouts for the search form and search results. In other embodiments, the subscriber 100 can design a custom layout for the search form and the search results. These layouts can control which elements appear on the search form and the search results and where those elements appear. The customization process can be performed using the subscriber UI 114.
  • the subscriber UI 114 also provides several options to subscribers (e.g. the subscriber 100) for managing the search functionality.
  • Status information informs the subscriber 100 about when her/his web site was last indexed and/or other information, e.g. when it will next be indexed, how many pages were in the site, etc.
  • Maintenance options allow the subscriber 100 to manually request that her/his web site be re-indexed.
  • the subscriber 100 can update the appearance options for her/his search form page and search results page at any time as well.
  • the subscriber 100 can request a list of the most popular searches for a given time period, e.g. last month, last quarter, etc. This lets the subscriber 100 understand what visitors (e.g. the visitors 104A-B) are interested in finding on her/his web site and/or what the visitors are having difficulty finding on her/his web site. For example, if "driving directions" is the most common search, the subscriber 100 could modify her/his web site to make links to driving directions more prominent.
  • Other embodiments of the invention provide reports on the most frequent users of the search function in a given time period. This allows the subscriber 100 to understand who is searching her/his web site, e.g. users from America Online™.
  • Another type of report provided by some embodiments of the invention is a summary of searches that returned no results. This allows the subscriber 100 to better understand what visitors were looking for and perhaps modify web pages or extend her/his web site to include the information. For example, if visitors were frequently searching for "prices", the subscriber 100 could extend her/his web site to include the basic pricing for her/his services.
  • one embodiment of the invention allows subscribers to access the raw search data, comprising search terms and result information. Still other embodiments allow statistics from the remote search 108 to be viewed on a web page at a subscriber's web site. Also, some embodiments provide click-through information to the subscriber. Click-through information tells the subscriber which pages in the results were most often clicked on by visitors.
  • Figure 7 is a process flow diagram for searching a web site according to some embodiments of the invention. This could be used by visitors (e.g. the visitors 104A-B) to search the web site of the subscriber 100.
  • a visitor (e.g. the visitor 104A) selects the search button on a web page (e.g. the web page 112A).
  • the visitor 104A could click on the search button 600 of Figure 6.
  • the remote search 108 provides a search form page via the search UI 116.
  • the search form page might look like the search form page of Figure 8. This could be the search page reached when a visitor clicks on a link provided by the search code such as the search button 600.
  • the search form page includes a logo 804 selected by the subscriber 100 and a subscriber selected title 806, e.g. "Search Page".
  • the visitor 104A enters her/his search terms into the search form page.
  • the user could type "gondola" into the area 800 and signal on the search button 802. Additional options can be provided to allow for help with searching and using more advanced search techniques, e.g.
  • the free service of the remote search 108 is shown in Figure 8. As such, advertising appears on the search form page. Using the paid service, the advertising above the logo 804 would be omitted and/or replaced with subscriber selected advertising.
  • the search terms can actually be more complex than keywords: visitors can search for documents modified since a specific date and/or construct boolean search expressions.
  • the search code includes a hyperlink to a "What's New" query that could be displayed alongside the search button 600.
  • the remote search 108 can display all documents modified within a predetermined period, e.g. the last 30 days. In some embodiments, the predetermined period is selected by the subscriber.
  • the remote search 108 provides the search results page to the visitor.
  • the search results page includes hyperlinks to pages containing the search terms the visitor can click on. When the visitor clicks on the hyperlink, the visitor will be shown the corresponding page. In some embodiments, the visitor can enter a new search directly into the search results page.
  • the search for "gondola" resulted in the search results page shown in Figure 9.
  • the subscriber provided logo 804 and title 806 can appear.
  • the results can appear in context using one entry (e.g. the entry 900) for each matching page.
  • a score 902 may be shown for each document to indicate how highly the document ranked relative to others with the search terms.
  • the HTML title of the document may be shown as a link to the document 904.
  • a description 906 of the document may follow along with an indication of the date the document was last modified 908.
  • Area 910 allows a visitor to submit an additional search directly from the search results page. If appropriate, advertising may appear on the search results page.
  • Figure 10 illustrates a back end system used by some embodiments of the invention. This could be used to provide a highly distributed implementation of the remote search 108.
  • subscribers (e.g. the subscriber 100), visitors (e.g. the visitor 104A), and server administrators (e.g. the administrator 1000)
  • the director 1002 might include an IP traffic director such as the Cisco DistributedDirector, from Cisco Systems, Inc., San Jose, California. This provides traffic distribution between geographically dispersed sites. This allows the remote search 108 to be geographically distributed with automatic load balancing.
  • additional local directors 1004-1008 may be used to further distribute the different functions of the remote search 108.
  • a Cisco LocalDirector from Cisco Systems, Inc., San Jose, California, may be used as the local directors 1004-1008.
  • the local directors 1004-1008 balance loads across servers performing the same tasks.
  • the local director 1004 balances loads across the computers providing the subscriber UI 114.
  • the local director 1006 balances loads across the computers providing the search UI 116A-C.
  • the local director 1008 balances loads across the computers providing an administrator UI 1010 to the remote search 108.
  • the distributed local subsystems are coupled to the search system 118.
  • the local directors 1004-1008 also provide fail-over capabilities.
  • each search UI 116A-C providing live searching to visitors has a local copy of the current index 1018 separate from the search system 118. This improves performance and reliability.
  • two of the search UI 116A-C can be providing active searches while another is being loaded with the most current indices. Once the new indices are verified, the inactive search UI is brought into active service with the new index. Then, one of the other search UIs is made inactive.
  • Some embodiments of the invention do not include either the director 1002 or the local directors 1004-1008; others include only some of the local directors 1004-1008, based on what sort of load balancing features are desired by the operator of the remote search 108.
  • the search system 118 includes spiders and database engines 1022.
  • the search system 118 also includes user interface data 1014, sites 1016, a database 1020, and an index 1018.
  • a file system such as the file system 1024 may be coupled to the search system 118.
  • the file system 1024 can be used to store web pages and other information for the remote search 108.
  • the file system 1024 can be accessed by the subscriber UI 114, the search UI 116, and the administrator UI 1010 as appropriate.
  • the UI data 1014 includes the appearance customization provided by subscribers (e.g. the subscriber 100) and is used by the search UI 116A-C to generate the search form page and search results page according to subscriber preferences.
  • the UI data 1014 is stored in a database such as the database 1020. In other embodiments it is kept in a separate location.
  • the sites 1016 is a list of the sites to be indexed. In some embodiments, the sites 1016 is included as a table within the database 1020. In other embodiments, the sites 1016 are kept in a separate location. In some embodiments, the sites 1016 includes a list of uniform resource indicators (URIs) for sites that are indexed. The sites 1016 may also include other information such as type of content, contact information, meta-data about the web site, subscription information including payment information, and/or other information. For example, the site 1100A might correspond to the homeowners association web site and include the URI of the web site: "http://www.ventanadelmar.org".
  • the index 1018 is an index of web pages. Each index 1018 can include the search results for multiple web sites in the sites 1016. After the index 1018 is updated and verified, it can be transferred to one of the computers serving as the search UI 116A-C. This provides a high degree of reliability and reduces contention for access to the index 1018 because only the spiders in the search system 118 directly access the index 1018.
  • the search UI 116A-C can access distinct copies.
  • the database 1020 is used to maintain state information by the various spiders. This supports a highly parallel and highly distributed process for indexing subscriber web sites as described in conjunction with Figure 11.
  • the database engines allow the spiders to access the database 1020 as needed. In a typical embodiment, an SQL database is used as the database 1020.
  • Figure 11 is a process flow diagram for indexing a web site according to some embodiments of the invention.
  • the process is designed to be highly distributed and thus be capable of operating in a highly parallel fashion as well. Each of the steps can occur simultaneously on appropriate data.
  • while the dispatcher spider 1102 is operating on the sites 1016, the indexer spider 1114 can be adding to the index 1018.
  • the process will be described from start to finish for a single web page on a single web site.
  • the dispatcher spider 1102 reads the address of a web site from the sites 1016, e.g. the site 1100A.
  • the dispatcher then adds the appropriate pages to the database 1020 in the page table. For example, consider how the dispatcher might operate on the site 1100A, "http://www.ventanadelmar.org/".
  • the first step might be to add pages 1101A-C to the database 1020 for standard web page locations, e.g. variations of "index.html", "index.shtml", "default.htm", etc. So for example, the page 1101A might be
  • the frequency with which a particular web site, e.g. the site 1100C, is re-indexed may depend on system rules, e.g. once every twenty-four hours automatically, and subscriber requests, e.g. index my web site now.
  • the pre-filter spider 1104 verifies that the page (e.g. 1101A) should be indexed by testing the page against some rules.
  • Typical rules may include limiting the index to pages no more than n levels of links deep and limiting the index to pages within the same web tree, e.g. within the "www.ventanadelmar.org/" web space.
  • a "robots.txt", or equivalent file, for robots associated with the web site can be considered at the pre-filtering stage.
  • the robots.txt file is used as part of the robot exclusion standard for describing the pages that should not be indexed by spiders and search engines.
  • the pre-filter spider 1104 may use certain rules based on the multi-purpose Internet mail extensions (MIME) type of a page (e.g. the page 1101A) and/or the file extension (e.g. ".html"). Pages that should be indexed can be flagged in the database for the retrieve head spider 1106.
  • the retrieve head spider 1106 retrieves the header portion of web pages marked for indexing in the database 1020.
  • the retrieve head spider 1106 is retrieving the web page 112A from the service provider 102.
  • the header can be retrieved separately from the body to save bandwidth and processing time.
  • the header information can be stored in the database 1020 for access by the post-filter spider 1108.
  • the post-filter spider 1108 analyzes the header information to further determine if the document should be indexed, or re-indexed. For example, if the last modified date has not changed from the date of the document as it currently appears in the index, then the web page can be skipped. Otherwise, the page is marked in the database for retrieval. Other rules can exclude certain types of documents, e.g. image files, or documents of certain sizes, e.g. documents under 1 kB.
  • the retrieve body spider 1110 retrieves the web pages marked by the post-filter spider 1108.
  • the retrieve body spider 1110 retrieves the body of the web page 112A from the service provider 102.
  • the body can be stored in the database 1020 or a queue pending further processing.
  • the analyzer spider 1112 analyzes the retrieved web pages. Additional pages may be added to the database 1020 as a result of the analysis.
  • the analyzer spider 1112 can also extract the title of the page and generate a checksum for the contents. If the checksum is computed based on a normalized version of the retrieved page, the checksum will remain constant irrespective of minor changes to advertising banners, etc. This allows an additional determination to be made as to whether or not the web page has changed and should be re-indexed. Additionally, the analyzer spider 1112 can identify hyperlinks to new documents and add those documents to the pages 1101A-C for processing by the spiders. As appropriate, a META tag corresponding to directives for robots for each web page can be used to control the analysis process.
  • the indexer spider 1114 takes the body content from the queue and indexes it in the index 1018.
  • the index 1018 is rolled out to the search UI 116 once the index has been verified. This is used by some embodiments of the invention to ensure high availability of the indexes by reducing contention between spidering processes and visitor searches.
  • collections of data other than web sites are indexed.
  • an electronic collection of documents stored on a file system could be indexed by some embodiments of the invention.
  • indexes could be generated for net news articles, electronic mail archives, and/or the contents of a database.
  • any electronic data collection could be remotely searched using embodiments of the current invention.
  • the HTTP referrer field is used by the remote search 108 to match the search service with the search site.
  • the referrer field is used as secondary confirmation that the site id requested matches the referring site. For example, if "http://www.example.com/" is indexed by the remote search with id 12345 and "http://www.company.com/" is indexed with id 12346, then the referrer field could act as a double check on the site id.
  • the remote search could respond with a configuration error if the referrer and the site id do not match.
  • the id 12346 goes with referrers from "http://www.company.com/", so visitors from "http://www.example.com/" would see an error message.
  • the referrer would override the provided site id and the search form for "http://www.example.com/" would be provided.
  • the web browser itself could be used as the search form.
  • the search code could be a reference to a plug-in and/or a Java applet that provides the search form.
  • Other embodiments allow the location area of the web browser to be used as the search form, e.g. instead of typing a URI in the location a visitor types her/his search terms and presses enter after clicking on a link provided by the search code.
  • the remote search 108 is included in one or more computer usable media such as CD-ROMs, floppy disks, a hard disk installed on a computer and/or other media.
  • the electromagnetic wave form comprises information such as the remote search 108 and/or the search code.
  • the subscriber UI 114 might be accessed by a subscriber 100 over a network.
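
The pre-filter, post-filter, and checksum rules described for the indexing spiders can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the source: the function names, the depth limit of 3, and the `<!--ad-->` comment markers used to locate advertising are all hypothetical.

```python
import hashlib
import re
from urllib.parse import urlsplit

MAX_DEPTH = 3  # stands in for "n levels of links deep"; the value is assumed


def pre_filter(url, depth, site_host):
    """Keep pages within the depth limit and inside the site's own web space."""
    return depth <= MAX_DEPTH and urlsplit(url).hostname == site_host


def post_filter(last_modified, indexed_last_modified):
    """Decide from header data whether the body should be retrieved.

    A page with no Last-Modified value is always retrieved (an assumption).
    """
    return last_modified is None or last_modified != indexed_last_modified


def normalized_checksum(html):
    """Checksum a page after removing regions marked as advertising.

    Assumption: ads are wrapped in <!--ad--> ... <!--/ad--> comments.
    """
    body = re.sub(r"<!--ad-->.*?<!--/ad-->", "", html, flags=re.DOTALL)
    body = re.sub(r"\s+", " ", body).strip()
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Because the checksum is taken over the normalized text, two retrievals of a page that differ only in the marked banner content compare equal, so the page would not need to be re-indexed.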
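
The referrer check described above, including both the error behavior and the override behavior, can be sketched in Python. The site ids and URLs come from the example in the text; the `SITES` mapping, the function name `resolve_site`, the `override` flag, and the choice to accept a request that carries no referrer are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Site ids from the example in the text; the mapping structure is assumed.
SITES = {12345: "www.example.com", 12346: "www.company.com"}


def resolve_site(site_id, referrer, override=False):
    """Return the site id to search, or None to signal a configuration error."""
    if site_id not in SITES and not override:
        return None
    host = urlsplit(referrer).hostname if referrer else None
    if host is None:
        # Assumption: a missing referrer cannot contradict the site id.
        return site_id if site_id in SITES else None
    if SITES.get(site_id) == host:
        return site_id
    if override:
        # Variant in which the referrer overrides the provided site id.
        for known_id, known_host in SITES.items():
            if known_host == host:
                return known_id
    return None
```

With `override=False`, a request for id 12346 arriving from "http://www.example.com/" yields `None`, which the caller could render as the error message; with `override=True` the same request resolves to id 12345.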

Abstract

A method of providing a remotely hosted local search for a web site is described. A subscriber can sign up her/his web site for indexing with a remote search. The remote search provides a small search code to the subscriber. Typically, the search code is an HTML link to a search form for the web site. The subscriber then includes the search code on her/his web site. Visitors to the web site can signal on the search code provided link to receive a search form page from the remote search. The visitor can enter her/his search terms to receive a search results page from the remote search. The search form page and the search results page can be customized by the subscriber to look like the web site. The remote search can index the web site periodically, and upon a subscriber request, so that the most current results are provided. In some embodiments, multiple levels of service are offered, e.g. free service supported by advertising and a subscriber paid service. The remote search works well for subscribers with Internet Service Provider hosted web sites because there is no need for software to be installed or any need to do anything other than modify the web pages from which the subscriber wants the search feature available.
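
As an illustration of the search code mentioned in the abstract, the following Python sketch builds an HTML link to a hosted search form for one subscribed site. The host name `search.example.net`, the `/form` path, and the `site` query parameter are assumptions; the source does not specify the format of the link.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical host for the remote search; the source does not name one.
SEARCH_HOST = "https://search.example.net"


def make_search_code(site_id, label="Search this site"):
    """Return an HTML link that opens the hosted search form for one site."""
    query = urlencode({"site": site_id})
    href = f"{SEARCH_HOST}/form?{query}"
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


if __name__ == "__main__":
    # The subscriber would paste this snippet into a page on her/his web site.
    print(make_search_code(12345))
```

Keeping the snippet to a single link reflects the point that the subscriber only needs to edit web pages and does not need to install software on the hosting computer.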

Description

METHOD AND APPARATUS FOR PROVIDING LOCALIZED SEARCHING
BACKGROUND OF THE INVENTION
Field of the Invention
This invention relates to the field of Internet site development and use. In particular, the invention relates to a method for providing localized search capabilities using a remote search host.
Description of the Related Art
Web masters looking to provide a search function for their sites have had limited options. One approach is to use a generic search engine such as Altavista™ or Hotbot™, along with search arguments that attempt to limit the search to a particular web site. For example, if the web master of "www.example.com" wanted to provide searching with Altavista™, she/he could develop a search that limits the results to "+url:www.example.com". Because the generic search engines will not spider the web master's site as frequently as she/he might update it, the results can become out of date quickly. Additionally, the search results from the generic search engines are not presented in a format that is consistent with a given site's format, but rather in the format of the search engine's other pages.
Another approach is to install custom search software on the web site itself. One example is Ultraseek™ from Infoseek Software, Sunnyvale, California. Using Ultraseek™, a web master installs custom software on a server machine and can then set up the server to search her/his web site. This requires the web master to have the ability to set up the Ultraseek™ software on a server and to have the disk space, memory space, and technical skill to host the information. These software packages can be expensive, e.g. $995 for a basic license for Ultraseek™. Further, they depend on the user's ability to install specialized software on a server machine.
Many web sites are remotely hosted, such as on Internet service provider (ISP) computers. These web sites typically do not have the ability to run dedicated search servers for users. Further, users are restricted; as such, they typically are not allowed to install software. The previous techniques do not allow a web master to easily set up a search feature for her/his site without installing customized software or relying on a general search engine. Accordingly, what is needed is an improved method of providing localized searches.
SUMMARY OF THE INVENTION
A method of providing a remotely hosted local search for a web site is described. This allows web sites to easily offer a search capability to visitors without the use of specialized programs or software on the web site. Instead, a small piece of HTML code can be added to the web site to allow visitors to access a remotely hosted search feature.
Visitors of a web site can search the web site by clicking on a link on the web site. The link connects the visitor's web browser to a search form generated by the remote search. The visitor can provide her/his search terms on the form and the remote search will prepare search results with matching portions of the web site.
A subscriber can sign up her/his web site for indexing with a remote search. The remote search provides a small search code to the subscriber. Typically, the search code is an HTML link to a search form for the web site. The subscriber then includes the search code on her/his web site. The search form page and the search results page can be customized by the subscriber to look like the web site.
The remote search can index the web site periodically, and upon a subscriber request, so that the most current results are provided. In some embodiments, multiple levels of service are offered, e.g. free service supported by advertising and a subscriber paid service. The remote search works well for subscribers with Internet Service Provider hosted web sites because there is no need for software to be installed or any need to do anything other than modify the web pages from which the subscriber wants the search feature available.
Some embodiments of the invention include a system comprising a web site computer, a visitor's computer, and a remote search computer. The web site computer includes code for requesting a search form for the web site from the remote search computer. The code is typically a link embedded in a web page on the web site.
The visitor's computer can access the code from the web site and then access the search form on the remote search computer. The visitor's computer can then supply one or more search terms on the search form.
The remote search computer then performs a search of the web site using the one or more search terms. The results are returned to the visitor computer. The remote search computer is distinct from the web site computer.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 illustrates a system for providing localized search capabilities according to some embodiments of the invention.
Fig. 2 is a process flow diagram for subscribing a web site to the local search systems.
Figs. 3-6 illustrate an example of the set up of the local search.
Fig. 7 is a process flow diagram for searching a web site according to some embodiments of the invention.
Figs. 8-9 illustrate an example of local search.
Fig. 10 illustrates a back end system used by some embodiments of the invention.
Fig. 11 is a process flow diagram for indexing a web site according to some embodiments of the invention.
DETAILED DESCRIPTION
The remotely hosted local search system enables web sites to easily be search enabled by adding a small amount of hypertext markup language (HTML) code to a web page. This in turn allows visitors of the web sites to search for content within the web site. No software needs to be installed at the web site; therefore, even web sites hosted by an Internet service provider for individual users can be easily search enabled without software or common gateway interface (CGI) programs. A program is a sequence of instructions that can be executed on a computer. A computer refers to a computer, a group of computers coupled in communication, and/or some other type of computing device. The remote search can offer this service for free with advertising support and/or as a paid subscription service for web site owners, also called subscribers. Further, the remote search can allow subscribers to customize the appearance of the search forms and search results provided to users so that they match the appearance of the web site. The description of some embodiments of the invention is organized as follows. First, an overview of the system components is provided along with a discussion of the general operation of the system. Then, the process of subscribing to the remote search is discussed in conjunction with an example. Finally, the use of the search system by visitors is discussed in conjunction with an example.
A. System Overview
Figure 1 illustrates a system including some embodiments of the invention. This could be used to provide remotely hosted searching for web sites hosted throughout the Internet.
The following paragraph lists the elements of Figure 1 and describes their interconnections. Figure 1 includes a subscriber 100, a service provider 102, visitors 104A-B, a network 106, and a remote search 108. The remote search 108 includes a subscriber user interface (UI) 114, a search UI 116, and a search system 118. The service provider 102 includes a data 110. The data 110 includes a web site comprised of the web pages 112A-B. The subscriber 100, the service provider 102, the visitors 104A-B and the remote search 108 are coupled in communication with the network 106.
The following describes the uses of the elements of Figure 1. The network 106 is a network such as the Internet and/or combinations of other networks. For example, in some embodiments, the network 106 includes a private intranet coupled via a firewall to the Internet. In such an embodiment, the subscriber 100, the service provider 102, and the visitors 104A-B would be local to the intranet while the remote search 108 could be located outside the intranet and coupled in communication with the intranet.
The subscriber 100 is anyone with authority to request a search feature for a given web site (e.g. the web site stored at the service provider 102 as the web pages 112A-B). The subscriber 100 accesses network 106 with a computer. Typically, the subscriber 100 is a web master for a particular site, e.g. the intranet administrator, an individual for their personal home pages, a site maintainer, a content manager, a support manager, etc.
The web pages 112A-B can be standard hypertext markup language (HTML) web pages, extensible markup language (XML) web pages, images, portable document format (PDF) files, Microsoft™ Office™ documents, and/or other types of web pages. Although in this example, the web site is hosted on a service provider 102 that the subscriber does not control, the subscriber could also be in control of the web site. This might arise when a company has a web site hosted on a computer they control, but they prefer to use the remote search 108 to avoid the need to deploy customized search software. In this example, the only authority the subscriber 100 has at the service provider 102 is the ability to update files within her/his web site directory.
The visitors 104A-B use computers to access the web site over the network 106. Visitors can use standard web browsers such as Netscape™ Navigator™, from Netscape Communications, Mountain View, California, to access the web site. Using the web browser, the visitors 104A-B can view web pages (e.g. the web pages 112A-B) of the web site and follow links on the web pages.
The remote search 108 receives subscription requests over the subscriber UI 114 from subscribers (e.g. the subscriber 100). Using the subscriber UI 114, the subscribers (e.g. the subscriber 100) can request that their web site be indexed for search capabilities. The remote search 108 will index the web site via the network 106 using the search system 118. The subscribers (e.g. the subscriber 100) will be provided a small piece of search code. The subscribers (e.g. the subscriber 100) can include the search code in their web site (e.g. by inserting it in the web page 112A). Typically, the search code is an HTML code for including a link to a search form page.
The subscriber UI 114 can include options to allow subscribers (e.g. the subscriber 100) to customize the appearance of the search form page and the search results page for their web site generated by the search UI 116 for visitors (e.g. the visitors 104A-B). This allows the search form page and search results page generated by the search UI 116 to look more like the web site itself, e.g. colors, logos, fonts, and/or other elements.
The search UI 116 provides an interface to visitors to the search function of web sites subscribed to the remote search 108. The search UI 116 provides a search form page for visitors (e.g. the visitors 104A-B) to enter search terms and a search results page for showing visitors the search results. The search system 118 comprises the back end components of the remote search 108. For example, the search system 118 includes indices, databases, site lists, subscriber user interface data, spider processes, and/or database engines. Spider processes are processes for working with portions of web sites, e.g. pages. Spiders are also sometimes called crawlers. As used herein, the term spiders refers to the various processes used by the search system 118 to retrieve, index, and/or process web sites.
The search system 118 is described more fully in connection with Figures 10 and 11. In some embodiments, multiple levels of service are offered by the remote search 108. In one embodiment, a free advertising based level of service and a subscription level of service are offered. In some embodiments, with the free advertising based level of service, subscribers such as the subscriber 100 pay no fees, but their search form page and/or search results page may include advertising. In some embodiments, with the subscription level of service, subscribers such as the subscriber 100 pay a fee, e.g. $300/year, to receive the search feature and no advertising is shown. According to some embodiments, if the subscriber 100 does not pay initially, or when it is time to renew their subscription level of service, the system can automatically revert, or degrade, to the advertising based level of service by inserting advertising rather than disconnecting the search feature.
This degradation can also be used in the provision of other types of services over the Internet with multiple levels of services. For example, this could be extended to Internet chat services, bulletin board services, web provided services, and/or other services provided over the Internet.
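The degradation rule described above can be sketched as follows. This is only an illustrative sketch: the `Subscription` record, the level names "free" and "paid", and the `paid_through` field are assumptions introduced here and are not part of the described system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Subscription:
    requested_level: str           # "free" or "paid" (hypothetical labels)
    paid_through: Optional[date]   # None if no payment has been received


def effective_level(sub: Subscription, today: date) -> str:
    """Return the level of service that should be applied today."""
    paid_up = sub.paid_through is not None and sub.paid_through >= today
    if sub.requested_level == "paid" and paid_up:
        return "paid"
    # Unpaid or lapsed: degrade to the advertising based level
    # instead of disconnecting the search feature.
    return "free"


def show_advertising(sub: Subscription, today: date) -> bool:
    """Advertising is inserted whenever the effective level is the free one."""
    return effective_level(sub, today) == "free"
```

The search feature is always served. Only the presence of advertising changes, so a lapsed payment never breaks the search link on the subscriber's web site.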
This provides a number of advantages for the subscriber over previous systems. The subscriber will be able to verify that her/his web site has been fully indexed. The subscriber is not required to create the search form manually, but rather simply inserts the search code in web pages on the web site to enable the remote search 108.
The process for signing a web site up for the remote search 108 will now be described. Then, the process for visitors to use the remote search 108 will be described.
B. Subscriber Setup
Figure 2 is a process flow diagram for subscribing a web site to the local search according to some embodiments of the invention. This could be used by the remote search 108 to allow subscribers (e.g. the subscriber 100) to request the search service for their web sites. Figures 3-6 are used to illustrate the subscriber sign up process according to the process of Figure 2.
First, at step 202, the subscriber 100 signs up for the search service using the subscriber UI 114. In some embodiments, the subscriber UI 114 presents a fill out HTML form over the World Wide Web to the subscriber 100. In this example, Figure 3 shows the main page of the web site of the subscriber 100, the web page 112A. The web site in this example is a homeowners association web site at <http://www.ventanadelmar.org/>. At present, the web site as shown in Figure 3 lacks a search capability. Figure 4 shows the web site entrance to the subscriber UI 114. Existing subscribers can enter by using their subscriber information in area 400 and new subscribers such as the subscriber 100 can enter through the sign up link 402. The features and functions available to existing subscribers are discussed in greater detail below.
In this example, the subscriber clicks on the sign up link 402 and is presented with a fill out HTML form shown in Figure 5 to subscribe to the search function. The form 500 includes a number of questions that provide the remote search 108 the information to sign up the subscriber and identify the web site. The subscriber 100 is asked to provide her/his electronic mail address in form area 502, select a password in form area 504, and identify her/his web site in form area 506. Here, the subscriber would provide the address "vdm@ventanadelmar.org" in form area 502, a password in form area 504, and the uniform resource identifier (URI) for the web site in form area 506 (e.g. "http://www.ventanadelmar.org/").
In some embodiments, the subscriber is offered a selection of service levels. In this example, the form area 508 allows the subscriber 100 to select between a free service and a paid service. If the subscriber 100 selects the paid service, she/he can be prompted to provide additional payment information on a separate fill out form.
In some embodiments, two additional questions are asked. One question concerns whether or not the web site includes adult content. This allows the remote search 108 to ensure that adult related advertising is not provided to non-adult sites. Another question asked by some embodiments of the invention is whether or not the subscriber 100 has the authority to request the search function for the web site. This is asked to assure that the person subscribing the web site has the authority to grant permission to index the site for intellectual property reasons, e.g. copyright and trademark restrictions. Still other embodiments may ask additional questions. For example, the web site may be categorized by the user. This could be used to distinguish between commercial, non-profit, and private sites as well as identify the topic of the site, e.g. "Finances". Advertising preferences may be available, e.g. to allow the user to select different types or categories of ads. Other marketing and demographic questions might also be asked. These questions serve several purposes. One purpose is to help the provider of the remote search better understand the subscribers. Another purpose is to help the subscribers and the remote search 108 select the best advertising for the site when the free service is used.
Returning to Figure 2, the remaining steps of the process can operate in parallel. The remote search 108 will begin to index the web site at step 204. This is discussed in greater detail in conjunction with Figure 11.
Meanwhile, the subscriber 100 can be provided with search code to link to the remote search 108 at step 206. Table 1 includes representative examples of HTML versions of the search code for inclusion on the web site of the subscriber 100. This makes adding search capabilities to a web site as simple as adding a link in HTML.
Table 1 (representative HTML versions of the search code; the table images are not reproduced in this text)
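Because the contents of Table 1 are not reproduced here, the following is only a hedged sketch of how a remote search might generate a search code for a subscriber. The base address `http://search.example.com/form`, the `site` query parameter, and the function name `make_search_code` are hypothetical and do not come from the table.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical address of the search form served by the remote search.
REMOTE_SEARCH_BASE = "http://search.example.com/form"


def make_search_code(site_uri: str, label: str = "Search this site") -> str:
    """Build an HTML link that requests the search form for one web site."""
    query = urlencode({"site": site_uri})          # percent-encodes the site URI
    href = f"{REMOTE_SEARCH_BASE}?{query}"
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


if __name__ == "__main__":
    print(make_search_code("http://www.ventanadelmar.org/"))
```

The subscriber would paste the returned link into a web page such as the web page 112A. Nothing else needs to be installed at the service provider.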
In some embodiments, the search code is provided as part of an electronic mail message to the subscriber 100 with instructions for adding the search code to a web page. In other embodiments, a revised version of the home page, e.g. the web page 112A, is provided to the subscriber by electronic mail with the search code included. Other embodiments use other techniques for communicating the link (e.g. posting it on the remote search 108).
The electronic mail message with the search code may contain hypertext links to instructions for including the search code on a web page (e.g. the web pages 112A-B). Once the search code is included on the web page (e.g. the web page 112A) and the modified web page is uploaded to the service provider 102, the web site is search enabled. Figure 6 shows the web site of the subscriber 100 after it is search enabled with a link, such as the search button 600, to a search form page.
Finally, at step 208, the subscriber 100 can modify the appearance of the search form page and the search results page to better match the style and look of her/his web site. This can also be directly accessed through the subscriber UI 114 when a subscriber (e.g. the subscriber 100) provides her/his information in area 400. Typical options for customizing the appearance of the search form page and search results page include: specifying a title, options for providing the URI of a banner image, options for providing the URI of a logo image, options for selecting colors for page elements, options for providing the URI of a background image, and/or other options. These options allow the subscriber 100 to blend the appearance of the pages provided by the search UI 116 to visitors with the appearance of the web site. In some embodiments, colors for the search form page and search results page are automatically selected based on color selections in the home page of the web site of the subscriber 100. For example, if the subscriber has a black background with yellow text on her/his home page, then the remote search could automatically provide those colors as a default option for the subscriber 100.
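One way the automatic color selection might work is to read the color attributes of the home page's body tag. The sketch below assumes the home page declares its colors with the HTML `bgcolor`, `text`, and `link` attributes. The class and function names are hypothetical, and a real page could declare its colors in other ways that this sketch does not handle.

```python
from html.parser import HTMLParser


class BodyColorParser(HTMLParser):
    """Collect color attributes from the first <body> tag that declares any."""

    def __init__(self):
        super().__init__()
        self.colors = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "body" and not self.colors:
            attr_map = dict(attrs)
            for name in ("bgcolor", "text", "link"):
                if attr_map.get(name):
                    self.colors[name] = attr_map[name]


def default_colors(home_page_html: str) -> dict:
    """Return default colors to offer the subscriber, e.g. {'bgcolor': 'black'}."""
    parser = BodyColorParser()
    parser.feed(home_page_html)
    return parser.colors
```

For a home page with a black background and yellow text, the returned mapping would hold those two values, which the subscriber UI 114 could then present as the default option.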
In some embodiments, the subscriber 100 can select from several different layouts for the search form and search results. In other embodiments, the subscriber 100 can design a custom layout for the search form and the search results. These layouts can control which elements appear on the search form and the search results and where those elements appear. The customization process can be performed using the subscriber UI 114.
C. Subscriber Options
Once subscribed, the subscriber UI 114 also provides several options to subscribers (e.g. the subscriber 100) for managing the search functionality.
Status information informs the subscriber 100 about when her/his web site was last indexed and/or other information, e.g. when it will next be indexed, how many pages were in the site, etc. Maintenance options allow the subscriber 100 to manually request that her/his web site be re-indexed. The subscriber 100 can update the appearance options for her/his search form page and search results page at any time as well.
A number of reporting options are available that provide important information to the subscriber 100. In some embodiments, the subscriber 100 can request a list of the most popular searches for a given time period, e.g. last month, last quarter, etc. This lets the subscriber 100 understand what visitors (e.g. the visitors 104A-B) are interested in finding on her/his web site and/or what the visitors are having difficulty finding on her/his web site. For example, if "driving directions" is the most common search, the subscriber 100 could modify her/his web site to make links to driving directions more prominent. Other embodiments of the invention provide reports on the most frequent users of the search function in a given time period. This allows the subscriber 100 to understand who is searching their web site, e.g. users from America Online™.
Another type of report provided by some embodiments of the invention is a summary of searches that returned no results. This allows the subscriber 100 to better understand what visitors were looking for and perhaps modify web pages or extend her/his web site to include the information. For example, if visitors were frequently searching for "prices", the subscriber 100 could extend her/his web site to include the basic pricing for her/his services.
In some embodiments, there are additional reporting options available to subscribers. For example, one embodiment of the invention allows subscribers to access the raw search data comprised of search terms and result information. Still other embodiments allow statistics from the remote search 108 to be viewed on a web page at a subscriber's web site. Also, some embodiments provide click thru information to the subscriber. Click thru information tells the subscriber which pages in the results were most often clicked on by visitors.
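The popular-search and no-result reports described above could be computed from a log of searches along the following lines. This is a sketch under assumptions: the log format (a list of dictionaries with `terms` and `result_count` keys) and the function names are hypothetical.

```python
from collections import Counter


def normalize(terms: str) -> str:
    """Treat searches that differ only in case or spacing as the same search."""
    return " ".join(terms.lower().split())


def popular_searches(log, top_n=3):
    """Most frequent searches in the log, as (terms, count) pairs."""
    counts = Counter(normalize(entry["terms"]) for entry in log)
    return counts.most_common(top_n)


def no_result_searches(log):
    """Searches that returned no results, most frequent first."""
    counts = Counter(
        normalize(entry["terms"]) for entry in log if entry["result_count"] == 0
    )
    return counts.most_common()
```

Restricting the log to a time period such as last month or last quarter before calling these functions would give the per-period reports.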
D. Searching the Web Site
Figure 7 is a process flow diagram for searching a web site according to some embodiments of the invention. This could be used by visitors (e.g. the visitors 104A-B) to search the web site of the subscriber 100. First, at step 700, a visitor (e.g. the visitor 104A) of the web site of the subscriber 100 selects the search button on a web page (e.g. the web page 112A). For example, the visitor 104A could click on the search button 600 of Figure 6.
Next, at step 702, the remote search 108 provides a search form page via the search UI 116. For example, the search form page might look like the search form page of Figure 8. This could be the search page reached when a visitor clicks on a link provided by the search code such as the search button 600. In this example, the search form page includes a logo 804 selected by the subscriber 100 and a subscriber selected title 806, e.g. "Search Page". Next, at step 704, the visitor 104A enters her/his search terms into the search form page. For example, the user could type "gondola" into the area 800 and signal on the search button 802. Additional options can be provided to allow for help with searching and using more advanced search techniques, e.g. using date ranges, changing sort orders, and/or other options. In this example, the free service of the remote search 108 is shown in Figure 8. As such, advertising appears on the search form page. Using the paid service, the advertising above the logo 804 would be omitted and/or replaced with subscriber selected advertising.
The search terms can be more complex than keywords: visitors can search for documents modified since a specific date and/or construct boolean search expressions. In some embodiments, the search code includes a hyperlink to a "What's New" query that could be displayed alongside the search button 600. When the hyperlink for the "What's New" query is selected, the remote search 108 can display all documents modified within a predetermined period, e.g. the last 30 days. In some embodiments, the predetermined period is selected by the subscriber.
Returning to Figure 7, at step 706, the remote search 108 provides the search results page to the visitor. The search results page includes hyperlinks, which the visitor can click on, to pages containing the search terms. When the visitor clicks on a hyperlink, the visitor will be shown the corresponding page. In some embodiments, the visitor can enter a new search directly into the search results page.
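The "What's New" query described above amounts to a filter on each document's last-modified date. A minimal sketch follows, assuming each indexed document is represented as a dictionary with a `last_modified` datetime. That representation and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def whats_new(documents, now: datetime, period_days: int = 30):
    """Return documents modified within the last period_days, newest first."""
    cutoff = now - timedelta(days=period_days)
    recent = [doc for doc in documents if doc["last_modified"] >= cutoff]
    return sorted(recent, key=lambda doc: doc["last_modified"], reverse=True)
```

Here `period_days` stands in for the predetermined period, which in some embodiments would be the value selected by the subscriber.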
For example, the search for "gondola" resulted in the search results page shown in Figure 9. Again, the subscriber provided logo 804 and title 806 can appear. The results can appear in context using one entry (e.g. the entry 900) for each matching page. A score 902 may be shown for each document to indicate how highly the document ranked relative to others with the search terms. Additionally, the HTML title of the document may be shown as a link to the document 904. A description 906 of the document may follow along with an indication of the date the document was last modified 908. Area 910 allows a visitor to submit an additional search directly from the search results page. If appropriate, advertising may appear on the search results page.
E. Remote Search System Setup
Figure 10 illustrates a back end system used by some embodiments of the invention. This could be used to provide a highly distributed implementation of the remote search 108. In this embodiment, subscribers (e.g. the subscriber 100), visitors (e.g. the visitor 104A), and server administrators (e.g. the administrator 1000) access the remote search 108 through a director 1002. The director 1002 might include an IP traffic director such as the Cisco DistributedDirector, from Cisco Systems, Inc., San Jose, California. This provides traffic distribution between geographically dispersed sites and allows the remote search 108 to be geographically distributed with automatic load balancing.
Then additional local directors 1004-1008 may be used to further distribute the different functions of the remote search 108. A Cisco LocalDirector, from Cisco Systems, Inc., San Jose, California, may be used as the local directors 1004-1008. Within a particular geographic subsystem of the remote search 108, the local directors 1004-1008 balance loads across servers performing the same tasks.
The local director 1004 balances loads across the computers providing the subscriber UI 114. The local director 1006 balances loads across the computers providing the search UI 116A-C. The local director 1008 balances loads across the computers providing an administrator UI 1010 to the remote search 108. The distributed local subsystems are coupled to the search system 118. The local directors 1004-1008 also provide fail-over capabilities.
According to some embodiments of the invention, each search UI 116A-C providing live searching to visitors has a local copy of the current index 1018 separate from the search system 118. This improves performance and reliability. For example, two of the search UI 116A-C can be providing active searches while another is being loaded with the most current indices. Once the new indices are verified, the inactive search UI is brought into active service with the new index. Then, one of the other search UIs is made inactive.
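The rotation of index copies among the search UI computers can be illustrated with a small sketch. The `SearchUI` record, the integer index versions, and the `verify` callback are assumptions made for illustration. The text does not specify how verification is performed or which active computer is retired.

```python
from dataclasses import dataclass


@dataclass
class SearchUI:
    name: str
    index_version: int
    active: bool


def rotate_index(servers, new_version, verify) -> bool:
    """Load new_version on the inactive server and swap it in if it verifies."""
    # Assumes exactly one server is currently inactive (the standby).
    standby = next(s for s in servers if not s.active)
    standby.index_version = new_version       # load a copy of the newest index
    if not verify(standby):
        return False                          # keep serving from the old copies
    standby.active = True
    # Assumption: retire the active server holding the oldest index copy,
    # so it can be loaded with the new index on the next rotation.
    others = [s for s in servers if s.active and s is not standby]
    oldest = min(others, key=lambda s: s.index_version)
    oldest.active = False
    return True
```

Calling `rotate_index` repeatedly would move every search UI onto the new index one computer at a time while two remain available to visitors.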
Some embodiments of the invention do not include either the director 1002 or the local directors 1004-1008, others include only some of the local directors 1004-1008, based on what sort of load balancing features are desired by the operator of the remote search 108.
The search system 118 includes spiders and database engines 1022. The search system 118 also includes user interface data 1014, sites 1016, a database 1020, and an index 1018. Additionally, a file system such as the file system 1024 may be coupled to the search system 118. The file system 1024 can be used to store web pages and other information for the remote search 108. The file system 1024 can be accessed by the subscriber UI 114, the search UI 116, and the administrator UI 1010 as appropriate. The UI data 1014 includes the appearance customization provided by subscribers (e.g. the subscriber 100) and is used by the search UI 116A-C to generate the search form page and search results page according to subscriber preferences. In some embodiments, the UI data 1014 is stored in a database such as the database 1020. In other embodiments it is kept in a separate location.
The sites 1016 is a list of the sites to be indexed. In some embodiments, the sites 1016 is included as a table within the database 1020. In other embodiments, the sites 1016 is kept in a separate location. In some embodiments, the sites 1016 includes a list of uniform resource identifiers (URIs) for sites that are indexed. The sites 1016 may also include other information such as type of content, contact information, meta-data about the web site, subscription information including payment information, and/or other information. For example, the site 1100A might correspond to the homeowners association web site and include the URI of the web site: "http://www.ventanadelmar.org".
The index 1018 is an index of web pages. Each index 1018 can include the search results for multiple web sites in the sites 1016. After the index 1018 is updated and verified, it can be transferred to one of the computers serving as the search UI 116A-C. This provides a high degree of reliability and reduces contention for access to the index 1018 because only the spiders in the search system 118 directly access the index 1018. The search UI 116A-C can access distinct copies. The database 1020 is used to maintain state information by the various spiders. This supports a highly parallel and highly distributed process for indexing subscriber web sites as described in conjunction with Figure 11. The database engines allow the spiders to access the database 1020 as needed. In a typical embodiment, an SQL database is used as the database 1020.
F. Remote Search System Indexing Process
Figure 11 is a process flow diagram for indexing a web site according to some embodiments of the invention. The process is designed to be highly distributed and thus be capable of operating in a highly parallel fashion as well. Each of the steps can occur simultaneously on appropriate data. Thus, while the dispatcher spider 1102 is operating on the sites 1016, the index spider 1114 can be adding to the index 1018. For clarity, the process will be described from start to finish for a single web page on a single web site.
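Because the spiders coordinate through state kept in the database 1020, each one can be thought of as a worker that picks up pages in one state and leaves them in the next. The sketch below illustrates that idea with an in-memory SQLite table. The table layout, the status names, and the function names are hypothetical and are not taken from the described system.

```python
import sqlite3


def open_page_table():
    """Create an in-memory stand-in for the page table in the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page (uri TEXT PRIMARY KEY, status TEXT NOT NULL)")
    return db


def add_page(db, uri):
    """Dispatcher step: register a page once, in the initial 'new' state."""
    db.execute("INSERT OR IGNORE INTO page (uri, status) VALUES (?, 'new')", (uri,))


def run_stage(db, from_status, to_status, work):
    """Generic spider step: process every page waiting in from_status.

    work(uri) returns True to advance the page, False to drop it.
    Returns the number of pages examined.
    """
    rows = db.execute("SELECT uri FROM page WHERE status = ?", (from_status,)).fetchall()
    for (uri,) in rows:
        new_status = to_status if work(uri) else "skipped"
        db.execute("UPDATE page SET status = ? WHERE uri = ?", (new_status, uri))
    return len(rows)
```

Since each stage only looks at pages in its own input state, separate stages can run at the same time on different pages, which is the property the process relies on.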
The dispatcher spider 1102 reads the address of a web site from the sites 1016, e.g. the site 1100A. The dispatcher then adds the appropriate pages to the database 1020 in the page table. For example, consider how the dispatcher might operate on the site 1100A, "http://www.ventanadelmar.org/". The first step might be to add pages 1101A-C to the database 1020 for standard web page locations, e.g. variations of "index.html", "index.shtml", "default.htm", etc. So for example, the page 1101A might be "http://www.ventanadelmar.org/index.html". Other embodiments of the invention first add the site URI, e.g. "http://www.ventanadelmar.org/" as a page, e.g. the page 1101A. If the web site does not automatically provide the default page, then the technique described above of adding default page names to the site URI can be used.
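The dispatcher's choice of starting pages can be sketched as follows. The default page names are the examples given in the text. The function name and the `site_serves_default` flag are hypothetical, and how the dispatcher learns whether a site serves a default page is not specified here.

```python
from urllib.parse import urljoin

# Standard web page locations named as examples in the text.
DEFAULT_PAGE_NAMES = ("index.html", "index.shtml", "default.htm")


def initial_pages(site_uri: str, site_serves_default: bool = True):
    """Return the first page URIs the dispatcher would add for a site."""
    if site_serves_default:
        return [site_uri]                      # the site URI itself is the first page
    base = site_uri if site_uri.endswith("/") else site_uri + "/"
    return [urljoin(base, name) for name in DEFAULT_PAGE_NAMES]
```

For the site 1100A, the fallback case would produce "http://www.ventanadelmar.org/index.html" as its first candidate page.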
The frequency with which a particular web site, e.g. the site 1100C, is re-indexed may depend on system rules, e.g. once every twenty-four hours automatically, and subscriber requests, e.g. index my web site now.
The pre-filter spider 1104 verifies that the page (e.g. 1101A) should be indexed by testing the page against some rules. Typical rules may include limiting the index to pages no more than n levels of links deep and limiting the index to pages within the same web tree, e.g. within the "www.ventanadelmar.org/" web space.
Also, if available, a "robots.txt", or equivalent file, for robots associated with the web site can be considered at the pre-filtering stage. The robots.txt file is used as part of the robot exclusion standard for describing the pages that should not be indexed by spiders and search engines. Also, the pre-filter spider 1104 may use certain rules based on the multi-purpose Internet mail extensions (MIME) type of a page (e.g. the page 1101A) and/or the file extension (e.g. ".html"). Pages that should be indexed can be flagged in the database for the retrieve head spider 1106.
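A minimal sketch of such pre-filter rules follows, assuming the robots.txt content has already been fetched and is passed in as a list of lines. The function names, the default depth limit of 5, and the list of allowed extensions are illustrative assumptions; `urllib.robotparser` is the Python standard-library parser for the robot exclusion standard.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def within_tree(page_url, site_root):
    """True if page_url lies on the same host and under the same path as site_root."""
    page, root = urlparse(page_url), urlparse(site_root)
    return (page.netloc.lower() == root.netloc.lower()
            and page.path.startswith(root.path))

def prefilter(page_url, depth, site_root, robots_lines, max_depth=5,
              allowed_ext=(".html", ".htm", ".shtml", "/"), agent="*"):
    """Apply the depth, web-tree, robots.txt and file-extension rules in turn."""
    if depth > max_depth or not within_tree(page_url, site_root):
        return False
    robots = RobotFileParser()
    robots.parse(robots_lines)            # robots.txt content, already fetched
    if not robots.can_fetch(agent, page_url):
        return False
    path = urlparse(page_url).path or "/"
    return path.endswith(allowed_ext)
```

A page that passes every rule would then be flagged in the database for the next spider; a MIME-type rule could be added at the stage where response headers are available.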
The retrieve head spider 1106 retrieves the header portion of web pages marked for indexing in the database 1020. Here, the retrieve head spider 1106 is retrieving the web page 112A from the service provider 102. The header can be retrieved separately from the body to save bandwidth and processing time. The header information can be stored in the database 1020 for access by the post-filter spider 1108.
The post-filter spider 1108 analyzes the header information to further determine if the document should be indexed, or re-indexed. For example, if the last modified date has not changed from the date of the document as it currently appears in the index, then the web page can be skipped. Otherwise, the page is marked in the database for retrieval. Other rules can exclude certain types of documents, e.g. image files, or documents below a certain size, e.g. documents under 1 kB.
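The header-based decision can be sketched as below. The function name, the dictionary representation of the headers, and the default thresholds are assumptions for illustration; the three rules mirror the examples in the text (document type, document size, and last modified date).

```python
from email.utils import parsedate_to_datetime

def should_retrieve(headers, indexed_last_modified=None, min_bytes=1024,
                    excluded_prefixes=("image/",)):
    """Decide from HTTP response headers whether the body is worth fetching.

    headers: dict mapping lower-case header names to values.
    indexed_last_modified: Last-Modified value stored at the previous index run.
    """
    ctype = headers.get("content-type", "").lower()
    if ctype.startswith(excluded_prefixes):
        return False                      # e.g. image files
    length = headers.get("content-length")
    if length is not None and int(length) < min_bytes:
        return False                      # e.g. documents under 1 kB
    modified = headers.get("last-modified")
    if modified and indexed_last_modified:
        if parsedate_to_datetime(modified) <= parsedate_to_datetime(indexed_last_modified):
            return False                  # unchanged since it was last indexed
    return True
```

Because only headers are inspected, this check can run on the result of a header-only request, which is the bandwidth saving the retrieve head spider is described as providing.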
The retrieve body spider 1110 retrieves the web pages marked by the post-filter spider 1108. Here, the retrieve body spider 1110 retrieves the body of the web page 112A from the service provider 102. The body can be stored in the database 1020 or a queue pending further processing.
The analyzer spider 1112 analyzes the retrieved web pages. Additional pages may be added to the database 1020 as a result of the analysis. The analyzer spider 1112 can also extract the title of the page and generate a checksum for the contents. If the checksum is computed based on a normalized version of the retrieved page, the checksum will remain constant irrespective of minor changes to advertising banners, etc. This allows an additional determination to be made as to whether or not the web page has changed and should be re-indexed. Additionally, the analyzer spider 1112 can identify hyperlinks to new documents and add those documents to the pages 1101A-C for processing by the spiders. As appropriate, a META tag corresponding to directives for robots for each web page can be used to control the analysis process. For example, <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> in the body of a web page might be used to direct the analyzer spider 1112 to not mark the web page for indexing by the indexer spider 1114. Also, because "NOFOLLOW" is indicated, this might direct the analyzer spider 1112 not to add additional web pages to the pages 1101A-C for hyperlinks in the web page.
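The analysis step can be sketched with the Python standard library as follows. The class and function names are hypothetical, and the normalization shown (dropping scripts, styles, and all markup, then hashing only the visible text) is one assumed choice; the specification does not say how the page is normalized. Under this choice, a change to a banner image's URL leaves the checksum unchanged, while a change to visible text does not.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageAnalyzer(HTMLParser):
    """Collect the title, outgoing links and robots directives of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self.directives = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower()
                                for d in content.split(",") if d.strip()}

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def normalized_checksum(html):
    """Checksum over visible text only (an assumed normalization)."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)       # drop remaining tags
    text = " ".join(text.split()).lower()          # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def analyze(base_url, html):
    """Return (title, checksum, should_index, links_to_follow)."""
    parser = PageAnalyzer(base_url)
    parser.feed(html)
    follow = [] if "nofollow" in parser.directives else parser.links
    return (parser.title.strip(), normalized_checksum(html),
            "noindex" not in parser.directives, follow)
```

The returned links would be added to the page table for the earlier spiders to process, and the checksum compared with the stored value to decide whether re-indexing is needed.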
The indexer spider 1114 retrieves the body content from the queue and indexes it in the index 1018.
Periodically, after a sweep through the sites 1100A-C is completed and the index 1018 has been verified, the index 1018 is rolled out to the search UI 116. Some embodiments of the invention use this approach to ensure high availability of the indexes by reducing contention between spidering processes and visitor searches.
G. Alternative Embodiments
In some embodiments, collections of data other than web sites are indexed. For example, an electronic collection of documents stored on a file system could be indexed by some embodiments of the invention. Also, indexes could be generated for net news articles, electronic mail archives, and/or the contents of a database. Most generally, any electronic data collection could be remotely searched using embodiments of the current invention.
In some embodiments, the HTTP referrer field is used by the remote search 108 to match the search service with the search site. For example, in some embodiments, the referrer field is used as secondary confirmation that the site id requested matches the referring site. If "http://www.example.com/" is indexed by the remote search with id 12345 and "http://www.company.com/" is indexed with id 12346, then the referrer field could act as a double check on the site id. For instance, if the subscriber at "http://www.example.com/" modifies the search code so that the id 12346 is referenced, then the remote search code can respond with a configuration error if the referrer and the site id do not match. Here, the id 12346 goes with referrers from "http://www.company.com/", so visitors from "http://www.example.com/" would see an error message. In other embodiments, the referrer would override the provided site id and the search form for "http://www.example.com/" would be provided.
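The two referrer-handling behaviors can be sketched as a single function. The registry dictionary, the function name, and the `override` flag are hypothetical; the site ids and hosts are the ones used in the example above.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping site ids to the host each id was issued for.
SITES = {12345: "www.example.com", 12346: "www.company.com"}

def resolve_site(site_id, referrer, override=False, sites=SITES):
    """Return the site id to serve, or raise if id and referrer disagree.

    override=False models the embodiments that report a configuration error;
    override=True models those in which the referrer wins over the supplied id.
    """
    host = (urlparse(referrer).hostname or "").lower()
    if sites.get(site_id) == host:
        return site_id
    if override:
        for known_id, known_host in sites.items():
            if known_host == host:
                return known_id
    raise ValueError("configuration error: site id does not match referrer")
```

With `override=True`, a request carrying id 12346 from "http://www.example.com/" resolves to 12345, so the search form for "http://www.example.com/" would be served.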
In some embodiments, the web browser itself could be used as the search form. For example, the search code could be a reference to a plug-in and/or a Java applet that provides the search form. Other embodiments allow the location area of the web browser to be used as the search form, e.g. instead of typing a URI in the location area, a visitor types her/his search terms and presses enter after clicking on a link provided by the search code.
In some embodiments, the remote search 108 is included in one or more computer usable media such as CD-ROMs, floppy disks, a hard disk installed on a computer and/or other media.
Some embodiments of the invention are included in an electromagnetic wave form. The electromagnetic wave form comprises information such as the remote search 108 and/or the search code. For example, the subscriber UI 114 might be accessed by a subscriber 100 over a network.
H. Conclusion
The foregoing description of various embodiments of the invention has been presented for purposes of illustration and description. It is not intended to limit the invention to the precise forms disclosed. Many modifications and equivalent arrangements will be apparent.

Claims

What is claimed is:
1. A system comprising: a first computer including a web site, the web site including a code for requesting a search form for the web site from a third computer, the code including a hypertext link to the search form on the third computer; a second computer having a first program to access the code from the web site and to access the search form on the third computer using the link and the first program to provide at least one search term in the search form; and the third computer including a second program to search the web site using the at least one search term and returning a result to the second computer, the third computer being different than the first computer.
2. The system of claim 1, wherein the third computer supports searches of a plurality of web sites including the web site.
3. The system of claim 1, wherein the result returned to the second computer includes links to portions of the web site including the at least one search term.
4. The system of claim 1, wherein the first computer is a server computer for an Internet service provider and the third computer is operated by a company different than the Internet service provider.
5. The system of claim 1, wherein the first computer provides one or more of a color, a logo, and a title to the third computer and the third computer uses the one or more of the color, the logo, and the title in the search form.
6. The system of claim 1, wherein the first computer provides a custom layout to the third computer and the third computer uses the custom layout for the search form.
7. The system of claim 1, wherein the first computer signals a selection of one of a free service and a paid service to the third computer.
8. The system of claim 7, wherein the third computer supplies an advertisement with the search form responsive to the first computer selecting the free service.
9. The system of claim 7, wherein the third computer supplies an advertisement with the result responsive to the first computer selecting the free service.
10. The system of claim 7, wherein the third computer degrades the first computer selection to the free service from the paid service when payment for the paid service is not received by the third computer within a predetermined period.
11. The system of claim 1, wherein the web site comprises an electronic data collection.
12. The system of claim 1, wherein the third computer includes an index of the web site, the index updated periodically.
13. A method of providing a remotely hosted search of a web site for a subscriber using a computer, the computer operated independently of the subscriber, the web site hosted on a second computer operated independently of the subscriber, the method comprising: receiving a subscription request from the subscriber on the computer, the subscription request comprising a web site identifier corresponding to the web site and a method of contact for the subscriber; responding to the subscription request by sending a message to the subscriber using the method of contact, the message including a code for requesting a search form for the web site from the computer; and preparing an index of the web site using the computer.
14. The method of claim 13, wherein the subscription request includes a selection of one of a free service and a paid service.
15. The method of claim 13, further comprising receiving an appearance description from the subscriber on the computer, the appearance description specifying one or more of a color, a logo, and a title for the search form.
16. The method of claim 13, further comprising: receiving a report request from the subscriber on the computer; and responding to the report request with a document reporting on a plurality of searches previously performed of the web site.
17. The method of claim 16, wherein the document reporting on a plurality of searches previously performed of the web site includes a list of searches performed ordered by the number of times same search terms were used.
18. The method of claim 16, wherein the document reporting on a plurality of searches previously performed of the web site includes a list of addresses corresponding to a computer used for each search.
19. The method of claim 16, wherein the document reporting on a plurality of searches previously performed of the web site includes a list of search terms used that produced no results.
20. A method of providing a remotely hosted search of a web site to a visitor of the web site using a computer: presenting the visitor to the web site with a search form from the computer; receiving at least one search term from the visitor in response to the search form; providing a result to the visitor using the computer, the result providing links to portions of the web site with the at least one search term.
21. The method of claim 20, wherein at least one of the search form and the result include advertising.
22. The method of claim 20, wherein the search form matches the appearance of the web site.
23. A system comprising: a plurality of web sites locally searchable from a remote search site; the remote search site comprising an index of the plurality of web sites, one or more computer programs for generating the index of the plurality of web sites, a search form generator displaying a customized search form for respective web sites in the plurality of web sites, the customized search form capable of receiving one or more search terms and sending a message including the one or more search terms to a search result generator, the search result generator for displaying a customized search result for respective web sites in the plurality of web sites responsive to the message.
24. The system of claim 23, wherein the one or more computer programs for generating the index of the plurality of web sites operate in parallel using a database to control execution of each of the one or more computer programs.
25. The system of claim 23, wherein the one or more computer programs includes an analyzer spider.
26. A method of providing a service over an Internet with a paid level of service and a free level of service for a subscriber using a computer, the method comprising: receiving a selected level of service from the subscriber over the Internet, the selection corresponding to one of the paid level of service or the free level of service; providing the service at the selected level; and degrading the service to the free level of service when payment for the paid level of service is not received by the computer.
27. The method of claim 26, wherein the service is a remotely hosted local search for web sites.
28. An apparatus comprising: means for performing a search of a web site using at least one search term and returning a result; means for providing a web site, the web site including a code for requesting a search form for the web site from the means for performing, the means for providing distinct from the means for performing; and means for accessing the code from the web site, accessing the search form on the means for performing using a link provided by the code and providing the at least one search term in the search form.
29. The apparatus of claim 28, further comprising means for displaying search statistics on the web site.
30. The apparatus of claim 28, further comprising means for customizing a layout of the search form.
31. A computer data signal embodied in a carrier wave comprising: a computer program for providing remotely hosted local search for a web site for a computer over a network, the computer program comprising a first set of instructions for providing a search form for the web site to the computer over the network; a second set of instructions for receiving one or more search terms from the computer over the network; a third set of instructions for performing a search of the web site using the one or more search terms; a fourth set of instructions for providing a result of the search to the computer over the network.
32. The computer program of claim 31, wherein the third set of instructions comprises using an index of the web site to search portions of the web site including the one or more search terms.
33. The computer program of claim 31, further comprising a fifth set of instructions for generating an index of the web site.
34. The computer program of claim 31, wherein the result of the search includes a plurality of links to portions of the web site including the one or more search terms.
35. A method of providing a remotely hosted search of a web site for a subscriber using a computer, the computer operated independently of the subscriber, the web site hosted on a second computer operated independently of the subscriber, the method comprising: receiving a subscription request from the subscriber on the computer, the subscription request comprising a web site identifier corresponding to the web site and a method of contact for the subscriber; responding to the subscription request by sending a message to the subscriber using the method of contact, the message including a code for requesting a search form for the web site from the computer; preparing an index of the web site using the computer; receiving a report request from the subscriber on the computer; and responding to the report request with a report on a plurality of searches previously performed of the web site.
36. The method of claim 35, wherein the report includes a raw search data, the raw search data including a search terms and a result information for each of the plurality of searches.
37. The method of claim 35, wherein the report is in a format selected from the group extensible markup language (XML) format, comma separated value (CSV) format, and hypertext markup language (HTML) format.
38. The method of claim 35, wherein the report is in a graphics format.
39. The method of claim 35, wherein the report includes data indicating which portions of the web site were selected in response to searches.
40. The method of claim 35, wherein the report is includable as part of the web site.
41. The method of claim 35, wherein the report includes at least one link, the at least one link corresponding to a report request for further detail about an item in the document.
42. The method of claim 35, wherein the web site is comprised of pages, and the report includes a representation of pages in the web site that are searchable.
43. The method of claim 42, wherein the representation comprises a list of pages in the web site that are searchable.
44. The method of claim 35, wherein the report includes a list of references found on the web site that refer to non-existent resources.
45. The method of claim 35, further comprising prior to receiving the report request, automatically sending at least a portion of a report by an electronic mail to the subscriber, the electronic mail including a hyperlink to transmit the report request.
46. The method of claim 35, wherein a second web site is remotely searchable using the computer, and wherein the report request indicates that the report should include information on the web site and the second web site, and wherein the responding to the report request comprises responding with a report on a plurality of searches previously performed of the web site and the second web site.
47. A method of providing a remotely hosted search reporting for a search engine for a subscriber using a computer, the computer operated independently of the subscriber, the search engine hosted on a second computer operated independently of the computer, the method comprising: receiving a subscription request from the subscriber on the computer, the subscription request comprising a method of obtaining search data and a method of contact for the subscriber; responding to the subscription request by sending a message to the subscriber using the method of contact, the message including a code for requesting reports from the computer; receiving a report request from the subscriber on the computer; and responding to the report request with a report on a plurality of searches previously performed using the search engine.
48. The method of claim 47, wherein the second computer is a cluster of one or more computers providing a search engine functionality operated by a first company and the computer is operated by a company other than the first company.
49. The method of claim 47, wherein the report includes a list of searches performed ordered by the number of times same search terms were used.
50. The method of claim 47, wherein the report includes a list of addresses corresponding to a computer used for each search.
51. The method of claim 47, wherein the report includes a list of search terms used that produced no results.
52. The method of claim 47, wherein the report includes data indicating which results were selected in response to searches.
53. The method of claim 47, wherein the report includes at least one link, the at least one link corresponding to a report request for further detail about an item in the document.
54. The method of claim 47, wherein the method of obtaining search data comprises a reference, the reference indicating where to obtain search data.
55. The method of claim 47, wherein the method of obtaining search data comprises a uniform resource identifier (URI) for a file containing search data.
56. The method of claim 47, wherein the method of obtaining search data comprises a request for the computer to directly receive search data, and wherein the responding further includes providing a reference to a resource on the computer where search data should be transmitted by the second computer.
57. An apparatus for providing a remotely hosted search reporting for a search engine for a subscriber using a computer, the computer operated independently of the subscriber, the search engine hosted on a second computer operated independently of the computer, the method comprising: means for receiving a subscription request from the subscriber on the computer, the subscription request comprising a method of obtaining search data and a method of contact for the subscriber; means for responding to the subscription request by sending a message to the subscriber using the method of contact, the message including a code for requesting reports from the computer; means for receiving a report request from the subscriber on the computer; and means for responding to the report request with a report on a plurality of searches previously performed using the search engine.
PCT/US2000/018826 1999-07-13 2000-07-10 Method and apparatus for providing localized searching WO2001004783A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU60848/00A AU6084800A (en) 1999-07-13 2000-07-10 Method and apparatus for providing localized searching

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US35224799A 1999-07-13 1999-07-13
US09/352,247 1999-07-13
US52482100A 2000-03-14 2000-03-14
US09/524,821 2000-03-14

Publications (2)

Publication Number Publication Date
WO2001004783A2 (en) 2001-01-18
WO2001004783A3 (en) 2002-11-28

Family

ID=26997453

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/018826 WO2001004783A2 (en) 1999-07-13 2000-07-10 Method and apparatus for providing localized searching

Country Status (2)

Country Link
AU (1) AU6084800A (en)
WO (1) WO2001004783A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999015995A1 (en) * 1997-09-23 1999-04-01 Information Architects Corporation System for indexing and displaying requested data having heterogeneous content and representation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BAGER J: "NAVIGATIONSHILFEN" CT MAGAZIN FUER COMPUTER TECHNIK, VERLAG HEINZ HEISE GMBH., HANNOVER, DE, no. 13, 21 June 1999 (1999-06-21), pages 116-118,120-121, XP000828973 ISSN: 0724-8679 *
LAWRENCE S ET AL: "Inquirus, the NECI meta search engine" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING. AMSTERDAM, NL, vol. 30, no. 1-7, 1 April 1998 (1998-04-01), pages 95-105, XP004121436 ISSN: 0169-7552 *
WEN-SYAN LI ET AL: "WebDB: a Web query system and its modeling, language, and implementation" RESEARCH AND TECHNOLOGY ADVANCES IN DIGITAL LIBRARIES, 1998. ADL 98. PROCEEDINGS. IEEE INTERNATIONAL FORUM ON SANTA BARBARA, CA, USA 22-24 APRIL 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 22 April 1998 (1998-04-22), pages 216-227, XP010276893 ISBN: 0-8186-8464-X *

Also Published As

Publication number Publication date
AU6084800A (en) 2001-01-30
WO2001004783A3 (en) 2002-11-28

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69(1) EPC .EPO FORM 1205A

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP