WO2003073324A1 - Systems and methods for indexing data in a network environment - Google Patents

Systems and methods for indexing data in a network environment

Info

Publication number
WO2003073324A1
Authority
WO
WIPO (PCT)
Prior art keywords
indexing
data
index server
network
resources
Prior art date
Application number
PCT/US2002/006178
Other languages
English (en)
Inventor
Brian Mervyn Morrow
Michael Martin Gorlick
Arthur Hughes Muir, Iii
Original Assignee
Endeavors Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Endeavors Technology, Inc. filed Critical Endeavors Technology, Inc.
Priority to PCT/US2002/006178 priority Critical patent/WO2003073324A1/fr
Priority to AU2002252155A priority patent/AU2002252155A1/en
Publication of WO2003073324A1 publication Critical patent/WO2003073324A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation

Definitions

  • the present invention relates generally to systems and methods for indexing resources of electronic devices, and more particularly to systems and methods for indexing, searching, and/or sharing data, files, and/or other resources stored on a plurality of computing devices connected to a network.
  • Search engines are well known tools for finding information accessible via a wide area network, such as the Internet or World Wide Web. These search engines facilitate indexing and searching for data scattered over a large number of servers or other devices connected to the network. Such large scale web search engines generally require a crawler, an indexer, and a query engine.
  • a crawler is an application that "crawls" across the "web," following links and fetching web pages for the indexer.
  • the crawler, similar to a browser, may contact web servers or other devices connected to the network to access the web pages available on the servers.
  • the crawler may extract information from the web pages, e.g., words or phrases extracted from content of the web pages, metatags embedded within the web pages, such as HTML markups, inferences made from the link structure of the web pages (outgoing and/or incoming), and the like.
  • the indexer is a compute-intensive and storage-intensive system that receives the web page information from the crawler.
  • the indexer generally constructs a comprehensive inverted index of every web page uncovered by the crawler.
  • the query engine is an application employed by end users to search the index constructed by the indexer, e.g., to return links to candidate web pages in response to query keywords and/or other criteria (such as language, domain of origin, age, and the like) provided by the end users.
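  • The relationship between the indexer's inverted index and the query engine can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the AND semantics for keywords are assumptions made for illustration; the description does not prescribe them.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def query(index, keywords):
    """Return the URLs that contain every keyword (a simple AND query)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical pages fetched by a crawler.
pages = {
    "http://example.com/a": "network indexing agent",
    "http://example.com/b": "network crawler",
}
index = build_inverted_index(pages)
```

  • A production indexer would also handle stemming, ranking, and the additional criteria mentioned above (language, domain of origin, age); this sketch covers keyword lookup only.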
  • a plurality of computing devices may be connected to one or more networks and/or to one another, for example, by a local area network. The number and form of computing devices connected to such networks may vary dramatically between enterprises.
  • the devices connected to a particular network may vary widely in capacity, speed, platform, and method of network connection.
  • the devices may include corporate servers, desktop computers, laptops, personal digital assistants, embedded sensor and control networks, and the like.
  • the domination of networks and the proliferation of such devices tends to push data storage out to the "edges" of an enterprise's network, making sharing of resources difficult.
  • the data and documents residing on these devices may be inaccessible to conventional crawlers and indexing engines. For example, there may be no identifiable links that refer to the devices and/or their contents. In addition, some devices may only be intermittently connected to the network. Further, the protocol used by a crawler (such as HTTP) may not be supported by local network devices and/or the contents of the devices may be in a format unknown to a crawler and/or indexer.
  • HTTP HyperText Transfer Protocol
  • shared network volumes may capture only a fraction of the data on a device, and the software required to support access may be unsuitable for small, mobile devices.
  • Data repositories frequently require explicit submission by users of the data stored on their devices, and therefore, the repository contents may not be current or comprehensive with respect to data available on many of the devices.
  • Repository indexing may also rely solely on keywords submitted by the users of respective devices whose data is indexed in the repository, and those keywords may not effectively reflect the data actually stored on the respective devices.
  • Knowledge management systems often rely on proprietary formats for content, restrict content to a small number of formats, and/or are specialized for a narrow domain. Thus, such systems may be ineffective for indexing and searching a broad array of information resources available on a network.
  • the present invention is directed generally to systems and methods for indexing resources available on electronic devices connected to a network. More particularly, the systems and methods of the present invention may facilitate an enterprise, such as a business, educational institution, or other organization, indexing, searching for, and/or sharing resources, such as documents, records, databases, media files, e-mail archives, and the like, that may be available on the enterprise's network.
  • the resources may be stored on any device that may be connected to the network, yet may be quickly found, preferably in a substantially secure environment.
  • a system for generating indexing data stored on a plurality of electronic devices connected to a network is provided.
  • the indexing data may include one or more pieces of information related to a respective device, such as information intake, content, and output; hardware configuration, settings, and status; software configuration, settings, and status; system and control logs; manner, rate, pattern, and frequency of use, and the like.
  • the devices for which indexing data may be generated may include desktop computers, laptops, mobile phones, telephones, printers, fax machines, personal digital assistants, portable digital devices, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like.
  • each electronic device on the network includes an indexing agent, e.g., one or more embedded digital processors, hardware components, and/or software modules, for indexing resources on the respective device.
  • the indexing agent is a web server resident on the respective device that may include a translator, an authentication module, a presence module, and/or a thin server.
  • the indexing agent generates indexing data that includes content data describing individual resources stored on the respective device, and location identifiers, such as device-specific URLs and URL links, identifying the location of the individual resources associated with the respective content data, or other Uniform Resource Identifiers ("URIs").
  • URIs Uniform Resource Identifiers
  • the indexing agent extracts content-related information regarding the resources stored on the respective device, and stores the generated indexing data as web pages, for example, in HTML or XML format, or alternatively as text.
  • the indexing data may be stored in memory of the respective device for subsequent use or transfer, as described further below.
  • the indexing agent may include a translator, including one or more modules for translating device-specific information into indexing data that may be interpreted by a crawler or indexer. If the information from the device is already in a format that may be interpreted by the crawler or indexer, e.g., HTML or XML, no translation may be necessary. For resources that are not already crawler or indexer compatible, however, such as word processor documents, media files, and the like, the indexing agent may extract content information regarding the resources, and the translator may translate the information, for example, into HTML or XML, and then the indexing agent may store the translated information as web pages.
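  • The translator's pass-through and translate paths can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the file extension is used as a stand-in for format detection, and the extraction of text from a word processor or media file is assumed to have already happened.

```python
import html


def translate_to_html(title, extracted_text, metadata):
    """Wrap content extracted from a non-web resource in a minimal HTML page."""
    meta_tags = "".join(
        '<meta name="{}" content="{}">'.format(html.escape(k), html.escape(v))
        for k, v in sorted(metadata.items())
    )
    return (
        "<html><head><title>{}</title>{}</head>"
        "<body><p>{}</p></body></html>"
    ).format(html.escape(title), meta_tags, html.escape(extracted_text))


def to_indexing_page(name, content, metadata):
    """Pass web-standard resources through unchanged; translate the rest."""
    if name.lower().endswith((".html", ".htm", ".xml")):
        return content  # already crawler-compatible, no translation needed
    return translate_to_html(name, content, metadata)
```

  • Escaping the extracted text keeps characters such as "<" in a document body from being mistaken for markup by a crawler.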
  • Each indexing agent is preferably configured as one or more modules that operate silently in the background substantially undetected by the user of the respective device as processor cycles and/or other related bandwidth become available.
  • the indexing agent may automatically and periodically index desired portions of the device's resources such that the user of the device need not schedule or otherwise activate the indexing agent to create or update the indexing data.
  • the periodic indexing by the indexing agent may generate a complete index of the resources on the respective device, or it may generate an updated index, i.e., only reflecting resources that have changed since a previous indexing.
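  • One way to produce an updated index that reflects only changed resources is to compare content digests between indexing runs. The sketch below assumes SHA-256 digests and in-memory byte strings; the description does not specify how changes are detected, so this is only one possible approach.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a hex digest that changes whenever the content changes."""
    return hashlib.sha256(content).hexdigest()


def updated_index(resources, previous_digests):
    """Compare current resources with the digests from the previous run.

    Returns (changed_paths, removed_paths, current_digests). A complete
    index corresponds to calling this with an empty previous_digests.
    """
    current = {path: digest(data) for path, data in resources.items()}
    changed = sorted(p for p, d in current.items() if previous_digests.get(p) != d)
    removed = sorted(p for p in previous_digests if p not in current)
    return changed, removed, current
```

  • The returned digest map would be kept by the agent and passed back in on the next periodic run.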
  • Each device also includes a communication interface for making the indexing data, for example, in the form of HTML web pages, available to a crawler and/or indexer.
  • Such communication interfaces may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • a system, in accordance with another aspect of the present invention, includes a network, a plurality of electronic devices that are at least intermittently connected to the network, and one or more index servers.
  • Each of the electronic devices preferably includes an indexing agent, such as that described above.
  • the one or more index servers include a search engine that is connected to the network.
  • the index server may be a centralized computer system that includes a crawler, an indexer, and/or a query engine.
  • the crawler is an application that may periodically contact each of the devices connected to the network, and transfer the respective indexing data generated by the indexing agent on the respective device to the indexer.
  • each indexing agent may simply search the respective indexing data whenever a query is received from an authorized search engine, making a crawler and/or indexer unnecessary.
  • each indexing agent may "push" its indexing data directly to the indexer.
  • the indexing agent may be pre-programmed or instructed to update the indexing data at a desired frequency and transfer the updated indexing data to the indexer, or the indexer may periodically poll the indexing agent of each device.
  • mobile devices i.e., devices whose communication interface may be disconnected from the network for extended periods or whose connection to the network is intermittent, may include an indexing agent that includes a presence module.
  • the presence module may automatically register the presence of the device when it is initially docked in or otherwise connected to the network. Once connected, the indexing agent may automatically push its current set of indexing data to the indexer. Alternatively, the indexing agent may generate a new set of indexing data for the device or generate an updated set of indexing data reflecting any changes since the device's last connection, and transfer the new set to the indexer.
  • the indexing data may be generated in a predetermined order of decreasing priority or criticality.
  • the presence module may also provide a presence service, transmitting a notification to all other users of the network of the appearance or connection of the respective device to the network. Alternatively, only a subset of users may "subscribe" to presence notification for a given set of devices for which the subscribers have sufficient access authority. In this manner, the index server or subscriber may be notified when a specific device of interest connects to the network or when any device connects.
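  • The subscription model for presence notification can be sketched as a small publish/subscribe service. The class and method names are hypothetical, and the access-authority check that the description attaches to subscriptions is omitted here for brevity.

```python
from collections import defaultdict


class PresenceService:
    """Notify subscribers when a device connects to the network."""

    def __init__(self):
        self._by_device = defaultdict(list)  # device id -> callbacks
        self._any_device = []                # callbacks for every device

    def subscribe(self, callback, device_id=None):
        """Subscribe to one device, or to all devices when device_id is None."""
        if device_id is None:
            self._any_device.append(callback)
        else:
            self._by_device[device_id].append(callback)

    def announce(self, device_id):
        """Called by a device's presence module when it connects."""
        for callback in self._any_device + self._by_device.get(device_id, []):
            callback(device_id)
```

  • An index server could subscribe with device_id=None to start crawling whenever any device appears, while an individual user subscribes only to devices of interest.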
  • the device may also include an authentication module that may provide security for the device.
  • the authentication module may authenticate the connection of the device to a given network, for example, by requiring the crawler or indexer to authoritatively identify itself to the device before the indexing agent provides access to the indexing data. This may ensure that the crawler or indexer on the connected network is authorized to access the indexing data on the device.
  • the authentication module may substantially reduce the risk that a mobile device is connected to a foreign network, i.e., connected to a network other than the proper enterprise's network, and provides information to a non-authorized system.
  • the indexing agent may filter access to the indexing data based upon authentication circles, providing increasing levels of access to the indexing data and/or the indexed resources themselves.
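  • Authentication circles can be modeled as nested access levels, with each indexing entry tagged with the innermost circle allowed to see it. The circle names and the numeric ordering below are assumptions for illustration only.

```python
# Hypothetical circles, ordered from least to most trusted.
CIRCLES = {"public": 0, "enterprise": 1, "workgroup": 2, "owner": 3}


def filter_entries(entries, requester_circle):
    """Return only the entries the requester's circle is allowed to see.

    Each entry carries a 'min_circle' naming the least trusted circle
    that may access it; more trusted circles see everything below them.
    """
    level = CIRCLES[requester_circle]
    return [e for e in entries if CIRCLES[e["min_circle"]] <= level]
```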
  • the indexing agent may protect the device, allowing it to be crawled only by authorized crawlers.
  • the index server may be notified and the crawler from the index server may immediately begin crawling the device, collecting pages from its resident indexing agent. Crawling may continue for as long as the device is connected to the network or until completion of transfer of the indexing data.
  • the network may provide dynamic DNS service that allows the crawler to obtain an IP address of the device even if it is dynamically assigned and changes from one network connection to another.
  • This IP address may be reflected by appropriately modifying or supplementing the location identifiers included in the indexing data to ensure that the index server may be able to subsequently identify the correct device having a particular resource therein.
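  • Modifying a location identifier to reflect a device's current address can be sketched with the standard urllib.parse module. The helper name is hypothetical, and the sketch assumes URLs without user-info components.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_location(url, current_host):
    """Replace the host in a location identifier, keeping port, path, and query."""
    parts = urlsplit(url)
    if parts.port is None:
        netloc = current_host
    else:
        netloc = "{}:{}".format(current_host, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))
```

  • An index server could apply this to stored identifiers when a device reconnects with a different dynamically assigned address.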
  • a more selective form of crawling may be utilized. For example, the crawler may wait for direct contact from the device itself, e.g., in which the device informs the crawler of exactly where in the page space of the device to begin crawling. In this manner, the indexing agent on a respective device may instruct the crawler to collect just those pages that have changed since the device was last visited by the crawler. In a further alternative, the indexing agent may inform the crawler of a set of pages to visit according to a predetermined or desired priority.
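  • The selective crawling alternatives can be sketched as a crawl plan that the indexing agent hands to the crawler. The field names and the convention that a lower priority number means more urgent are assumptions for illustration.

```python
def crawl_plan(pages, last_visit):
    """List the URLs changed since the crawler's last visit, most urgent first.

    pages: list of dicts with 'url', 'modified' (timestamp in seconds),
    and 'priority' (lower number = crawl sooner; a hypothetical convention).
    """
    changed = [p for p in pages if p["modified"] > last_visit]
    ordered = sorted(changed, key=lambda p: (p["priority"], p["url"]))
    return [p["url"] for p in ordered]
```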
  • the index server may archive the web pages collected by the crawler.
  • a search engine may be used to view the web pages of a device even if the device is disconnected from the network, since the index server itself has a copy (albeit one that may be out-of-date).
  • the system may facilitate indexing of devices whose communication interface is of low bandwidth or unreliable.
  • Low bandwidth connections may present a particular challenge for crawling the contents of "bandwidth-challenged" devices.
  • the indexing agent may adopt tactics that ameliorate the deficiencies of the connection.
  • the indexing agent and the crawler may have a transport encoding in common such that the indexing agent may compress offline the web pages that it wants crawled and indexed.
  • the indexing agent may direct the crawler to crawl just those pages for which it has generated compressed content. In this manner, the device may make optimal use of its limited bandwidth.
  • the indexing agent may break the indexing data into small, individual "mini-pages," no one of which requires a substantial amount of transmission time.
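  • The two bandwidth tactics, offline compression and small "mini-pages," can be combined in a short sketch. zlib is assumed here as the transport encoding shared by agent and crawler, and entries are assumed not to contain newlines; neither detail comes from the description.

```python
import zlib


def make_mini_pages(entries, per_page):
    """Split indexing entries into small pages and compress each one.

    Each mini-page is compressed independently, so a dropped connection
    costs at most one small retransmission.
    """
    pages = []
    for start in range(0, len(entries), per_page):
        body = "\n".join(entries[start:start + per_page]).encode("utf-8")
        pages.append(zlib.compress(body))
    return pages


def read_mini_page(blob):
    """Crawler side: decompress one mini-page back into its entries."""
    return zlib.decompress(blob).decode("utf-8").split("\n")
```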
  • the indexing agent may control the transfer of indexing data to ensure that personal, sensitive, or proprietary information is substantially securely transferred from the respective device to the indexer.
  • the indexing agent and the crawler may establish a secret session key known to them alone that permits the substantially secure transmission of sensitive information from the device to the crawler.
  • FIG. 1 is a schematic drawing, showing a network architecture, according to the present invention.
  • FIG. 2 is a schematic diagram of a computing device including an indexing agent, in accordance with the present invention.
  • FIG. 3 is a flowchart showing a method for indexing resources on an electronic device, in accordance with the present invention.
  • FIG. 4 is a flowchart showing a method for searching indexing data generated by indexing agents in response to a query from a search engine, in accordance with the present invention.
  • FIG. 5 is a flowchart showing a method for implementing a searchable database of indexing data related to the content of resources on a plurality of devices, in accordance with the present invention.
  • FIG. 6 is a block diagram showing an exemplary computer system in which certain elements and functionality of the present invention may be implemented.
  • FIG. 1 is a top-level block diagram illustrating an example of a network architecture, according to an embodiment of the present invention.
  • Electronic devices 10, 20, 30, n are each at least intermittently connected to a network 40, and an index server 50 is also connected to the network 40.
  • the network 40 may be a local area network ("LAN"), an Intranet, and/or a wireless communications network.
  • the network 40 may include a plurality of different types of networks (not shown), including, but not limited to, a LAN, an Intranet, or a wireless network.
  • the network 40 incorporates all of the electronic devices within an enterprise that are capable of sharing information and/or being connected to the network 40.
  • the index server 50 includes a search engine 52 and a database 58 of indexing data related to each of the devices 10, 20, 30, n.
  • the search engine 52 includes an indexer 54 and a query engine 56, and optionally may also include a crawler (not shown), as described further below.
  • the indexer 54 receives indexing data from the indexing agents on the devices 10, 20, 30, n and creates a searchable index stored in the database 58.
  • the query engine 56 may be used to search the database 58 to identify, locate, and/or access resources related to a given query, as described further below.
  • the devices 10, 20, 30, n connected to the network 40 may include computing devices, such as desktop computers or other fixed workstations, and/or mobile or portable devices, such as laptops, personal digital assistants ("PDAs"), Wireless Application Protocol ("WAP") telephones, portable digital devices, and the like.
  • PDA's personal digital assistants
  • WAP Wireless Application Protocol
  • Each of the devices is generally capable of supporting and includes an indexing agent 60 resident on the device, as described further below with reference to FIG. 2.
  • other electronic devices may be included in the network, such as telephones, printers, fax machines, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like.
  • These devices may also include a resident indexing agent, or, alternatively, they may instead include a server, but may be directly coupled to another device including an indexing agent that may use the server to index resources on the device.
  • Turning to FIG. 2, a schematic of an exemplary computing device 10 is shown that includes an indexing agent 60 configured for generating indexing data 70 regarding resources 68 of the device 10.
  • the device 10 may include a number of modules, such as a server 62, a translator 64, an authentication module 66, and/or a presence module 67, that may be controlled and/or accessed by the indexing agent 60.
  • the device 10 may include conventional memory (not shown) for storing the resources 68 and/or the indexing data 70, and one or more processors for performing various functions, as will be appreciated by those skilled in the art.
  • An exemplary hardware architecture for the device is shown in FIG. 6, and described further below.
  • the device 10 includes a communication interface 72 for connecting the device 10 to the network and/or otherwise communicating with other devices (not shown in FIG. 2).
  • the communication interface 72 may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • the indexing agent 60 is a specialized, resource-conservative, embedded server, e.g., an HTTP web server.
  • the indexing agent 60 may be relatively small compared to conventional crawlers or servers, such that it may be installed on virtually any electronic or computing device, including personal devices such as personal digital assistants (e.g., Palm Pilots), WAP phones, or embedded micro-controllers, yet it may provide all of the services needed to index local resources on the device.
  • the term “indexing agent” is used generally herein to refer to such an embedded web server, or any combination of hardware-based components and/or software-based modules that may perform the indexing features described herein.
  • the term “thin server” may also be used to refer to the indexing agent 60, because of its relatively small size compared to conventional servers.
  • the indexing agent 60 may direct the server 62 to access the device's resources 68, e.g., to serve up the device's file system, configuration data, and/or other resources, to facilitate the indexing agent 60 systematically indexing all of the resources to be indexed therein.
  • the indexing agent may be capable of accessing the device's resources 68 directly, without the intervention of the server 62, and the server 62 may be eliminated.
  • the indexing data 70 generated by the indexing agent 60 preferably includes content data associated with respective resources 68 available on the device 10.
  • the content data generally includes information describing individual resources of the device, preferably based upon the content of the individual resources, and/or metadata associated with the individual resources.
  • the indexing agent 60 may store the compatible information, e.g., text, metatags, and the like, as content data.
  • the indexing agent 60 may use the translator 64 to translate information extracted from a particular resource into a format that is capable of being interpreted by a crawler or indexer.
  • the translator 64 may include one or more device-specific modules for translating particular types of resources, such as application files, word processor files, spreadsheets, media files (such as image, audio and video files), databases, Portable Document Format ("PDF") files, and the like.
  • the translator 64 may also translate device configuration data, such as hardware or software settings, into formats that may be interpreted by a crawler or indexer.
  • the translator 64 may translate device-dependent content into Web-standard formats, such as HTML or XML. Consequently, the indexing data may include information and data for which no extractors previously existed, allowing the indexing data to be crawled and extracted by a crawler of any common web search engine. In effect, the indexing agent 60 and translator 64 may act as an extractor for the benefit of a crawler.
  • the indexing agent 60 also generally assigns location identifiers to each piece of content data to identify the location of the individual resources associated with the respective content data.
  • the location identifiers may identify the location of the respective resources within the device's file space, and/or may identify the specific device itself.
  • the location identifiers are device-specific Uniform Resource Locators ("URLs"), identifying a location where the resource may be found, or other Uniform Resource Identifiers ("URIs"), identifying a process for identifying the location, e.g., to identify a portable device that may be located at one or more locations in the network.
  • URLs device-specific Uniform Resource Locators
  • URI's Uniform Resource Identifiers
  • the network may assign virtual location identifiers specifically for the benefit of an index server (not shown in FIG. 2).
  • the network e.g., the index server
  • This IP address may be reflected in the location identifiers included in the indexing data 70 to ensure that an index server may be able to subsequently identify the correct device having a particular resource therein.
  • the presence module 67 may provide notice for such mobile devices.
  • the presence module 67 may announce to a network when the device 10 is connected to the network.
  • the authentication module 66 may include a security protocol to ensure that the device 10 is connected to a network that is authorized to access the indexing data 70, as described further below.
  • the indexing agent 60 is preferably configured to operate substantially silently in the background undetected by users of the device 10, e.g., as processor cycles and disk bandwidth become available.
  • the indexing agent 60 preferably automatically and periodically indexes the device's resources 68.
  • the indexing agent 60 may generate a complete index of the resources 68 on the respective device, or it may generate an updated index, i.e., only reflecting resources 68 that have changed since a previous indexing.
  • the indexing agent may access resources on the device in order to extract content-related information from the resources. This may involve the indexing agent directing a server resident on the device to serve up the device's resources, e.g., its file system, memory or other storage devices, peripheral devices, and the like, or may involve the indexing agent accessing the resources directly.
  • the indexing agent may also access other resources, such as software and hardware configuration settings of the device.
  • the indexing agent determines whether the respective resources are already in a web-standard format, such as HTML. If not, at step 114, the indexing agent extracts content-related information from the resources, and then, at step 116, translates the information into a web-standard format, as content data. If the respective resources are already in a web-standard format, content-related information, or content data, is extracted from the resources at step 118 as content data.
  • the indexing agent assigns location identifiers to the content data, associating the content data with respective resources, e.g., using URLs that identify the location of the respective resources within the device's file space or other URIs.
  • the indexing agent may assign a dynamic URI to the content data that identifies the device independent of its specific connection to the network, e.g., provided by the network, as explained above.
  • the indexing agent stores the content data and associated location identifiers as indexing data in memory of the device.
  • the indexing agent stores the content-related information as web pages, for example, using HTML or XML markups, or as text.
  • the indexing data is stored in a format that may be easily crawled by a conventional web crawler or interpreted by a conventional indexer, as described further below.
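  • The step of assigning location identifiers and storing them with the content data can be sketched as follows. The http scheme, the host naming, and the record layout are assumptions for illustration; the description allows any device-specific URL or other URI.

```python
from urllib.parse import quote


def location_identifier(device_host, path):
    """Build a device-specific URL for a resource in the device's file space."""
    return "http://{}/{}".format(device_host, quote(path.lstrip("/")))


def build_indexing_data(device_host, content_by_path):
    """Pair each piece of content data with the URL of its source resource."""
    return [
        {"url": location_identifier(device_host, path), "content": content}
        for path, content in sorted(content_by_path.items())
    ]
```

  • Percent-encoding the path keeps file names with spaces or other reserved characters usable as links by a conventional crawler.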
  • the indexing agent 60 may then make the indexing data 70 available to external devices, e.g., an index server, crawler, search engine, and the like (not shown).
  • Turning to FIG. 4, a method is shown wherein the indexing agent may retain the indexing data on the device and respond to queries from a search engine connected to a network.
  • the indexing agent having generated indexing data for the device, may receive a query from a search engine, such as a request from a requestor whether any files on the device include particular keywords.
  • the query may be sent by the search engine to all of the devices connected to the network, or only to a specific subset of devices, such as those used by participants in a particular project group.
  • the indexing agent may then search the indexing data for content data related to the query at steps 132, 134. If no match is found, the indexing agent responds to the search engine at step 136 with a negative response, or alternatively no response at all. If one or more matches are found, the indexing agent may provide the search engine with information regarding the resource(s) whose content data matched the query criteria. The extent of information provided may depend upon the authority of the search engine and/or the requestor to access the indexing data and/or resources on the device. For example, the response may merely indicate that matches were found, e.g., identifying the device, without providing any further details.
  • the response may include the URL or URI for the resource(s) that resulted in matches, possibly also including the content data that resulted in the match.
  • the response may include transferring the resource itself, e.g., to provide a copy of a file on the device that matches the query to the requestor.
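  • The graduated responses described for FIG. 4 can be sketched as an agent-side query handler. The three authority levels, the record layout, and the substring matching are assumptions for illustration; transferring the resource itself is left out.

```python
def answer_query(indexing_data, keywords, authority):
    """Search local indexing data and shape the response to the requestor's authority.

    authority is one of the hypothetical levels 'presence' (matches exist),
    'location' (also return URLs), or 'content' (also return content data).
    """
    wanted = [k.lower() for k in keywords]
    matches = [
        entry for entry in indexing_data
        if all(k in entry["content"].lower() for k in wanted)
    ]
    if not matches:
        return {"found": False}  # the agent could also send no response at all
    response = {"found": True}
    if authority in ("location", "content"):
        response["urls"] = [entry["url"] for entry in matches]
    if authority == "content":
        response["content"] = [entry["content"] for entry in matches]
    return response
```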
  • This method of serving up indexing data "on the fly" may be suitable for smaller enterprises that include only a limited number of devices.
  • the indexing agent pushes its indexing data to a repository or centralized index.
  • each of the devices 10, 20, 30, n preferably includes an indexing agent (not shown), which may push the indexing data from the respective device to the index server 50.
  • This model brings scalable, comprehensive, and speedy enterprise-wide indexing and/or searching to any electronic or computing device within an enterprise.
  • an index server 50 may receive indexing data from a plurality of devices 10, 20, 30, n connected to the network 40.
  • the indexing agents (not shown) on the devices 10, 20, 30, n have previously generated the indexing data, including content data describing content of resources on the respective devices, as described above.
  • the indexing agents may transfer the respective indexing data to the index server using one of several models described further below.
  • the indexer 54 may compile the indexing data into a database 58 at step 152, using any known method for creating an inverted index or other searchable database.
  • the indexing data may be received from all of the devices at one time and then compiled, or indexing data from devices may be compiled intermittently, for example, as indexing data becomes available from mobile devices.
  • the index server may store web pages including the indexing data or otherwise retain a copy of the indexing data as stored by the indexing agents on the respective devices. This may be useful for archiving mobile devices, which may not be connected to the network when a query is submitted.
  • the database 58 may then be used to search for resources in response to queries by requestors having access to the database 58, such as co-workers, human resources personnel, security personnel, and the like.
  • the query engine 56 may receive a query, e.g., including keywords or other search criteria, submitted by a requestor.
  • the query engine 56 may access the database 58 at step 156 to search for indexing data related to the query, e.g., to identify any content data that matches the keywords or other criteria submitted by the requestor.
  • the query engine 56 may search the entire database 58 or a subset of the database 58, as will be appreciated by those skilled in the art.
  • the query engine may send a response to the requestor, indicating whether or not any matches were found. If any matches are found, the query engine may also provide additional information to the requestor, depending upon their access authority.
  • the response may include a device URL or URI or otherwise identify the device(s) that includes resources corresponding to content data satisfying the query, possibly identifying the user of the device. This level of response may be sufficient to identify the devices or users that satisfy the query without divulging the actual content of the resources, which may be sensitive, personal, or otherwise inappropriate for the requestor to access or review.
  • the response may include the location identifiers of any resources satisfying the query, either with or without explicitly identifying the device itself, thereby providing access to the resource. This level of response may be appropriate for shared files, such as those that should be available to members of a common project.
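The compile-then-query flow above can be sketched with a small inverted index. This is only one of the "known methods" the text alludes to; the function names `compile_index` and `query`, the `reveal_locations` flag, and the shape of the received indexing data are assumptions made for illustration.

```python
from collections import defaultdict


def compile_index(indexing_data_by_device):
    """Build an inverted index: term -> set of (device, resource URL) postings."""
    inverted = defaultdict(set)
    for device, resources in indexing_data_by_device.items():
        for url, terms in resources.items():
            for term in terms:
                inverted[term.lower()].add((device, url))
    return inverted


def query(inverted, keywords, reveal_locations=False):
    """Return devices, or (device, URL) pairs, whose content data matches every keyword."""
    postings = [inverted.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    matches = set.intersection(*postings)
    if reveal_locations:
        # Higher-authority response: location identifiers of the resources.
        return sorted(matches)
    # Lower-authority response: only the devices holding matching resources.
    return sorted({device for device, _ in matches})


received = {
    "device-10": {"http://device-10/a.doc": ["indexing", "agent"]},
    "device-20": {"http://device-20/b.doc": ["indexing", "crawler"]},
}
index = compile_index(received)
print(query(index, ["indexing"]))
print(query(index, ["indexing", "crawler"], reveal_locations=True))
```

The `reveal_locations` switch mirrors the two response levels described above: identifying devices without divulging resources, or returning resource locations for shared files.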
  • placing an indexing agent on each of the devices connected to a network may permit a search engine to discover, index, and query resources resident on the devices that were previously unknown, uninventoried, and/or largely inaccessible.
  • a search engine may, with no additional effort, discover and access new sources of content enterprise- wide, as explained further below.
  • the indexing agents and index server may transfer the indexing data between them in a variety of different ways.
  • the indexing agents may generate the indexing data only when instructed to do so by the index server or by the device's user.
  • the users of the devices are not involved in the indexing activity of the indexing agents, i.e., the indexing agent acts autonomously in the background such that the users are unaware of and/or not substantially affected by the indexing agents' activities.
  • the devices automatically generate and transfer ("serve up") the indexing data with a predetermined granularity to keep the database substantially current.
  • the indexing agents periodically generate indexing data, e.g., with substantially fixed or predetermined time periods between the generation of each set of indexing data.
  • the indexing data may be a complete set of indexing data reflecting all of the indexed resources on the device.
  • the indexing data may be an updated set including indexing data only for resources whose status has changed since a previous set, e.g. new, edited, or deleted files.
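An updated set of the kind just described can be derived by comparing two snapshots of the device's resources. The sketch below assumes each snapshot maps a resource URL to a modification stamp; the function name `delta_update` and the snapshot layout are hypothetical.

```python
def delta_update(previous, current):
    """Compare two snapshots mapping resource URL -> modification stamp."""
    return {
        "new": sorted(set(current) - set(previous)),
        "deleted": sorted(set(previous) - set(current)),
        "edited": sorted(
            u for u in current if u in previous and current[u] != previous[u]
        ),
    }


before = {"memo/1": 100, "memo/2": 100, "memo/3": 100}
after = {"memo/1": 100, "memo/2": 250, "memo/4": 300}
print(delta_update(before, after))
```

Only the resources listed in the result would need fresh indexing data; unchanged resources such as `memo/1` are omitted from the update.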
  • the indexing data may be generated "offline," i.e., when the devices are not connected to the network, e.g., at periodic intervals.
  • their indexing agents may automatically initiate the transfer of their respective indexing data to the index server. If the device is disconnected from the network, the indexing agent may discontinue transfer, and store the location within the indexing data where the transfer was discontinued. When the device is again connected to the network, the indexing agent may resume at the location where it left off.
  • mobile devices need not be connected to the network to allow transfer of indexing data in a single session, but transfer may be accomplished incrementally over several successive connections.
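The resume-where-it-left-off behavior can be sketched by storing a byte offset into the serialized indexing data. The class `ResumableTransfer`, its chunk size, and the use of `max_chunks` to model the length of one connection are assumptions for illustration only.

```python
class ResumableTransfer:
    """Tracks how much of a serialized indexing-data payload has reached the index server."""

    def __init__(self, payload: bytes, chunk_size: int = 4):
        self.payload = payload
        self.chunk_size = chunk_size
        self.offset = 0  # location within the indexing data where transfer last stopped

    @property
    def done(self) -> bool:
        return self.offset >= len(self.payload)

    def send_while_connected(self, send, max_chunks: int) -> None:
        """Send up to max_chunks chunks (one connection), remembering the offset."""
        for _ in range(max_chunks):
            if self.done:
                break
            chunk = self.payload[self.offset:self.offset + self.chunk_size]
            send(chunk)
            self.offset += len(chunk)


received = bytearray()
transfer = ResumableTransfer(b"indexing-data-for-device-30")
sessions = 0
while not transfer.done:
    # Each loop iteration models one brief connection to the network.
    transfer.send_while_connected(received.extend, max_chunks=2)
    sessions += 1
print(sessions, bytes(received))
```

Because the offset advances only after a chunk is handed to `send`, a disconnection between chunks loses no data and causes no duplicate transfer in this simplified model.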
  • the index server includes a crawler, such as a conventional web crawler.
  • the crawler is preferably an autonomous robot that systematically and/or periodically contacts each of the devices including resources to be indexed.
  • the crawler may initiate contact successively with the indexing agents on the devices, and exchange "handshakes" to confirm connectability, identify itself, and/or to complete a security protocol confirming that the crawler has sufficient authority to access the indexing data on the respective devices.
  • the indexing agents may serve up their indexing data to the crawler.
  • the indexing agent may offload the task of extracting URLs and content from the crawler by assigning virtual URIs that exist specifically for the benefit of the crawler.
  • the indexing agent may generate a dynamic page that summarizes the content of the original page. This may substantially reduce network transfers and load, and may improve the incisiveness of the indexing.
  • the technique of assigning URIs for the benefit of a crawler may be further extended to create and push device-specific content to a repository or indexer.
  • the indexing agent may have the ability to authenticate and control access.
  • "u" may represent an URL served by an indexing agent "M,” where u denotes content in a format for which no crawler extractor exists, for example, a device-specific content configuration.
  • M may create a virtual URL "v" that, when accessed, may trigger the translation of the content of u into standard HTML. In this manner, M may perform extraction on behalf of the crawler, giving it access to formats for which no crawler extractors have previously existed.
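The relationship between u and v can be sketched as below. The class name `IndexingAgent`, the `?view=crawler` naming scheme for v, and the use of a dictionary as the device-specific format are all assumptions; the source does not specify how virtual URLs are named or how content is stored.

```python
from html import escape


class IndexingAgent:
    """Hypothetical agent M that serves device content and crawler-facing virtual URLs."""

    def __init__(self):
        self.resources = {}  # URL u -> device-specific record (a dict in this sketch)
        self.virtual = {}    # virtual URL v -> original URL u

    def add_resource(self, u, record):
        self.resources[u] = record
        v = u + "?view=crawler"  # assumed naming scheme for the virtual URL
        self.virtual[v] = u
        return v

    def get(self, url):
        """Serve v as HTML translated on demand; serve u in its native form."""
        if url in self.virtual:
            record = self.resources[self.virtual[url]]
            items = "".join(
                f"<dt>{escape(str(k))}</dt><dd>{escape(str(val))}</dd>"
                for k, val in record.items()
            )
            return f"<html><body><dl>{items}</dl></body></html>"
        return self.resources[url]


m = IndexingAgent()
v = m.add_resource("http://pda.example/memo/1", {"title": "Q&A", "body": "ship <v2>"})
print(m.get(v))
```

The translation happens only when v is accessed, so the crawler needs nothing beyond its ordinary HTML extractor.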
  • the indexing agent may be configured as a crawler-aware and/or indexer-aware server that offers device-dependent and/or content-dependent indexes directly to the crawler.
  • the authentication protocols and access controls of the indexing agent may allow the indexing agent to generate crawler-specific content that is optimized for the indexer of the search engine.
  • the indexing agent executing on a personal digital assistant may generate a summary of the Pilot's memo pad that may be suitable for cross-indexing with a departmental project web site.
  • crawler and indexer may be generalized considerably if the device's indexing agent knows of, and cooperates with, the crawler.
  • the architecture outlined here permits the deployment of enterprise-specific and/or domain-specific crawlers and indexers.
  • Crawlers may be deployed within an enterprise to search for a specific form of content, for example, all content relating to a specific project.
  • the crawler may move from device to device throughout the enterprise's network, and, with the cooperation of the indexing agents onboard each of the network's devices, may be served with just the content sought by the crawler, thereby generating relevant and incisive indices.
  • an indexing agent in accordance with the present invention may facilitate access by crawlers when a target device is not connected to the network as the crawler is making its rounds.
  • the indexing agent may announce its presence to the network when the device connects. This event notification may be propagated to all interested subscribers including, for example, a crawler, thereby allowing the crawler to immediately visit the device and push needed content back to the indexer.
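The announce-and-visit sequence can be sketched with a small publish/subscribe hub. The names `ConnectionEvents` and `VisitingCrawler`, and the modelling of each indexing agent as a callable returning its indexing data, are assumptions made for this example.

```python
class ConnectionEvents:
    """Minimal publish/subscribe hub for 'device connected' notifications."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def announce(self, device_id):
        # Propagate the event to every interested subscriber.
        for callback in list(self._subscribers):
            callback(device_id)


class VisitingCrawler:
    """Hypothetical crawler that visits a device as soon as it hears the device connect."""

    def __init__(self, agents):
        self.agents = agents  # device id -> callable returning that device's indexing data
        self.collected = {}   # stands in for content pushed back to the indexer

    def on_device_connected(self, device_id):
        self.collected[device_id] = self.agents[device_id]()


events = ConnectionEvents()
crawler = VisitingCrawler({"pda-1": lambda: {"memo://1": ["project", "budget"]}})
events.subscribe(crawler.on_device_connected)
events.announce("pda-1")  # the indexing agent announces its presence on connect
print(crawler.collected)
```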
  • the indexing agent may pre-index relevant content for the search engine and, when connected, push the indexing data back to the search engine for inclusion in, and integration with, the enterprise index database, as described above.
  • the indexing agent may "bleed" the indexing data incrementally to a search engine over the span of multiple connections to the network.
  • This strategy may be particularly appropriate for a device that is connected for only brief periods or supports only a low bandwidth connection.
  • Because the indexing agent on a device may act in a substantially autonomous fashion, it may index the resources on the device and notify the index server when the device is connected and/or has an update for the index server.
  • mobile and intermittently connected devices may be intelligently included in enterprise data searches.
  • the index server may not only be able to find resources on all devices in the network, but it may also instantly know the connection status of the device(s) that contained the resource(s) pointed to by the indexing data in the database. This opens the possibility of contacting the user of a disconnected mobile device in real-time to request that the device with critical data be connected to the network as soon as possible.
  • the indexing agent may also be able to filter responses for sensitive data.
  • For example, financial, human resources, medical, personal, and/or other sensitive data may be contained appropriately, e.g., using the authentication module of the indexing agent.
  • the index server 50 may include a single server, or it may include a plurality of servers, each sharing a database or generating independent databases, e.g., including different types of compiled indexing data.
  • a single search engine, or a plurality of search engines may be provided.
  • a search engine (C, I, Q) includes three separate, but related, components: a crawler C, an indexer I, and a query engine Q. Each component may be characterized by action (what is done), locale (where it is done), and time (when it is done). In this manner, a taxonomy of search engines may be constructed that characterizes the range of variation available to search engine components with respect to action, locale, and time. Additional information on search engines may be found in S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference, 1998, pp. 107-118, the disclosure of which is expressly incorporated herein by reference.
  • Crawlers are robot applications that substantially autonomously fetch data, preferably in the form of "web pages," for submission to an indexer of a search engine. Additional information on crawlers may be found in A. Rappaport, Robots & Spiders & Crawlers: How Web and intranet search engines follow links to build indexes, Search Tools Consulting, available at www.searchtools.com, the disclosure of which is expressly incorporated herein by reference.
  • Given a location identifier identifying a particular web page, e.g., a URL "u," the crawler C first may decide whether or not to visit the web page designated by u. If affirmative, the crawler may reach u if C can connect to the host of u, and C has the authority to access the page designated by u. Connectivity and access authority are two separate, but related, considerations. The first is the ability to establish a connection, e.g., a TCP connection, between the crawler and the web server and may vary with the position of the crawler within the network (for example, relative to a firewall) or the network quality of service (such as congestion or routing anomalies).
  • the crawler may be required to obtain permission to read the page, for example, if it is password protected. Access may also be restricted to a finite set of users or hosts whose identity may be determined, for example, by inspecting a source IP address of the packet stream associated with the host or using cryptographic methods.
  • the crawler may extract whatever links it can for the next round of crawling. Extraction depends on the form and semantics of the content and the extractors available to the crawler. For example, all crawlers may extract links from HTML pages; however, few crawlers have the extractors required to lift links embedded within PDF or Microsoft Word documents.
  • a crawler is characterized by: a) the IP address of the crawler host which limits, with respect to network topology, routing, and firewalls, the remote hosts to which the crawler may connect; b) access authority; c) a loading policy that gleans URLs of value from the set at hand; d) an extraction policy that determines if the contents of a web page will yield URLs; and e) a set of extractors for extracting URLs from various forms of content.
  • An “extraction policy” E(pu) is a decision procedure that returns true if links (URLs) can be extracted from pu and false otherwise. E may inspect the URL u, the MIME type of u (contained within the HTTP response) and the page contents pu since all offer valuable hints as to the format and structure of pu. For example, if the MIME type is "html," then pu is a page whose structure is well defined (by the HTML specification) and amenable to the extraction of links. If the MIME type is unspecified (the HTTP response omitted the Content-Type header field), then the crawler may examine the syntax of the URL or the content itself to infer the media type. For example, the URL suffix .wav or .au may indicate (by common convention) an audio file that may contain links (rendered as speech) but whose extraction by machine agents is problematic at best. Some audio formats, however, may provide for the inclusion of digital metadata. The crawler, if equipped with a suitable extractor, may be able to extract that metadata for the benefit of the indexer.
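An extraction policy along these lines might consult the MIME type first, then the URL suffix, then the content itself. The sketch below is one possible decision procedure under stated assumptions: the set of extractable types, the suffix lists, the 1024-byte sniffing window, and the function name `extraction_policy` are not taken from the source.

```python
import posixpath
from urllib.parse import urlparse

EXTRACTABLE_TYPES = {"text/html"}   # assumed: types for which a link extractor exists
AUDIO_SUFFIXES = {".wav", ".au"}    # audio by common convention; links not machine-extractable
HTML_SUFFIXES = {".html", ".htm"}


def extraction_policy(u, mime_type=None, page=b""):
    """Return True if links can plausibly be extracted from the page at URL u."""
    if mime_type:
        # Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
        media_type = mime_type.split(";")[0].strip().lower()
        return media_type in EXTRACTABLE_TYPES
    # No Content-Type header: fall back to the syntax of the URL.
    suffix = posixpath.splitext(urlparse(u).path)[1].lower()
    if suffix in AUDIO_SUFFIXES:
        return False
    if suffix in HTML_SUFFIXES:
        return True
    # Last resort: inspect the start of the content itself.
    return b"<html" in page[:1024].lower()


print(extraction_policy("http://host.example/a", "text/html; charset=utf-8"))
print(extraction_policy("http://host.example/greeting.wav"))
```

A crawler equipped with a metadata extractor for audio formats would extend `EXTRACTABLE_TYPES` rather than change the decision order.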
  • a “loading policy” is a decision procedure L(u) that returns true if URL u is deemed suitable for loading and false otherwise.
  • a "page loading policy” determines whether a crawler ignores robot excluded pages and generated pages (such as those produced by CGI scripts) and honors page loading, resource, and time limits with respect to a site or domain. Other considerations may also play a role in the formulation of L.
  • an "access function" A(" ⁇ ", u, P) returns pu if and only if it is possible to access u from " ⁇ " and P grants sufficient authority.
  • a “crawler” C is a tuple (" ⁇ ", A, P, E, G, L), where " ⁇ " is the location (IP address) of C, A is an access function, P is a set of access permissions, E is an extraction policy, G is a nonempty set of extractors, and L is a loading policy.
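The crawler tuple can be rendered as a small data structure whose fields correspond to location, A, P, E, G, and L. Everything concrete in this sketch is an assumption for illustration: the `PAGES` table, the rule that only hosts in `10.*` can connect, the `href` regular expression, and the lambda policies.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple


@dataclass
class Crawler:
    location: str                                                # IP address of the crawler host
    access: Callable[[str, str, FrozenSet[str]], Optional[str]]  # A: returns the page or None
    permissions: FrozenSet[str]                                  # P
    extraction_policy: Callable[[str], bool]                     # E
    extractors: Tuple[Callable[[str], List[str]], ...]           # G (nonempty)
    loading_policy: Callable[[str], bool]                        # L

    def visit(self, u: str) -> List[str]:
        """Apply L, then A, then E, then every extractor in G; return extracted URLs."""
        if not self.loading_policy(u):
            return []
        page = self.access(self.location, u, self.permissions)
        if page is None or not self.extraction_policy(page):
            return []
        links: List[str] = []
        for extract in self.extractors:
            links.extend(extract(page))
        return links


# Toy web: URL -> (permission required, page contents).
PAGES = {"http://intranet.example/a": ("staff", '<a href="http://intranet.example/b">b</a>')}


def access(location, u, permissions):
    """Return the page only if the host is reachable and P grants sufficient authority."""
    required, page = PAGES.get(u, (None, None))
    reachable = location.startswith("10.")  # assumed stand-in for firewall and routing limits
    if page is not None and reachable and required in permissions:
        return page
    return None


def href_extractor(page):
    return re.findall(r'href="([^"]+)"', page)


crawler = Crawler(
    location="10.0.0.5",
    access=access,
    permissions=frozenset({"staff"}),
    extraction_policy=lambda page: "<a " in page,
    extractors=(href_extractor,),
    loading_policy=lambda u: "/cgi-bin/" not in u,
)
print(crawler.visit("http://intranet.example/a"))
```

Keeping connectivity and authority together inside `access` reflects the definition of A above, while L and E remain independent decision procedures that can be swapped per deployment.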
  • FIG. 6 is a block diagram illustrating an exemplary computer system 350 in which elements and functionality of the present invention may be implemented according to one embodiment of the present invention.
  • the present invention may be implemented using hardware, software, or a combination thereof and may be implemented in a computer system or other processing system.
  • Various software embodiments are described in terms of exemplary computer system 350. After reading this description, it will become apparent to a person having ordinary skill in the relevant art how to implement the invention using other computer systems, processing systems, or computer architectures.
  • the device 350 includes one or more processors, such as processor 352. Additional processors may be provided, such as an auxiliary processor to manage input/output, an auxiliary processor to perform floating point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal processing algorithms ("digital signal processor"), a slave processor subordinate to the main processing system (“back-end processor”), an additional microprocessor or controller for dual or multiple processor systems, or a coprocessor. It is recognized that such auxiliary processors may be discrete processors or may be integrated with the processor 352.
  • the processor 352 is connected to a communication bus 354.
  • the communication bus 354 may include a data channel for facilitating information transfer between storage and other peripheral components of the computer system 350.
  • the communication bus 354 further provides the set of signals required for communication with the processor 352, including a data bus, address bus, and control bus (not shown).
  • the communication bus 354 may include any known bus architecture according to promulgated standards, for example, industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and the like.
  • the device 350 also includes a main memory 356 and may also include a secondary memory 358.
  • the main memory 356 provides storage of instructions and data for programs executing on the processor 352.
  • the main memory 356 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM).
  • Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, as well as read only memory (ROM).
  • the secondary memory 358 may include a hard disk drive 360 and/or a removable storage drive 362, for example a floppy disk drive, a magnetic tape drive, an optical disk drive, and the like.
  • the removable storage drive 362 may read from and write to a removable storage unit 364 in a well-known manner.
  • removable storage unit 364 may include a floppy disk, magnetic tape, optical disk, and the like that may be read from and written to by removable storage drive 362.
  • the removable storage unit 364 may include a computer usable storage medium with computer software and computer data stored thereon.
  • secondary memory 358 may include other similar components for allowing computer programs or other instructions to be loaded into the computer system 350.
  • such components may include interface 370 and removable storage unit 372.
  • secondary memory 358 may include semiconductor-based memory such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable read-only memory (EEPROM), or flash memory (block oriented memory similar to EEPROM).
  • Also included may be any other interfaces 370 and removable storage units 372 that allow software and data to be transferred from the removable storage unit 372 to the computer system 350 through interface 370.
  • the device 350 also includes a communication interface 374.
  • Communication interface 374 allows software and data to be transferred between device 350 and external devices, networks, or information sources. Examples of communication interface 374 include but are not limited to a modem, a network interface (for example an Ethernet card), a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • Communication interface 374 preferably implements industry promulgated architecture standards, such as Ethernet IEEE 802 standards, Fibre Channel, digital subscriber line (DSL), asymmetric digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated services digital network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on.
  • Software and data transferred via communication interface 374 may be in the form of signals 378 which may be electronic, electromagnetic, optical or other signals capable of being received by communication interface 374. These signals 378 are provided to communication interface 374 via channel 376.
  • Channel 376 carries signals 378 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, or other communications channels.
  • Computer programming instructions (also known as computer programs, software, or firmware) may be stored in the main memory 356 and the secondary memory 358.
  • Computer programs may also be received via communication interface 374.
  • Such computer programs, when executed, enable the device 350 to perform the features of the present invention.
  • execution of the computer programming instructions may enable the processor 352 to perform the features and functions of the present invention. Accordingly, such computer programs represent controllers of the computer system 350.
  • the term "computer program product" is used to refer to any medium used to provide programming instructions to the computer system 350. Examples of certain media include removable storage units 364 and 372, a hard disk installed in hard disk drive 360, and signals 378. Thus, a computer program product may be a means for providing programming instructions to the computer system 350.
  • the software may be stored in a computer program product and loaded into computer system 350 using hard disk drive 360, removable storage drive 362, interface 370, or communication interface 374.
  • the computer programming instructions, when executed by the processor 352, may cause the processor 352 to perform the features and functions of the invention as described herein.
  • the invention may be implemented primarily in hardware using hardware components, such as application specific integrated circuits ("ASICs"). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons having ordinary skill in the relevant art. In yet another embodiment, the invention may be implemented using a combination of both hardware and software. It is understood that modification or reconfiguration of the device 350 by one having ordinary skill in the relevant art does not depart from the scope or the spirit of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to an indexing agent integrated into each of a plurality of devices connected to a network. The indexing agents generate indexing data including information regarding the content and location of resources on the respective devices. An index server is connected to the network and includes a query engine for searching for resources on the devices. The indexing agents transfer the indexing data from the respective devices to the index server, either automatically or with the aid of a search engine of the index server. The index server compiles the indexing data from the respective devices into a searchable database. The indexing agents may convert the indexing data from device-specific formats into formats that can be interpreted by the index server. The indexing agent may also allow indexing of mobile devices that can be connected to the network, and may restrict indexing to authenticated search engines.
PCT/US2002/006178 2002-02-26 2002-02-26 Systemes et procedes d'indexation de donnees dans un environnement de reseau WO2003073324A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/US2002/006178 WO2003073324A1 (fr) 2002-02-26 2002-02-26 Systemes et procedes d'indexation de donnees dans un environnement de reseau
AU2002252155A AU2002252155A1 (en) 2002-02-26 2002-02-26 Systems and methods for indexing data in a network environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/006178 WO2003073324A1 (fr) 2002-02-26 2002-02-26 Systemes et procedes d'indexation de donnees dans un environnement de reseau

Publications (1)

Publication Number Publication Date
WO2003073324A1 (fr) 2003-09-04

Family

ID=27765164

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/006178 WO2003073324A1 (fr) 2002-02-26 2002-02-26 Systemes et procedes d'indexation de donnees dans un environnement de reseau

Country Status (2)

Country Link
AU (1) AU2002252155A1 (fr)
WO (1) WO2003073324A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008084386A1 (fr) * 2007-01-12 2008-07-17 Truecontext Corporation Procédés et système pour l'orchestration de services et le partage de données sur des dispositifs mobiles
US8285652B2 (en) 2008-05-08 2012-10-09 Microsoft Corporation Virtual robot integration with search
US8775407B1 (en) * 2007-11-12 2014-07-08 Google Inc. Determining intent of text entry
US10437887B1 (en) 2007-11-12 2019-10-08 Google Llc Determining intent of text entry

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001027805A2 (fr) * 1999-10-14 2001-04-19 360 Powered Corporation Cartes d'index sur des hotes de reseau pour la recherche, l'evaluation et le classement
WO2001046856A1 (fr) * 1999-12-20 2001-06-28 Youramigo Pty Ltd Systeme et procede d'indexage
EP1143349A1 (fr) * 2000-04-07 2001-10-10 IconParc GmbH Méthode et appareil pour la génération d'un indice pour un moteur de recherche

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008084386A1 (fr) * 2007-01-12 2008-07-17 Truecontext Corporation Procédés et système pour l'orchestration de services et le partage de données sur des dispositifs mobiles
US9401966B2 (en) 2007-01-12 2016-07-26 ProntoForms Corporation Methods and system for orchestrating services and data sharing on mobile devices
US8775407B1 (en) * 2007-11-12 2014-07-08 Google Inc. Determining intent of text entry
US10437887B1 (en) 2007-11-12 2019-10-08 Google Llc Determining intent of text entry
US8285652B2 (en) 2008-05-08 2012-10-09 Microsoft Corporation Virtual robot integration with search

Also Published As

Publication number Publication date
AU2002252155A1 (en) 2003-09-09

Similar Documents

Publication Publication Date Title
US7519726B2 (en) Methods, apparatus and computer programs for enhanced access to resources within a network
JP3967806B2 (ja) リソースの位置を指名するためのコンピュータ化された方法及びリソース指名機構
US6078929A (en) Internet file system
JP4668567B2 (ja) クライアントベースのウェブクローリングのためのシステムおよび方法
US8027976B1 (en) Enterprise content search through searchable links
JP4363520B2 (ja) ピアツーピア・ネットワークにおけるリソース検索方法
US20060206460A1 (en) Biasing search results
JP4671332B2 (ja) ユーザ識別情報を変換するファイルサーバ
JP2016533594A (ja) ウェブページのアクセス方法、ウェブページのアクセス装置、ルーター、プログラム及び記録媒体
RU2453916C1 (ru) Способ поиска информационных ресурсов с использованием переадресаций
JP5320433B2 (ja) 統合検索装置、統合検索システム、統合検索方法
CA2605838A1 (fr) Methode et systeme pour executer une application normalement en ligne dans un mode hors ligne
JPH10116295A (ja) ドキュメントエージェンシーシステム
US20100306833A1 (en) Autonomous intelligent user identity manager with context recognition capabilities
Di Francesco et al. A storage infrastructure for heterogeneous and multimedia data in the internet of things
KR100714504B1 (ko) 유무선 인터넷을 이용한 개인 단말의 컨텐츠 검색 시스템및 방법
JP2009271919A (ja) 電子データを管理するシステム、装置及び方法
WO2003042874A9 (fr) Systemes et procedes d'indexation de donnees dans un environnement en reseau
WO2003073324A1 (fr) Systemes et procedes d'indexation de donnees dans un environnement de reseau
KR20020003674A (ko) 데이타 동기화 시스템 및 그 방법
RU110847U1 (ru) Информационно-поисковая система
JP2002342144A (ja) ファイル共有システム、プログラムおよびファイル受渡し方法
JP2002202955A (ja) セキュアサーバからの認証要求に対して応答を自動的にフォーミュレートするシステムおよび方法
JPH10334002A (ja) 電子メールによる遠隔操作制御システムおよび制御方法ならびに遠隔操作制御プログラムを格納した記憶媒体
KR100879880B1 (ko) 전자캐비넷(e-Cabinet)서비스 제공방법 및시스템

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WA Withdrawal of international application
NENP Non-entry into the national phase

Ref country code: JP