IES990276A2

IES990276A2 - An inter-computer communications apparatus

Info

Publication number: IES990276A2
Authority: IE
Inventors: Michael Carlile, Val Cassidy
Original assignee: Iesearch Ltd
Priority date: 1999-04-06
Filing date: 1999-04-06
Publication date: 1999-12-29
Also published as: IES81055B2

Abstract

An inter-computer communications apparatus for improving the efficiency and transparency of communications between computer systems and for managing memory to reduce overall network traffic. The method and apparatus described optimise use of system resources by retrieving only selected portions of a target group of information sources, which are stored locally and automatically updated. This allows users to obtain the required information without incurring the overhead associated with network communication traffic, providing complete results in a truly real-time manner.

Description

An Inter-Computer Communications Apparatus

The present invention relates to an inter-computer communications apparatus and more particularly to a method and apparatus for improving the efficiency and transparency of communications between computer systems. The invention also relates to a method for optimising memory management to reduce communication delays and overall network traffic. For the purposes of this specification, the term inter-computer communications refers to communication between remote and local data processing entities.

In many data processing systems, it is common to transfer data between a number of disparate and often geographically remote sources and a local or target computer system. With the advent of the Internet and World Wide Web, occurrences of such transfers have increased at a rate which was impossible to predict. As the number of new network users is increasing exponentially, so too is the number of data requests, placing unprecedented demands on the bandwidth capability of such networks. This rate of increase shows no sign of abating and in fact is likely to increase, not least because many governments have set targets for per capita connection figures.

Bandwidth tolerances are further tested in that information sources for these networks frequently use different hardware and software platforms to the local target computer. These differences increase the complexity of data transfers, often necessitating additional bandwidth provision by network designers and operators. As the number of users increases, the number of different hardware and software platforms also increases, with inevitable problems. When there are a large number of variations it becomes virtually impossible to transfer data in a transparent manner, in that the data must be converted at each source into a format suitable for use by the local target computer. Even this is not a viable solution in all situations, for example, when the source is not designed or configured for this type of operation, having been developed over a long period of time. Such legacy systems contain large quantities of information, which may be required for the purposes outlined by the local system.

Obviously owners of such source systems wish to unlock the information stored to enable users to fully exploit the new technologies.

The data may be transferred for storage or for processing to provide a result, which may then be re-transferred to the source. The most common form of transfer for users of the World Wide Web is an information request. As the numbers of users increase, the number of such requests must similarly increase. Search engines or portals normally process these information requests. It has proven impossible to quantify request traffic through even a select few mainstream search engines because of the rate of increase; however, best estimates put this figure at approximately one hundred million requests per month in early nineteen ninety eight. As each of these requests will seek information from approximately one hundred information sources, the throughput demands placed on the communication system are enormous.

More efficient web browsers, such as that described in International Patent Application no. WO 98/06033 have undoubtedly improved the efficiency of information request processing, however, they have not addressed the fundamental problems associated with the bandwidth required for real time processing of information requests.

There is therefore a need for an inter-computer communication method and apparatus, which will provide communications between disparate data sources and which will overcome the aforementioned problems.

Accordingly there is provided an inter-computer communications apparatus having link means for connecting the apparatus to a computer system for communicating with a plurality of geographically remote computers using an internet data communications protocol, the link means incorporating, a server for processing information requests by retrieving data from one or more remote computers using the internet protocol, means for retrieving data associated with the information request and identifying a data type and address for the retrieved data, and a translation means for automatically identifying a sequential dataset for the retrieved data address, the apparatus performing the sequential steps of:

initiating a domain name seek function using the server to retrieve an interrogation routine stored in local memory;

automatically identifying a target address for a target from a predefined array of target addresses, extracting and compiling a resource locator associated with the address and linking the server to the target;

retrieving and streaming un-interpreted source code before parsing the retrieved code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to a stack for sequential accessing to extract a domain name; and

checking a local datastore for a datastore content value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore.

Preferably, the apparatus performs the further step of automatically generating a unique refreshable timestamp identifier for each datastore content value.

Ideally, the apparatus performs the further steps of:

accessing the datastore to define an access subset by reading the timestamp identifier for each value and comparing the timestamp identifier with a pre-set value;

formatting a resource locator for each access subset entry and addressing a resource associated with the formatted resource locator;

retrieving and streaming un-interpreted source code before parsing the retrieved source code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to the stack for sequential accessing to extract a domain name;

checking the local datastore for a value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore; and

deleting white space from text extracted from the source code to create a condensed character stream and appending the stream to a database.

According to another aspect of the invention there is provided an inter-computer communications method performing the sequential steps of:

initiating a domain name seek function using the server to retrieve an interrogation routine stored in local memory;

automatically identifying a target address for a target from a predefined array of target addresses, extracting and compiling a resource locator associated with the address and linking the server to the target;

retrieving and streaming un-interpreted source code before parsing the retrieved code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to a stack for sequential accessing to extract a domain name;

checking a local datastore for a datastore content value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore;

automatically generating a unique refreshable timestamp identifier for each datastore content value;

accessing the datastore to define an access subset by reading the timestamp identifier for each value and comparing the timestamp identifier with a pre-set value;

formatting a resource locator for each access subset entry and addressing a resource associated with the formatted resource locator;

retrieving and streaming un-interpreted source code before parsing the retrieved source code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to the stack for sequential accessing to extract a domain name;

checking the local datastore for a value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore; and

deleting white space from text extracted from the source code to create a condensed character stream and appending the stream to a database.

The invention will be more clearly understood from the following description of an embodiment thereof with reference to the accompanying drawing, given by way of example only, in which:

Fig. 1 is a flow diagram illustrating operation of an inter-computer communications apparatus formed in accordance with the invention.

For the purposes of this description, specific system architectures, processors, memory devices, timing and performance details have been omitted in order not to unnecessarily obscure the present invention. Thus, the constituent components of the invention have been described in terms of functionality, as many ways of achieving the said functionality will be readily apparent to those skilled in the art.

An inter-computer communications apparatus according to the invention is connected to a computer system to allow communications with a large number of geographically remote computers using Transmission Control Protocol/Internet Protocol (TCP/IP).

The apparatus connects to the computer system using a server, which processes Hypertext Transfer Protocol (HTTP) requests. HTTP is the foundation of the World Wide Web (WWW), where the simplest through to the most complex of browsers use HTTP to issue requests to WWW servers and to receive and display the responses to those requests. The server retrieves information using this protocol from local systems using the method described in detail below. Retrieved data may be one of a number of types depending on the request, for example static data which might include text, graphics or other forms of binary data used to build images on the local system. This data is typically stored in a hierarchical file system in UNIX and is identified using a Uniform Resource Locator (URL). For data of this type the URL identifies the file which is to be transmitted. A translation mechanism of the apparatus is used to identify a sequential dataset to which the URL relates. For example, the URL /w/x/y/z will be taken to refer to the dataset w.x.y member z. Similarly the URL /a/b/c will be taken to refer to the sequential dataset a.b.c. Once translated, the apparatus locates the dataset or dataset/member combination and returns the appropriate information to the computer system in response to the request as described below. The contents of the located data are identified using either a logical member type or the last level of the dataset name identified by the URL.
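The translation from URL to dataset name described above can be sketched in Python as follows. This is a minimal illustration only: the function name `url_to_dataset` and the `partitioned` flag are hypothetical, and the specification does not say how the apparatus decides whether the last URL level is a member name, so that decision is left to the caller here.

```python
from typing import Optional, Tuple


def url_to_dataset(path: str, partitioned: bool) -> Tuple[str, Optional[str]]:
    """Translate a URL path into a (dataset name, member name) pair.

    `partitioned` is an assumption of this sketch: when True, the last
    level of the path is treated as a member of the dataset named by
    the preceding levels; when False, every level is part of the name
    of a sequential dataset and no member is returned.
    """
    levels = [level for level in path.split("/") if level]
    if not levels:
        raise ValueError("URL path contains no levels")
    if partitioned:
        if len(levels) < 2:
            raise ValueError("a dataset/member path needs at least two levels")
        return ".".join(levels[:-1]), levels[-1]
    return ".".join(levels), None
```

Under these assumptions, `url_to_dataset("/w/x/y/z", partitioned=True)` yields the dataset `w.x.y` with member `z`, and a three-level path such as `/a/b/c` with `partitioned=False` yields the sequential dataset `a.b.c`.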

The previously known method of processing information requests is now described to facilitate understanding of the current invention and to highlight the important technical advantages associated therewith.

When a user requests information relating to a search from a search engine, the HTTP client generating the search request attempts to connect to the machine address where an HTTP server is running, based on an address provided by the user associated with a given search engine. The HTTP server of that search engine is normally listening for incoming requests on a TCP/IP port.

The HTTP engine server normally accepts the connection, at which time the client is free to send data. This data may include search criteria or relate to the selection of a preset information grouping on the server. The HTTP server may also elect not to accept the request, at which time the connection will be broken. This may occur where the HTTP server only wishes to service requests from certain TCP/IP addresses. Requests may also be refused due to heavy network traffic as interpreted by Call Accept Criteria (CAC) functionality incorporated into the engine server, which controls access against server performance in order to service accepted requests as efficiently as possible.

The HTTP client sends the HTTP request to the HTTP server, encapsulating various levels of information in standard HTTP headers. Request content is also sent and will normally include length and type HTTP headers to enable the HTTP server to interpret the content correctly. The HTTP server receives the request over the TCP/IP connection and begins processing the request to identify resources. The server then links to appropriate storage and communications links to locate the URLs which identify requested data. The result thus located by the server is then returned to the request source. Once received, the data is processed, which, in the case of a browser, means displaying the output for the user, and the client closes its end of the TCP/IP connection. This process is repeated for each information request or URL page of results to which the information request refers.
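As an illustration of the request format described above, the following Python sketch assembles the bytes of a request carrying content together with its length and type headers. The function name `build_request`, the host `search.example` and the path `/find` are hypothetical, and the choice of a POST request under HTTP/1.0 is an assumption made for the example rather than something stated in the specification.

```python
def build_request(host: str, path: str, body: bytes, content_type: str) -> bytes:
    """Assemble an HTTP request whose headers describe the attached content."""
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: {content_type}\r\n"   # the "type" header
        f"Content-Length: {len(body)}\r\n"    # the "length" header
        "\r\n"                                # blank line ends the headers
    )
    return head.encode("ascii") + body


request = build_request(
    "search.example",                         # hypothetical search engine host
    "/find",                                  # hypothetical search path
    b"q=patent",
    "application/x-www-form-urlencoded",
)
```

The server reads the headers up to the blank line and then uses the declared length and type to interpret the bytes that follow.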

The corresponding functions of the current invention are now described with reference to Fig. 1. In step 1 the server initiates a domain name seek function. This function in turn retrieves an interrogation routine stored in local memory in step 2. A target address is identified by this routine in step 3 from a predefined array of target addresses A. In step 4, a URL associated with the address identified in step 3 is compiled before linking the server to the target in step 5.
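Steps 3 and 4 can be pictured with the short Python sketch below. The contents of `TARGET_ADDRESSES` (standing in for array A), the function name `compile_url`, and the default `http` scheme and `/` path are all assumptions made for illustration; the specification does not list the actual target addresses or the URL layout.

```python
from urllib.parse import urlunsplit

# Hypothetical stand-in for the predefined array of target addresses A.
TARGET_ADDRESSES = ["search.example", "portal.example"]


def compile_url(address: str, scheme: str = "http", path: str = "/") -> str:
    """Compile a URL for a target address (step 4)."""
    return urlunsplit((scheme, address, path, "", ""))


# Step 3 selects each address in turn; step 4 compiles its URL.
target_urls = [compile_url(address) for address in TARGET_ADDRESSES]
```

Each compiled URL would then be handed to the server for linking to the target in step 5.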

Once the link of step 5 has been established the source HTML code is retrieved in step 6. It is important to note that the source HTML is not interpreted by a browser but streamed for processing. This processing is initiated in step 10, where the streamed HTML code is parsed to automatically remove pre-selected code segments. These segments are identified by headers and represent a significant portion of the overall code length. Once all static elements have been removed in step 10, a check is performed in step 11 to identify and remove code portions relating to embedded programs. These programs, which form the active components of the source HTML, again represent a significant portion of the remaining code. Once the removal from the streamed code has been completed in steps 10 and 11, the residual code stream is piped to a stack in step 12. This stack is sequentially accessed in step 15 and contains a list of domain names to be accessed. As each name is accessed from the stack in step 15, a check is performed in step 16 to determine whether the address exists in a local datastore. If the address is not found in the datastore, the address is appended to the datastore in step 17. If the address is found in the datastore the sequential accessing is continued. This process continues until the entire stack created in step 12 has been accessed.
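The flow of steps 10 to 17 can be approximated in Python as shown below. This is a sketch under stated assumptions, not the patented parsing routine: it uses the standard library `HTMLParser`, and it stands in for the header-based discarding of steps 10 and 11 by simply ignoring everything except the `href` attribute of anchor tags. The names `LinkExtractor` and `update_datastore` are hypothetical, and a Python dictionary plays the role of the local datastore.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Push the host name of every absolute link onto a stack (step 12)."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # Only anchors are kept; all other markup is the discarded code.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlsplit(value).hostname  # None for relative links
                if host:
                    self.stack.append(host)


def update_datastore(html: str, datastore: dict) -> dict:
    """Append domain names not yet present in the datastore (steps 15 to 17)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    while parser.stack:                  # step 15: access the stack in turn
        name = parser.stack.pop()
        if name not in datastore:        # step 16: look for a match
            datastore[name] = None       # step 17: append with a null timestamp
    return datastore
```

Names already held in the datastore are left untouched, so repeated links on a page, or links to sites seen earlier, do not create duplicate entries.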

When this process is completed, the data store represents a list of sites to be accessed. The data store also includes a timestamp identifier for each entry. This identifier designates the last access date for a given site or HTML reference. Entries appended to the data store in step 17 will have a null identifier automatically added, indicating that this address has not previously been known to the server.

At a desired time the server data store is accessed and a subset of the site details is defined in step 50. This subset may be defined as a list of entries having a null identifier or may be specified by sites not visited in a given period. The subset defined in step 50 is sequentially accessed in step 51 and a URL for each entry in turn is formatted in step 52. The server receives the formatted URL in step 53 and attempts to link to the site in step 55. If a communications error is detected in step 56 the next entry of the subset is processed as described in steps 50 to 55. If no communications error is detected in step 56 the home page text is extracted from the HTML code in step 57, as described in steps 1 to 12. All white space is deleted from the text in step 58 to leave a condensed character stream. The character stream is then appended to a database entry in step 60. The database entry has appended thereto the URL formatted in step 52, the domain name and the character stream together with the date. Other domain names located on the page are accessed as described above to define a tree structure until all sites have been accessed.
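The subset selection of step 50, the white space deletion of step 58 and the database entry of step 60 can be sketched in Python as follows. The function names, the dictionary layout of the entry and the use of a maximum age in days as the pre-set value are assumptions made for illustration only.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def select_subset(datastore: Dict[str, Optional[date]],
                  today: date, max_age_days: int) -> List[str]:
    """Step 50: pick entries never visited or not visited within the period."""
    cutoff = today - timedelta(days=max_age_days)
    return [name for name, last_access in datastore.items()
            if last_access is None or last_access < cutoff]


def condense(text: str) -> str:
    """Step 58: delete all white space to leave a condensed character stream."""
    return "".join(text.split())


def make_entry(url: str, domain: str, text: str, today: date) -> dict:
    """Step 60: combine the URL, domain name, character stream and date."""
    return {"url": url, "domain": domain,
            "text": condense(text), "date": today.isoformat()}
```

Entries with a null identifier are always selected, which matches the description of newly appended addresses as not previously known to the server.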

It will of course be understood that the invention is not limited to the specific details described herein, which are given by way of example only, and that various modifications and alterations are possible within the scope of the invention.

Claims

1. An inter-computer communications apparatus having link means for connecting the apparatus to a computer system for communicating with a plurality of geographically remote computers using an internet data communications protocol, the link means incorporating, a server for processing information requests by retrieving data from one or more remote computers using the internet protocol, means for retrieving data associated with the information request and identifying a data type and address for the retrieved data, and a translation means for automatically identifying a sequential dataset for the retrieved data address, the apparatus performing the sequential steps of:

initiating a domain name seek function using the server to retrieve an interrogation routine stored in local memory;

automatically identifying a target address for a target from a predefined array of target addresses, extracting and compiling a resource locator associated with the address and linking the server to the target;

retrieving and streaming un-interpreted source code before parsing the retrieved code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to a stack for sequential accessing to extract a domain name; and

checking a local datastore for a datastore content value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore.

2. An inter-computer communications apparatus as claimed in claim 1 wherein the apparatus performs the further step of automatically generating a unique refreshable timestamp identifier for each datastore content value.

3. An inter-computer communications apparatus as claimed in claim 1 and claim 2 wherein the apparatus performs the further steps of:

accessing the datastore to define an access subset by reading the timestamp identifier for each value and comparing the timestamp identifier with a pre-set value;

formatting a resource locator for each access subset entry and addressing a resource associated with the formatted resource locator;

retrieving and streaming un-interpreted source code before parsing the retrieved source code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to the stack for sequential accessing to extract a domain name;

checking the local datastore for a value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore; and

deleting white space from text extracted from the source code to create a condensed character stream and appending the stream to a database.

4. An inter-computer communications method performing the sequential steps of:

initiating a domain name seek function using the server to retrieve an interrogation routine stored in local memory;

automatically identifying a target address for a target from a predefined array of target addresses, extracting and compiling a resource locator associated with the address and linking the server to the target;

retrieving and streaming un-interpreted source code before parsing the retrieved code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to a stack for sequential accessing to extract a domain name;

checking a local datastore for a datastore content value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore;

automatically generating a unique refreshable timestamp identifier for each datastore content value;

accessing the datastore to define an access subset by reading the timestamp identifier for each value and comparing the timestamp identifier with a pre-set value;

formatting a resource locator for each access subset entry and addressing a resource associated with the formatted resource locator;

retrieving and streaming un-interpreted source code before parsing the retrieved source code to discard pre-selected code segments identified by code headers to generate residual code;

piping the residual code stream to the stack for sequential accessing to extract a domain name;

checking the local datastore for a value corresponding to the extracted domain name and in response to a no match condition appending the extracted domain name to the datastore; and

deleting white space from text extracted from the source code to create a condensed character stream and appending the stream to a database.

5. A method and apparatus substantially as hereinbefore described, with reference to and as illustrated in the accompanying drawing.