WO2006131753A2 - Compressing data for distributed storage across several computers in a computational grid and distributing tasks between grid nodes - Google Patents

Compressing data for distributed storage across several computers in a computational grid and distributing tasks between grid nodes

Info

Publication number
WO2006131753A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
server
nodes
provider
processing task
Application number
PCT/GB2006/002124
Other languages
French (fr)
Other versions
WO2006131753A3 (en)
Inventor
Rhys Newman
Jeff Tseng
Matthew Dovey
Steven Young
Original Assignee
Isis Innovation Limited
Application filed by Isis Innovation Limited
Publication of WO2006131753A2
Publication of WO2006131753A3

Classifications

    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This invention relates generally to the provision of a computer grid and, more particularly, to a system for administering a computer grid by acting as a broker between the users of computer resources and the providers of spare/idle computer resources.
  • Grid computing has been proposed and explored as a means for bringing together a large number of computers, in wide-ranging locations and often of disparate types, for the purpose of utilising idle computer processor time and/or unused data storage by those needing processing or storage beyond their own capacity. While the development of public networks, such as the Internet, has facilitated communication between a wide range of computers all over the world, grid computing aims to facilitate not only communication between computers, but also coordination of the processing power and storage capacity of those computers in a useful manner. Most simply, "jobs" (i.e. processing or data storage tasks) are submitted to a managing entity or "broker" of the grid system, which in turn causes the job to be performed by one or more of the computers on the grid.
  • WO2005/060201 A1 describes a system for grid-based data storage in which a non-transparent sequence key identifies a plurality of target clients on the grid computing system, and a backup copy of the data from a source client is stored on the plurality of target clients according to the non-transparent sequence key.
  • This provides a certain minimum security level for the data backed up on the plurality of target clients, and deals with the issue of stored data being lost from an unreliable data storage facility.
  • the first aspect of the present invention is concerned with the practical use of spare storage available on many computers in the world whose storage capacity is not completely utilised by their owners in their current function.
  • A well-known lossless compression algorithm is Huffman coding, wherein the basic idea is to assign short codewords to input blocks having a high probability and longer codewords to those having lower probabilities. Consider each of the following letters as a symbol with its respective probability: A (0.12), E (0.42), I (0.09), O (0.30), U (0.07).
  • First, the two symbols with the smallest probability are combined into a new symbol covering both letters, by adding the probabilities.
  • This step is repeated until there is only one (combined) symbol left, with a probability of 1.
  • The resulting code can be represented in the form of a tree (as shown in Figure 5), with each of the left branches being labelled 0 and the right branches labelled 1.
  • The codeword for each of the letters comprises the sequence of 0's and 1's that lead to it on the tree, starting from the symbol with the probability of 1. In this example, therefore, the codewords for each letter are: A - 100, E - 0, I - 1011, O - 11, U - 1010.
  • A compressed file consists of: a dictionary section where the common sections are stored (once only) and a compressed data stream containing codewords representative of the items in the dictionary, or sections left unreferenced in the dictionary and therefore intended to be read verbatim.
  • files with large areas of constant values or patterns therefore compress very well.
  • a compression routine takes an input file and produces an output file (dictionary + compressed index stream) which enables any "uncompressor" to reverse the process without the need for any additional information.
  • Every compressed file contains the respective dictionary section as well as the index stream.
  • the dictionary section needs to be stored or transmitted as well as the index stream itself. It is therefore an object of the first aspect of the present invention to provide a method and system of data compression whereby the resultant data stream required to be stored (or transmitted) is significantly reduced relative to conventional lossless compression techniques, without loss of data.
  • A method of compressing a data set comprising a plurality of data files, the method comprising identifying recurring sections of data between said plurality of data files, storing a single copy of each identified recurring section of data in a global dictionary, and generating, in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
  • The first aspect of the present invention extends to a global dictionary for use in the method defined above, said global dictionary comprising a plurality of fragments of data representative of recurring sections of data identified between said data files, each fragment being identifiable by a unique identification code.
  • the global dictionary is stored remotely from the index streams, and may be made available for use across a communications network such as the Internet.
  • the global dictionary may be built using redundancy identified across all of the data files and then stored on one or more machines, whereas the resultant index streams can be stored elsewhere, possibly by the owner(s) of the original data files.
  • the global dictionary can therefore facilitate better compression than one built up using a single data file, as in conventional compression techniques, as it can be much bigger and can exploit redundancies between data files as well as within each data file.
  • anyone who wishes to decompress a data file can do so, provided they can connect to the computer (or computer grid) on which the dictionary is stored and get access thereto.
  • Since the global dictionary will be larger and more complete than the equivalent individual dictionaries built for each data file separately, the compression ratio will be significantly higher and, without the need to store the dictionary with every index stream, the compressed files are significantly smaller than when using conventional compression techniques.
  • the global dictionary is updated so as to provide an optimal resource built from statistics of all data files stored by any user using a system employing the claimed method.
  • In the case of a grid-based data storage facility, only the dictionary would need to be stored on the grid; the index streams can be stored elsewhere (possibly by the owner(s) of the data), bearing in mind that the resultant index streams would be minute compared with their original data.
  • the grid administration and the owners of the spare data storage capacity being used would only be liable for correctly storing the global dictionary (or parts thereof) and would not be responsible for storing user data.
  • a system for compressing a data set comprising a plurality of data files, comprising means for identifying recurring sections of data between said plurality of data files, means for storing a single copy of each identified recurring section of data in a global dictionary and means for generating in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
  • the first aspect of the present invention extends further to a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having unused data storage resources for use by one or more of said user nodes, said system being arranged to receive a data set for storage, said data set comprising a plurality of data files, means for compressing said data set using the method defined above so as to generate or update a global dictionary and a plurality of respective index streams, and means for storing said global dictionary on one or more of said provider nodes.
  • the system preferably includes a server for facilitating remote access by a user node to said global dictionary to enable an index file to be decompressed.
  • A distributed computing system comprising one or more user nodes and a server, the server being arranged and configured to receive a data set for compression, access a global dictionary as defined above, identify, within said data set, fragments of data which are in said global dictionary, and replace said identified data fragments in said data set with the respective unique identification codes so as to generate a compressed data set, said server being arranged and configured to forward said compressed data on to one or more remote nodes for storage or processing.
  • the server is further arranged and configured to receive a compressed data set, look up said identification codes in said global dictionary so as to identify the respective data fragments corresponding thereto, replacing said identification codes in said compressed data with said identified corresponding data fragments so as to reconstruct said data set, and forwarding said reconstructed data set on to one or more remote nodes for storage or processing.
  • means are provided whereby, when a data set is received by the server for compression, additional data fragments identified within said data set are assigned a respective identification code and entered into said global dictionary so as to extend or update it.
  • the system may comprise means for transmitting one or more data fragments of said global dictionary to one or more remote provider nodes for storage, and means for recording the identity of a provider node to which said one or more data fragments have been transmitted for storage so as to facilitate subsequent access thereto.
  • the data fragments of the global dictionary may be encoded prior to storage thereof.
  • all data to be stored on unused space throughout the computer grid will comprise, for the most part, fragments of data compressed by use of the system defined above, and each fragment may be encrypted before being stored on a different respective computer resource. If only the central server or "broker" knows where each fragment is stored, and only it knows the key to decrypt the encrypted fragments of data, then no provider's machine can be deemed to be storing the data set (or parts thereof) or to have access to the stored data set, such that the individual or commercial entity providing free space to the broker could not therefore be held liable for the content of the data set being stored.
  • the provider nodes selected to store one or more data fragments of said global dictionary may be selected arbitrarily from a set of resources available for use by said server.
  • the set of available resources may be selected depending on the level of reliability of data storage offered thereby.
  • the service level defining the level of reliability of data storage may be selected by the originator of said data set.
  • the second aspect of the present invention additionally provides a means for managing global dictionary storage with respect to remote resources which may or may not be particularly reliable, depending on the service level offered by the provider nodes.
  • a provider node may offer unused storage resources for use by the user nodes, whereby no notice is required to be given to the broker that the provider node is reclaiming the offered storage resources for its own use.
  • a portion of the global dictionary could be deleted without warning.
  • provision may be made for managing global dictionary storage with respect to less reliable remote resources to provide a greater level of reliability to a user, depending on a service level selected thereby.
  • one or more of said data fragments of said global dictionary may be replicated (before or after encryption or compression) and each replicated data fragment may also be transmitted to different respective provider nodes for storage thereby.
  • checksum data may be generated in respect of said data fragments of said global dictionary stored on separate respective resources and that checksum data, which may again be encrypted, may then be stored on another remote resource.
  • a data set submitted for storage by a user node is stored in the form of X + Y data items on X + Y different respective resources, wherein the X + Y data items comprise X data portions and Y sets of checksum data.
  • a data fragment can be protected from loss due to the actions of unreliable resources on which the pieces are stored.
  • some or all of the X + Y data items can also be duplicated and the duplicate data items stored in different respective resources, for additional reliability, depending on the reliability required by the system.
  • the number of copies stored will reflect the importance of the data fragment (or dictionary entry), data set and the reliability of the resource(s) used to store it.
  • important entries may only be stored on resources which offer a guaranteed level of service (e.g. 1 hour's notice before reclaiming storage space). Less important entries can then be stored on machines where little or no notice will be given, in which case the system can ensure sufficient redundancy exists (in the manner set out above) to meet the overall reliability required by users of the system.
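By way of illustration, the degree of redundancy needed can be derived from the reliability of the nodes on offer. The following minimal sketch is not taken from the application; it assumes, as a simplification, that each provider node loses its copy independently with a known probability:

```python
import math

def copies_needed(p_loss: float, target: float) -> int:
    """Number of replicas so the chance of losing every copy is at most
    1 - target, assuming independent per-node loss probability p_loss."""
    # All n copies are lost with probability p_loss ** n;
    # require p_loss ** n <= 1 - target.
    return math.ceil(math.log(1.0 - target) / math.log(p_loss))

# A fragment on no-notice nodes (40% chance of loss) stored to 99.9% reliability:
print(copies_needed(p_loss=0.40, target=0.999))  # -> 8
```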
  • the invention provides a model for mapping such requirements onto the reliability (or otherwise) of the storage resources available for use (i.e. the likelihood that data compressed by the system will be decompressable at a later date using the data fragments stored on said storage resources).
  • Another problem that can arise with conventional grid computing systems is that if a computationally time-consuming job is submitted by a user to the broker, it may be difficult to secure sufficient idle time on one provider node to complete the processing task. For example, a task may be submitted that is expected to take 12 hours to complete, whereas the individual machines available to process the task may only be "unused" or idle for, say, 2 hours at a time. It is therefore very difficult to schedule such a long job, and even if an available resource starts to execute the job, it may have to be stopped suddenly so that the resource owner (who has priority) can use the resource.
  • a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, allocate said processing task for execution to a first provider node having idle computational capacity, monitor said provider node executing said task for continuing availability of said idle computational capacity, and interrupt and then move said processing task for execution to a second provider node having idle computational capacity if the idle computational capacity of the first provider node becomes unavailable.
  • the virtualisation of the platform on which the processing task will be executed guarantees isolation of the processing task from the physical machine on which it is being executed at any time, i.e. the task only "sees" the single virtual machine running it, irrespective of the physical hardware executing the task.
  • the processing task is periodically "checkpointed", i.e. a snapshot of the status of the processing task is periodically obtained by the provider node currently executing the task, and the state of the task at that time is transmitted to a backup server (which may be the same as the server heretofore referred to).
  • the server can use a checkpoint to restart a processing task on another provider node (not necessarily at the beginning, given that the checkpoint data represents the state of the processing task some way through its execution) if the first provider node becomes unavailable. In this way, only a small fraction of the total job's computer time is wasted by unpredictable resource availability.
  • the server may be arranged and configured to commence execution of a processing task on a plurality of provider nodes, either simultaneously or with different (e.g. staggered) start times, so that multiple check points relating to the same processing task are available and the probability of a significant loss due to the sudden unavailability of a resource is minimised.
  • One of the problems associated with moving a job around the distributed computing system during its execution due to changing availability of hardware is that every time a job is moved, all network transfers will be broken because the IP address of the physical machine to which a job is being moved will not necessarily be the same as that of the machine from which it is being transferred.
  • the system is arranged and configured such that all network communication between the grid job on said provider node (whose idle processor time is currently being used by the system) and the rest of the communications network (Internet) is tunnelled through a specified server.
  • This specified server will be hereinafter referred to as a "network proxy server" and may be the same machine as the above-mentioned server (hereinbefore referred to as the "broker"), but not necessarily.
  • This is preferably achieved by providing means in the provider nodes for transmitting network data using a predetermined transmission protocol (e.g. TCP on Port 443) to said specified ("network proxy") server, said server being arranged and configured to retransmit said data onward to the intended destination.
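The retransmission step lends itself to a conventional socket relay. The sketch below is a minimal illustration under stated assumptions: the destination is fixed at start-up (a real proxy would take it from the tunnelled data), and the host name shown is hypothetical:

```python
import socket
import threading

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until the sending side closes."""
    while data := src.recv(4096):
        dst.sendall(data)
    dst.close()

def relay(listen_port: int, dest_host: str, dest_port: int) -> None:
    """Accept provider-node connections and retransmit their traffic
    onward to the intended destination (and replies back again)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", listen_port))
    server.listen()
    while True:
        inbound, _ = server.accept()
        outbound = socket.create_connection((dest_host, dest_port))
        threading.Thread(target=pipe, args=(inbound, outbound), daemon=True).start()
        threading.Thread(target=pipe, args=(outbound, inbound), daemon=True).start()

# relay(443, "proxy-target.example.org", 80)  # TCP on port 443 usually clears firewalls
```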
  • the virtual machine defined in respect of a processing task will have a defined network connection (again isolated from the physical machine on which it is being executed), and all network connections made by the grid job are "tunnelled" through the network proxy.
  • the main advantages of this include:
  • the server has control (which cannot be overridden or bypassed by any grid job executing on the said provider node) as to which external network resources can be accessed by each grid job. Any attempt by a grid job to connect to an unauthorised external resource can be prevented;
  • this additional feature makes it relatively simple to configure any firewalls between the provider nodes and the server to allow the network data traffic through.
  • a distributed computing system comprising one or more user nodes and a plurality of provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of hardware required to complete said processing task, said server being arranged and configured to realise said virtual machine by distributing said processing task among a plurality of provider nodes, and to facilitate interconnect communications between hardware of respective provider nodes across a dedicated virtual network by configuring tunnelling of said communications through a specified server.
  • the specified server hereinafter referred to as the "network proxy" server, may be the same machine as the earlier-mentioned server, but not necessarily.
  • MPI communications can be performed on an internal virtual network (so the server acts as an MPI switch, in this case).
  • Other multi-processor architectures are also envisaged using this type of virtual machine tunnelling technique, including shared memory multi-processors and other virtual interconnect hardware implemented in the virtual machine clients where the interconnect communications are tunnelled to the server.
  • One of the advantages of an exemplary embodiment of the fourth aspect of the invention, whereby the interconnect communication facilitated by the server is MPI communication between respective processes running on multiple central processors is that the MPI code defined for execution of a multi-processor processing task does not need to be modified to enable the task to be executed within a distributed computing system.
  • a virtual machine for executing a processing task is defined in software in terms of the hardware required, such as the type and number of CPUs, the type and size of memory (e.g. RAM), which memory is associated with each item of hardware, etc., and it is these details that form the above-mentioned "image" of the virtual machine.
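By way of illustration only, such a software definition might amount to little more than a structured record; the field names below are hypothetical rather than taken from the application:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualMachineSpec:
    """The software-side 'image' definition: the hardware the task needs,
    independent of whichever physical provider node runs it."""
    cpu_count: int = 1
    cpu_type: str = "x86"
    ram_mb: int = 512
    disk_mb: int = 2048
    ram_per_cpu_mb: dict[int, int] = field(default_factory=dict)  # CPU index -> attached RAM

task_vm = VirtualMachineSpec(cpu_count=4, ram_mb=4096,
                             ram_per_cpu_mb={0: 2048, 1: 1024, 2: 512, 3: 512})
```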
  • the size of the image could be large, such that transfer of the image is prohibitive with respect to the computation job being performed (i.e. expensive in file transfer resources, and taking too long relative to total computation time).
  • a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, and allocate said processing task for execution to one or more provider nodes having idle computational capacity, said server being further arranged to transfer data representing said resources defining said virtual machine to a provider node executing said processing task on a piecemeal basis as it is required for execution of said processing task.
  • Virtual machine image data is supplied to the client on demand (for example, in response to a request from a provider node on behalf of a grid job running thereon) over a network connection between the configuring machine (server) and the remote resource(s).
  • pre-emptive caching techniques may also be employed so that relevant elements of the virtual machine image data are supplied just in time for use in executing the processing task.
  • the provider node(s) may be arranged and configured to pre-empt the next request for virtual machine image data based on a previous request.
  • means may be provided to recognise a particular process so that virtual machine image data requirements in respect thereof can be preempted.
  • the resources defining the virtual machine preferably include remote access to random access memory (RAM), i.e. the virtual machine is able to simulate computer hardware with greater RAM than is physically available on a particular provider node by using memory (RAM) on other machines, such access being facilitated by the server via the network.
  • a resource owner may offer spare resources to the broker on condition that none of the owner's competitors may use the resource, although the general public would be acceptable; or a company may offer resources on condition that they are not used for a specific set of activities.
  • a distributed computing system comprising one or more user nodes and a plurality of provider nodes each having spare resource(s) in the form of unused data storage resources and/or idle computational capacity for use by one or more user nodes, the system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, wherein said server maintains data representative of each provider node and any predefined conditions in respect thereof relating to acceptable or unacceptable uses thereof, and is arranged to receive a data storage or processing task from a user node and allocate said task for execution by one or more provider nodes having spare resources, said one or more provider nodes being selected to execute said task only if said task comprises an acceptable use thereof as specified by said respective conditions.
  • the server or broker maintains a list of machines and their acceptable uses, and ensures that no inappropriate jobs are sent to machines.
  • the same idea applies to unused storage.
  • the server may be arranged and configured to maintain data representative of different payment scales for different respective uses of one or more of the provider nodes, such that the owner of a provider node can receive differing payments depending on the type of resource being used or the type of job being run (or type of data being stored), such that when tasks are allocated to a provider node, the server determines the type of resource and/or use, determines the payment required to be made to the owner of the provider node to which a task is allocated, and ensures appropriate accounting is kept to pay the owner the price agreed.
  • the provider nodes are preferably provided with a virtual machine platform (in the form of, for example, a daemon process) for controlling execution of a submitted processing task and monitoring various parameters including idle time/spare storage, input policies regarding, for example, times of availability of the respective resource(s), credits/payments accrued, etc.
  • Figure 1 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention;
  • Figure 2 is a schematic flow diagram of a data storage process employed in a computer grid system according to an exemplary embodiment of the present invention;
  • Figure 3 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention;
  • Figure 4 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention, illustrating data transfer therein;
  • Figure 5 is a schematic diagram illustrating the principle of the Huffman coding lossless compression technique.
  • a computer grid system 10 comprises a plurality of grid computers 12 (or "providers") connected, via a data communications network 14, to a server 16 (hereinafter referred to as a "network proxy").
  • the network proxy 16 is connected, again via a data communications network 14, to a plurality of client computers 18.
  • the network proxy 16 is arranged and configured to accept jobs from the client computer(s) 18, assign and communicate the job to one or more of the grid computers 12 and, where appropriate, communicate final results back to the respective client computers 18.
  • a "job" may comprise a processing task or a data storage task.
  • When the network proxy 16 receives a job request from a client computer 18, it determines a number of parameters, including the computational power or quantity of storage resource required to perform the job, and the availability of such computational power or storage resource among the grid computers 12 within a specified service level, which service level determines the required level of reliability of resources to which a respective job is allocated. This will be discussed in more detail later.
  • the network proxy 16 may be arranged and configured to determine, on a dynamic basis and/or in response to a job request from a client computer 18, available resources among the grid computers 12 and the respective reliability thereof, and then allocate the job accordingly.
  • availability of a resource may be dictated by the owner of that resource.
  • the resource may be made continuously available between certain hours of the day, and not outside of those hours.
  • a level of reliability of the provided resource is also set by the owner of that resource. For example, in the case of data storage, the length of notice required to be given to the network proxy or "broker" 16 before data stored on behalf of a client on a grid computer 12 can be deleted by the resource provider may be selected from, say, none, 24 hours, 3 days, 1 week, etc., and the reliability of a respective resource is set accordingly. Any payment to the resource providers in return for the provision of respective resources may be dependent on the level of reliability the provider is prepared to guarantee.
  • When the network proxy 16 receives a data set from a client computer 18 for compression, sequences of data in the data set which are contained in the global dictionary are identified and replaced in the data set by unique identification numbers.
  • the global dictionary is then optionally updated and/or expanded with new entries as appropriate, and each such entry (possibly encrypted and/or compressed in themselves) may be stored on a physically different, arbitrarily selected resource within the grid.
  • the location of each encrypted/compressed global dictionary entry, and the encryption key, are known only to the broker 16, such that no individual resource can be deemed to be storing a complete data set, or to have access to the stored data or to the dictionary entries representing fragments of data sets thus stored using the global compression system. Therefore, the resource provider cannot be liable for the data stored.
  • Critical dictionary entries may be stored on highly reliable resources or at data centres, whereas less critical entries may be stored on less reliable (and therefore less expensive) remote resources. In fact, to increase reliability of data storage when less reliable resources are being used, entries may be replicated and stored a number of times at different physical resources. Another option would be to adopt the "RAID" format.
  • RAID, which is short for Redundant Array of Independent Disks, is a category of disk drives that employ two or more drives in combination for fault tolerance and performance, which enables reliability of data storage to be increased in respect of relatively unreliable disks, without necessarily requiring all data to be replicated.
  • the RAID format provides so-called data striping (spreading out portions of each file, at block or byte level, across multiple disk drives), preferably with an additional parity disk, which is created by generating checksum data in respect of the data stored on the above-mentioned multiple disk drives.
  • the parity data can be used to create a replacement disk.
  • digital data required to be stored could be split up into, say, 8 pieces, which are then encrypted and stored on respective remote unreliable resources (i.e. provider nodes 12).
  • Checksum data in respect of the data stored on each of the 8 unreliable resources is also generated and stored on a 9th unreliable resource. If data is lost from any one of the 9 unreliable resources, then the remaining data stored on the other 8 resources can be used to recreate the lost data.
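This 8 + 1 arrangement behaves exactly like XOR parity in RAID. The sketch below is a minimal illustration, assuming single-parity XOR and zero-padding of the final chunk; a real system would also record the original data length and, per the description, encrypt each piece before storage:

```python
def split_with_parity(data: bytes, pieces: int = 8) -> list[bytes]:
    """Split data into `pieces` chunks plus one XOR-parity chunk, giving
    pieces + 1 items to store on separate provider nodes."""
    size = -(-len(data) // pieces)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(pieces)]
    parity = bytearray(size)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return chunks + [bytes(parity)]

def recover_missing(items: list) -> list:
    """Rebuild the single lost item: XOR-ing all surviving items
    (data chunks and parity alike) yields the missing one."""
    missing = items.index(None)
    size = len(next(i for i in items if i is not None))
    rebuilt = bytearray(size)
    for item in items:
        if item is not None:
            for i, b in enumerate(item):
                rebuilt[i] ^= b
    items[missing] = bytes(rebuilt)
    return items

stored = split_with_parity(b"digital data required to be stored")
stored[3] = None  # one unreliable resource loses its piece without warning
assert recover_missing(stored) == split_with_parity(b"digital data required to be stored")
```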
  • Several different levels of the RAID format are known and can be employed, depending on the reliability of data storage specified by the respective client.
  • less critical entries can either be stored multiple times on multiple respective relatively unreliable resources (wherein the number of copies of each piece of data will be determined by the importance of the data and/or the reliability of the resource used to store it) and/or the RAID format described above can be used.
  • the processing job may be allocated for performance to several computational resources, possibly with staggered start times for processing, thereby increasing the probability that at least one of the resources will complete the job effectively and return a complete result set for transmission by the broker 16 to the respective client computer 18.
  • the network proxy 16 provided by this exemplary embodiment of the invention, has as its principal components an allocation module 20 and a VM (Virtual Machine) configuration module 22, running on the server, and a daemon process 24 running on each grid computer 12.
  • the daemon process comprises software loaded on each grid computer 12 that communicates with the VM module 22 and also monitors various parameters locally, including idle time, policies regarding time/resources, credits/payments, etc.
  • When a job request is received from a client machine 18, the VM module 22 creates a virtual machine to run the job, based on the type, quantity and reliability of resources required to perform the job. It is well known to a person skilled in the art that a virtual machine is an abstract specification of a computing device that can be implemented in different ways so that the requested job can be performed effectively.
  • the job is run by the respective virtual machine, using available resources from one or more of the grid computers 12, according to availability and suitability within defined parameters.
  • a job is received, a virtual machine to run the job is configured by the VM module 22, and the job is allocated to one or more available resources by the allocation module 20.
  • the quantity of processing power required to complete the job will be specified and, based on the available resources, the virtual machine will then submit the job to a selected one or more grid computers 12 for execution.
  • if the processing job is relatively large, it will take a relatively long time (e.g. 12 hours) to complete, whereas an individual grid computer 12 may only be available for use for shorter periods of time. This makes it difficult to schedule a long job for execution in conventional arrangements and, even if sufficient available resource starts executing the job, it may have to be interrupted suddenly in the event that the machine's owner wishes to use the resource (owners always have priority).
  • the present invention deals with this issue: because each job is run by a virtual machine, a job and its VM can be checkpointed (i.e. a "snapshot" of the current state of a job can be taken) periodically at the resource on which it is currently being executed, and the resulting data returned to the backup server/broker 16.
  • if a resource suddenly ceases to be available, the job can be restarted by the broker 16 on another available resource using the state of the grid job and VM as at the last recorded checkpoint.
  • the broker 16 is arranged to monitor the resource(s) on which any job is currently being performed, and re-allocate the job to another available resource in the event that a current resource becomes unavailable.
  • the facility may be provided whereby the broker 16 starts a job in multiple places (not necessarily concurrently), so that multiple checkpoints are available at any given time, and the probability of significant loss due to sudden unavailability of a resource is minimised.
  • the daemon process 24 provided in respect of each grid computer 12 is arranged and configured to perform the periodic checkpointing and return the results to the broker 16 across the network 14.
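To make the checkpoint/restart cycle concrete, here is a toy sketch. It models only application-level state rather than a whole-VM snapshot, and `send_to_broker` stands in for the network transfer performed by the daemon process 24:

```python
import pickle

def run_with_checkpoints(state: dict, send_to_broker, total_steps: int, interval: int) -> dict:
    """Advance the job step by step, periodically "freezing" its state
    so the broker can restart it on another provider node."""
    for step in range(state["step"], total_steps):
        state["result"] += step                    # stand-in for real work
        state["step"] = step + 1
        if state["step"] % interval == 0:
            send_to_broker(pickle.dumps(state))    # the periodic snapshot
    return state

checkpoints = []
# First provider node runs the job part-way before being reclaimed by its owner:
run_with_checkpoints({"step": 0, "result": 0}, checkpoints.append,
                     total_steps=300, interval=100)
# Broker restarts the job on a second node from the last snapshot, not from step 0:
final = run_with_checkpoints(pickle.loads(checkpoints[-1]), checkpoints.append,
                             total_steps=1000, interval=100)
assert final["step"] == 1000
```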
  • the virtualisation functionality provided by the arrangement described above effectively guarantees isolation of each job from the machine on which it is being performed at any specified time. So-called "freezing" (checkpointing) of a job is effected automatically, either pre-emptively (because the broker knows that a resource is scheduled to become unavailable) or dynamically, in response to unpredicted notice that a resource is (or will soon be) unavailable.
  • this exemplary embodiment of the invention provides a virtual software environment in which a client's job can be run (and isolated from the actual resource on which it is being executed at any one time), and the job may be moved around the network during execution as a consequence of the varying availability of the hardware resources, in the sense that jobs can only run on idle resources, and idle time on any given resource may not be sufficient to cover the entire execution time of a job. Under normal circumstances, this would result in all network transfers being broken because the IP address of the machine to which a job is being moved will not be the same as the one from which the job is being transferred.
  • each grid machine 12 has a network card with a respective MAC address (MAC01, MAC03, MAC05, ...), and each virtual machine (VM) run by the daemon process 24 provided on each grid machine also has a unique respective IP address (02, 04, 06, ...). All networking, however, is performed by the network proxy server 16 on behalf of all running jobs, so that when a job moves to a new machine (having a different IP address), the server 16 simply redirects network traffic accordingly.
  • the "outside world" communicates with the grid job via the server and only sees (via the network 14) the IP address(s) that the server 16 has assigned to the grid job.
  • since the IP address(es) for each grid job are thus assigned and controlled by the server, if the server moves the grid job to free a provider's resource, the server can ensure continuity of network connectivity by redirecting all subsequent network traffic to the grid job (now running on the new provider's resource).
  • Network data 26 generated at the daemon process 24 of a grid machine 12 is transmitted via a standard network protocol (e.g. TCP/IP) 28, via the network 14 to the server 16.
  • the server 16 is arranged and configured to "unwrap" the data (so as to effectively remove the associated IP address of the virtual network card from which the data originates) and access the raw data.
  • the server or "network proxy" opens data received from each virtual network card and then resends the data to the rest of the world via the network 14.
  • the network proxy server handles network data at the lowest level, thereby being able to proxy all network protocols at the same time.
  • each virtual machine will have a network connection via a virtual network card having a unique, respective IP address.
  • this network connection is effectively “tunnelled” through the server 16 to the rest of the world, i.e. the network proxy provides a so-called “bottleneck" for all network connections, the following principal advantages are attained:
  • Any network activity originating from the grid computers 12 always appears (to "the rest of the world") to be coming from a specified IP address (or possibly one of a predefined set of IP addresses) allocated by the server. Thus, even if a processing job has been executed by several different grid computers, the client computer 18 (and indeed every other computer on the internal and external network) sees only the single IP address for the grid job's VM.
  • the administrator of the server has complete control, which cannot be overridden or bypassed by the grid job, as to the types of network resources that can be connected to. Any attempt by a grid job to connect to unauthorised network resources can be prevented.
  • Patterns of behaviour can be monitored such that potential problems (e.g. denial of service attacks) can be identified accordingly. In this case, a record of network activity is maintained and the number and nature of network connections determined; lots of jobs accessing the same network address over and over again is a main characteristic of a Denial of Service attack. This is possible because of the virtual layer provided by the network proxy, i.e. because all jobs are routed through the virtual layer, all network activity can be seen.
  • a computational request or job may require the use of multiple CPUs using MPI (Message Passing Interface) communications between the processes on each CPU.
  • these MPI communications cannot, under normal circumstances, occur directly between each remote resource because of firewall restrictions.
  • all network traffic is tunnelled/routed through the network proxy server 16 through the communications channel that is permitted between the server and the remote resources 12.
  • the network proxy 16 effectively becomes the MPI switch in this case, in the sense that it directs data between the virtual CPUs (i.e. CPUs within the virtual machines 24) of the various remote resources.
  • an MPI switch is a high speed interconnect switch, and the MPI code used to effect MPI communications is a very low level code (almost at machine code). In a preferred embodiment, it is unnecessary for this MPI code (defined by the computational job) to be modified.
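As a toy model of the proxy acting as an MPI switch, the sketch below routes every message between ranks through per-destination queues held centrally; no rank ever addresses another directly. The class and method names are illustrative, not drawn from the application or any MPI library:

```python
import queue
from collections import defaultdict

class ProxySwitch:
    """Stand-in for the network proxy acting as an MPI switch: messages
    between virtual CPUs on different provider nodes are all routed here."""
    def __init__(self):
        self._inbox = defaultdict(queue.Queue)   # destination rank -> queue

    def send(self, src: int, dest: int, payload: bytes) -> None:
        self._inbox[dest].put((src, payload))

    def recv(self, rank: int) -> tuple:
        return self._inbox[rank].get()           # blocks, like MPI_Recv

switch = ProxySwitch()
switch.send(src=0, dest=1, payload=b"halo cells")
print(switch.recv(rank=1))  # -> (0, b'halo cells')
```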
  • the virtual machine (VM) module 22 provides software that defines the number and type of CPUs required to execute a computational request, the quantity and type of memory, and the relationship between the memory and respective CPUs.
  • a virtual machine may be defined that comprises a number of CPUs, a certain quantity of various types of memory, and the respective relationships therebetween, and the computational request requiring these resources for execution thereof runs on this VM.
  • cluster software may be provided which can be used to cluster VMs configured to customer requirements, so as to imitate clustered hardware machines.
  • Various multi-processor architectures are envisaged using the virtual machine tunnelling techniques proposed herein, including shared memory multi-processors and other virtual interconnect hardware implemented in the virtual machine clients where the interconnect communications are tunnelled to the server.
  • This network “tunnelling”, as described above, comprises a 2-way point-to-point connection protocol (low level network access) which wraps network data up in a protective layer (to "hide” them from the firewall) at the server 16 and then sends them to the specified destination machine via a dedicated network.
  • the underlying concept of at least one aspect of the present invention is the provision of a computer grid with a server or "network proxy" acting as a broker between users of computer resources and providers of spare/idle computer resources.
  • a virtual machine module 22 is used to define, in software, a hardware specification designed to meet the requirements of a computational request. This virtual machine is then realised using the remote resources provided by the grid computers 12. Under normal circumstances, if a virtual machine is run on a remote resource, there is an initial delay before any useful computation can be done, while the "image" of the virtual machine (disk, memory, etc), as defined by the VM module 22, is transferred to the daemon process 24 of the selected remote resource (allocated by the allocation module 20). For some virtual machines, the size of this image could be relatively large, such that the transfer of the image is prohibitive with regard to the computation job being done (i.e. it is expensive in file transfer resources and takes too long with respect to total computation time).
  • a solution to this problem is provided in accordance with this exemplary embodiment of the present invention by enabling the grid computers 12 allocated to realise a virtual machine to commence useful work without having the complete virtual machine images. Instead, virtual machine image data is supplied to the allocated remote resource 12 over the network 14 by the server 16 in an incremental fashion. Thus, whatever is required by the remote resource 12 allocated to execute a computational job may be transferred thereto in response to a request, or this may occur pre-emptively (for example, using algorithms currently employed in conventional Operating System (OS) virtual memory or hard disk caching systems, which send data in anticipation of the next request based on the previous data requested).
  • the system may be enabled to record how a specific class of grid job typically behaves, and thus be able to pre-empt the requirements of the remote resource from knowledge of previous times the grid job was run.
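A minimal sketch of such on-demand image supply, with one block of pre-emptive read-ahead; the `fetch_block` callable stands in for whatever network request the daemon would make to the server, and `request_block_from_server` is a hypothetical name:

```python
class LazyImage:
    """Serve reads from a virtual-machine image whose blocks are fetched
    from the server only when first touched, plus one block of read-ahead."""
    def __init__(self, fetch_block, block_size: int = 4096):
        self._fetch = fetch_block              # e.g. a network request to the server
        self._block_size = block_size
        self._cache: dict[int, bytes] = {}

    def read(self, offset: int, length: int) -> bytes:
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        for n in range(first, last + 2):       # last + 2: pre-emptively cache one extra block
            if n not in self._cache:
                self._cache[n] = self._fetch(n)
        blob = b"".join(self._cache[n] for n in range(first, last + 1))
        start = offset - first * self._block_size
        return blob[start:start + length]

# image = LazyImage(lambda n: request_block_from_server(job_id, n))
# image.read(10_000_000, 512)  # only the touched (and next) blocks cross the network
```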
  • a resource owner may offer spare/idle resources to the broker 16, on condition that, for example, none of the owner's competitors can use the resource, although use by the general public would be acceptable, or the offer of use of the resource may be conditional in that it cannot be used for one or more specific activities.
  • the broker maintains a list of machines (available resources) and their acceptable uses, and the allocation module 20 ensures that no inappropriate jobs or data (for storage) are sent to respective selected machines.
  • provision may be made for the owner of a resource to receive different payment, depending on the resource being used and/or the type of job being run (or type of data being stored).
  • the broker is arranged and configured to provide appropriate accounting facilities so that the owner of a resource is recompensed correctly at the agreed price structure.
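The acceptable-use matching described above amounts to a filter applied before allocation. A minimal sketch follows, with a hypothetical policy schema (the application does not prescribe one):

```python
def eligible_providers(providers: list, job: dict) -> list:
    """Keep only provider nodes whose owner-defined conditions permit this
    job: the submitting user must not be blocked, nor the job's activity."""
    return [
        p for p in providers
        if job["user"] not in p["blocked_users"]
        and job["activity"] not in p["blocked_activities"]
    ]

providers = [
    {"id": "node-1", "blocked_users": {"rival-corp"}, "blocked_activities": set()},
    {"id": "node-2", "blocked_users": set(), "blocked_activities": {"video-encoding"}},
]
job = {"user": "rival-corp", "activity": "video-encoding"}
print([p["id"] for p in eligible_providers(providers, job)])  # -> []
```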

Abstract

A distributed computer system in which a data set may be received, compressed and at least parts thereof transmitted and stored on remote provider nodes having storage resources available for use. The data set may be compressed using a global dictionary, possibly available centrally, such that only the compressed index stream needs to be stored on the remote provider nodes.

Description

Computer Grid
This invention relates generally to the provision of a computer grid and, more particularly, to a system for administering a computer grid by acting as a broker between the users of computer resources and the providers of spare/idle computer resources.
When a computer is operating but not actively performing computations for a user, it is said to be idle. Because of their incredible speed, modern computers are idle most of the time, not only when they are running screen savers, but even when they are being used in common tasks, such that a large proportion of the world's computational capacity currently goes to waste. Equally, it is considered that a large proportion of the available data storage capability also lies unused at any one time.
Grid computing has been proposed and explored as a means for bringing together a large number of computers, in wide-ranging locations and often of disparate types, for the purpose of utilising idle computer processor time and/or unused data storage by those needing processing or storage beyond their own capacity. While the development of public networks, such as the Internet, has facilitated communication between a wide range of computers all over the world, grid computing aims to facilitate not only communication between computers, but also coordination of the processing power and storage capacity of those computers in a useful manner. Most simply, "jobs" (i.e. processing or data storage tasks) are submitted to a managing entity or "broker" of the grid system, which in turn causes the job to be performed by one or more of the computers on the grid.
However, while the concept of grid computing (either processing or data storage) holds great promise, the execution of the concept has not been without difficulties. For example, WO2005/060201 A1 describes a system for grid-based data storage in which a non-transparent sequence key identifies a plurality of target clients on the grid computing system, and a backup copy of the data from a source client is stored on the plurality of target clients according to the non-transparent sequence key. This provides a certain minimum security level for the data backed up on the plurality of target clients, and deals with the issue of stored data being lost from an unreliable data storage facility. However, there are still a number of practical, legal and technological reasons that continue to prevent widespread deployment of (computer) grid based technology.
The first aspect of the present invention is concerned with the practical use of spare storage available on many computers in the world whose storage capacity is not completely utilised by their owners in their current function.
One significant consideration, not only in the field of grid data storage, but in all areas of data file storage, transmission and retrieval, is that of compression. Conventional compression techniques are used to remove (or at least reduce) redundancy within a data file, thereby reducing the number of bits required to represent a file and thus conserving bandwidth (for transmission) or memory (for storage). This is achieved by identifying redundant sections of the file and replacing them with a single copy of that section and a codeword at each occurrence of that section in the compressed stream representative of the complete section.
A well-known lossless compression algorithm is called Huffman Coding, wherein the basic idea is to assign short codewords to input blocks having a high probability and longer codewords to those having lower probabilities. Consider the following example, treating each of the following letters as a symbol with its respective probability: A (0.12), E (0.42), I (0.09), O (0.30), U (0.07). First, the two symbols with the smallest probability are combined into a new symbol covering both letters by adding the probabilities. This step is repeated until there is only one (combined) symbol left with a probability of 1. The resulting code can be represented in the form of a tree (as shown in Figure 5), with each of the left branches being labelled 0 and the right branches labelled 1. The codeword for each of the letters comprises the sequence of 0's and 1's that lead to it on the tree, starting from the symbol with the probability of 1. In this example, therefore, the codewords for each letter are:
A - 100, E - 0, I - 1011, O - 11, U - 1010
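For illustration, the procedure can be reproduced in a few lines. The sketch below is not part of the application; with the tie-breaking shown it yields exactly the codewords listed above:

```python
import heapq

def huffman_codes(probabilities: dict) -> dict:
    """Repeatedly merge the two least-probable symbols; codewords are then
    read off the resulting tree (first-popped branch 0, second branch 1)."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # smallest probability
        p1, _, codes1 = heapq.heappop(heap)   # next smallest
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_codes({"A": 0.12, "E": 0.42, "I": 0.09, "O": 0.30, "U": 0.07}))
# -> {'E': '0', 'A': '100', 'U': '1010', 'I': '1011', 'O': '11'}
```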
It should be noted here that many other types of lossless compression will be well known to a person skilled in the art, and the present invention is not intended to be limited in any way to a particular type of compression.
Viewed in essence, therefore, a compressed file consists of: a dictionary section where the common sections are stored (once only) and a compressed data stream containing codewords representative of the items in the dictionary, or sections left unreferenced in the dictionary and therefore intended to be read verbatim. Thus, files with large areas of constant values or patterns compress very well. Conventionally, a compression routine takes an input file and produces an output file (dictionary + compressed index stream) which enables any "uncompressor" to reverse the process without the need for any additional information.
However, every compressed file contains the respective dictionary section as well as the index stream. Thus, every time a compressed file needs to be stored or transmitted, the dictionary section needs to be stored or transmitted as well as the index stream itself. It is therefore an object of the first aspect of the present invention to provide a method and system of data compression whereby the resultant data stream required to be stored (or transmitted) is significantly reduced relative to conventional lossless compression techniques, without loss of data.
In accordance with the first aspect of the present invention, there is provided a method of compressing a data set comprising a plurality of data files, the method comprising identifying recurring sections of data between said plurality of data files, storing a single copy of each identified recurring section of data in a global dictionary and generating, in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
Thus, instead of just using redundancy within each file to compress each file separately, redundancy between a whole set of files is used to produce respective index streams and a single global dictionary which can be used to decompress all of the data files. The first aspect of the present invention extends to a global dictionary for use in the method defined above, said global dictionary comprising a plurality of fragments of data representative of recurring sections of data identified between said data files, each fragment being identifiable by a unique identification code.
Preferably, the global dictionary is stored remotely from the index streams, and may be made available for use across a communications network such as the Internet. Thus, in respect of a large number of data files, the global dictionary may be built using redundancy identified across all of the data files and then stored on one or more machines, whereas the resultant index streams can be stored elsewhere, possibly by the owner(s) of the original data files. The global dictionary can therefore facilitate better compression than one built up using a single data file, as in conventional compression techniques, as it can be much bigger and can exploit redundancies between data files as well as within each data file. Anyone who wishes to decompress a data file can do so, provided they can connect to the computer (or computer grid) on which the dictionary is stored and get access thereto. As the global dictionary will be larger and more complete than the equivalent individual dictionaries built for each data file separately, the compression ratio will be significantly higher and, without the need to store the dictionary with every index stream, the compressed files are significantly smaller than when using conventional compression techniques.
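A minimal sketch of the scheme described above, assuming fixed-size fragments and hash-derived identification codes; the application prescribes neither, so both are illustrative choices:

```python
import hashlib

def compress_data_set(files: dict, fragment_size: int = 64):
    """Build one global dictionary across the whole set of files and an
    index stream per file; each recurring fragment is stored once only."""
    dictionary: dict = {}     # identification code -> fragment bytes
    index_streams: dict = {}  # file name -> list of codes, in order
    for name, data in files.items():
        stream = []
        for i in range(0, len(data), fragment_size):
            fragment = data[i:i + fragment_size]
            code = hashlib.sha256(fragment).hexdigest()[:16]  # unique identification code
            dictionary.setdefault(code, fragment)  # new fragments extend the dictionary
            stream.append(code)
        index_streams[name] = stream
    return dictionary, index_streams
```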
Preferably, when a data file is required to be compressed, the global dictionary is updated so as to provide an optimal resource built from statistics of all data files stored by any user using a system employing the claimed method.
In the case of a grid-based data storage facility, only the dictionary would need to be stored on the grid; the index streams can be stored elsewhere (possibly by the owner(s) of the data), bearing in mind that the resultant index streams would be minute compared with their original data. Thus, the grid administration and the owners of the spare data storage capacity being used would only be liable for correctly storing the global dictionary (or parts thereof) and would not be responsible for storing user data.
Also in accordance with the first aspect of the invention, there is provided a system for compressing a data set comprising a plurality of data files, comprising means for identifying recurring sections of data between said plurality of data files, means for storing a single copy of each identified recurring section of data in a global dictionary and means for generating, in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
Once again, providing a single global dictionary in respect of a large number of data files means that relatively very small index streams are all that need to be additionally stored. As a result, a massive data storage resource can be provided for users without needing massive resources.
It will be appreciated that any one of a number of compression techniques can be used to identify redundancies between the data files (and within each file, where appropriate) and build the global dictionary, and the present invention is not intended to be limited in this regard.
The first aspect of the present invention extends further to a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having unused data storage resources for use by one or more of said user nodes, said system being arranged to receive a data set for storage, said data set comprising a plurality of data files, means for compressing said data set using the method defined above so as to generate or update a global dictionary and a plurality of respective index streams, and means for storing said global dictionary on one or more of said provider nodes.
The system preferably includes a server for facilitating remote access by a user node to said global dictionary to enable an index file to be decompressed.
As intimated above, a major difficulty that must be considered in the execution of the concept of grid computing is that any provider offering unused storage resources (or indeed, computational capacity) for use by third parties via a grid computing system may be liable (or deemed complicit) if someone stores illegal (e.g. pornographic) material on their machines.
In accordance with a second aspect of the present invention, there is provided a distributed computing system comprising one or more user nodes and a server, said server being arranged and configured to receive a data set for compression, access a global dictionary as defined above, identify, within said data set, fragments of data which are in said global dictionary, and replace said identified data fragments in said data set with the respective unique identification codes so as to generate a compressed data set, said server being further arranged and configured to forward said compressed data on to one or more remote nodes for storage or processing.
Preferably, the server is further arranged and configured to receive a compressed data set, look up said identification codes in said global dictionary so as to identify the respective data fragments corresponding thereto, replace said identification codes in said compressed data with said identified corresponding data fragments so as to reconstruct said data set, and forward said reconstructed data set on to one or more remote nodes for storage or processing.
Preferably, means are provided whereby, when a data set is received by the server for compression, additional data fragments identified within said data set are assigned a respective identification code and entered into said global dictionary so as to extend or update it.
The system may comprise means for transmitting one or more data fragments of said global dictionary to one or more remote provider nodes for storage, and means for recording the identity of a provider node to which said one or more data fragments have been transmitted for storage so as to facilitate subsequent access thereto. The data fragments of the global dictionary may be encoded prior to storage thereof.
Thus, some or all of said data fragments of said global dictionary may be encrypted and/or compressed individually using conventional compression techniques.
Thus, all data to be stored on unused space throughout the computer grid will represent, for the most part, fragments of the data thus compressed by use of the system defined above, and each fragment may be encrypted before being stored on a different respective computer resource. If only the central server or "broker" knows where each fragment is stored, and only it knows the key to decrypt the encrypted fragments of data, then no provider's machine can be deemed to be storing the data set (or parts thereof) or to have access to the stored data set, such that the individual or commercial entity providing free space to the broker could not therefore be held liable for the content of the data set being stored.
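A minimal sketch of this broker-only bookkeeping follows, using the third-party Python `cryptography` package for symmetric encryption. The `Broker` class, its fields, the arbitrary node selection, and the modelling of provider nodes as in-memory stores are all illustrative assumptions of the sketch, not the system's actual implementation.

```python
import random

from cryptography.fernet import Fernet  # third-party 'cryptography' package


class Broker:
    """Only the broker holds the keys and the fragment locations."""

    def __init__(self, provider_nodes: dict):
        self.providers = provider_nodes  # node_id -> {fragment_id: ciphertext}
        self.registry = {}               # fragment_id -> (node_id, key); broker-only

    def store_fragment(self, fid: str, fragment: bytes) -> None:
        key = Fernet.generate_key()
        node = random.choice(list(self.providers))      # arbitrary node selection
        self.providers[node][fid] = Fernet(key).encrypt(fragment)
        self.registry[fid] = (node, key)                # known only to the broker

    def fetch_fragment(self, fid: str) -> bytes:
        node, key = self.registry[fid]
        return Fernet(key).decrypt(self.providers[node][fid])
```

Because each provider node holds only an encrypted fragment, and neither the key nor the locations of the other fragments, no single node can be said to hold (or read) any stored data set.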
The provider nodes selected to store one or more data fragments of said global dictionary may be selected arbitrarily from a set of resources available for use by said server. The set of available resources may be selected depending on the level of reliability of data storage offered thereby. The service level defining the level of reliability of data storage may be selected by the originator of said data set.
The second aspect of the present invention additionally provides a means for managing global dictionary storage with respect to remote resources which may or may not be particularly reliable, depending on the service level offered by the provider nodes. For example, a provider node may offer unused storage resources for use by the user nodes, whereby no notice is required to be given to the broker that the provider node is reclaiming the offered storage resources for its own use. In this case, a portion of the global dictionary could be deleted without warning. Thus, in a preferred embodiment of the invention, provision may be made for managing global dictionary storage with respect to less reliable remote resources to provide a greater level of reliability to a user, depending on a service level selected thereby.
In one exemplary embodiment of the second aspect of the invention one or more of said data fragments of said global dictionary may be replicated (before or after encryption or compression) and each replicated data fragment may also be transmitted to different respective provider nodes for storage thereby. Thus, if one data fragment is lost due to an unreliable resource, the duplicate of that data fragment, stored on a different resource, can be used to complete the global dictionary and thereby ensure that compressed data sets can be decompressed on demand.
In another exemplary embodiment of the invention, checksum data may be generated in respect of said data fragments of said global dictionary stored on separate respective resources and that checksum data, which may again be encrypted, may then be stored on another remote resource. Thus, a data set submitted for storage by a user node is stored in the form of X + Y data items on X + Y different respective resources, wherein the X + Y data items comprise X data portions and Y sets of checksum data. Depending on the size of Y relative to X and the checksum algorithm used, the original X data may be reconstructed using any Z available data items, where X <= Z <= X+Y. Thus, a data fragment can be protected from loss due to the actions of unreliable resources on which the pieces are stored. Of course, some or all of the X + Y data items can also be duplicated and the duplicate data items stored on different respective resources, for additional reliability, depending on the reliability required by the system.
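As a concrete illustration of the simplest case (Y = 1), the sketch below uses a single XOR parity piece, from which any one lost piece can be rebuilt from the survivors; larger values of Y would call for a stronger checksum scheme (e.g. Reed-Solomon), which the above description leaves open. Recording the original length, so that padding can later be stripped, is assumed to be handled elsewhere.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, x: int = 8):
    """Split into x data pieces plus one XOR parity piece (X + Y, with Y = 1)."""
    size = -(-len(data) // x)  # ceiling division so all pieces have equal length
    pieces = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(x)]
    parity = reduce(xor, pieces)  # the checksum stored on the (X+1)th resource
    return pieces, parity


def recover(pieces: list, parity: bytes, lost: int) -> bytes:
    """Rebuild the one lost piece as the XOR of the parity and the survivors."""
    survivors = [p for i, p in enumerate(pieces) if i != lost]
    return reduce(xor, survivors, parity)
```

Here X = 8 and Z = 8: any 8 of the 9 stored items suffice to reconstruct the data.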
In fact, the number of copies stored will reflect the importance of the data fragment (or dictionary entry) or data set, and the reliability of the resource(s) used to store it. Of course, important entries may be stored only on resources which offer a guarantee of service (e.g. 1 hour's notice before reclaiming storage space). Less important entries can then be stored on machines where (little or) no notice will be given, in which case the system can ensure sufficient redundancy exists (in the manner set out above) to meet the overall reliability required by users of the system. The invention provides a model for mapping such requirements onto the reliability (or otherwise) of the storage resources available for use (i.e. the likelihood that data compressed by the system will be decompressible at a later date using the data fragments stored on said storage resources).
Another problem that can arise with conventional grid computing systems is that if a computationally time-consuming job is submitted by a user to the broker, it may be difficult to secure sufficient idle time on one provider node to complete the processing task. For example, a task may be submitted that is expected to take 12 hours to complete, whereas the individual machines available to process the task may only be "unused" or idle for, say, 2 hours at a time. It is therefore very difficult to schedule such a long job, and even if an available resource starts to execute the job, it may have to be stopped suddenly so that the resource owner (who has priority) can use the resource.
In accordance with a third aspect of the present invention, there is provided a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, allocate said processing task for execution to a first provider node having idle computational capacity, monitor said provider node executing said task for continuing availability of said idle computational capacity, and interrupt and then move said processing task for execution to a second provider node having idle computational capacity if the idle computational capacity of the first provider node becomes unavailable.
The virtualisation of the platform on which the processing task will be executed guarantees isolation of the processing task from the physical machine on which it is being executed at any time, i.e. the task only "sees" the single virtual machine running it, irrespective of the physical hardware executing the task.
Beneficially, the processing task is periodically "check-pointed", i.e. a snapshot of the status of the processing task is periodically obtained by the provider node currently executing the task, and the state of the task at that time is transmitted to a backup server (which may be the same as the server heretofore referred to). In this case, when a processing task is interrupted because the idle computational capacity of the provider node becomes unavailable, the processing task is moved to and re-started on another provider node using the state of the task as last check-pointed. Thus, in a preferred embodiment, the server can use a check-point to restart a processing task on another provider node (not necessarily at the beginning, given that the check-point data represents the state of the processing task some way through its execution) if the first provider node becomes unavailable. In this way, only a small fraction of the total job's computer time is wasted by unpredictable resource availability.
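The check-pointing cycle described above might be sketched as follows; the task state dictionary, step counter and check-point interval are illustrative stand-ins for a real virtual machine snapshot, and the names are assumptions of this sketch.

```python
import pickle


class BackupServer:
    """Holds the most recent serialized state for each task."""

    def __init__(self):
        self.checkpoints = {}  # task_id -> pickled state

    def save(self, task_id: str, state: dict) -> None:
        self.checkpoints[task_id] = pickle.dumps(state)

    def latest(self, task_id: str) -> dict:
        return pickle.loads(self.checkpoints[task_id])


def run_on_node(task_id: str, state: dict, backup: BackupServer,
                steps: int, interval: int = 100) -> dict:
    """Execute a task on one provider node, check-pointing every `interval` steps."""
    for step in range(state["step"], steps):
        state["acc"] += step          # stand-in for real computational work
        state["step"] = step + 1
        if state["step"] % interval == 0:
            backup.save(task_id, state)   # snapshot shipped to the backup server
        # If the provider reclaims the node at this point, the broker simply
        # calls run_on_node(task_id, backup.latest(task_id), ...) on another node.
    return state
```

If the first node vanishes mid-run, at most `interval` steps of work are repeated on the second node, which is the "small fraction" of wasted time referred to above.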
In one exemplary embodiment, the server may be arranged and configured to commence execution of a processing task on a plurality of provider nodes, either simultaneously or with different (e.g. staggered) start times, so that multiple check points relating to the same processing task are available and the probability of a significant loss due to the sudden unavailability of a resource is minimised. One of the problems associated with moving a job around the distributed computing system during its execution due to changing availability of hardware is that every time a job is moved, all network transfers will be broken because the IP address of the physical machine to which a job is being moved will not necessarily be the same as that of the machine from which it is being transferred.
Thus, in accordance with a preferred embodiment of the third aspect of the invention, the system is arranged and configured such that all network communication between the grid job on said provider node (whose idle processor time is currently being used by the system) and the rest of the communications network (Internet) is tunnelled through a specified server. This specified server will be hereinafter referred to as a "network proxy server" and may be, but is not necessarily, the same machine as the above-mentioned server (hereinbefore referred to as the "broker"). This is preferably achieved by providing means in the provider nodes for transmitting network data using a predetermined transmission protocol (e.g. TCP on Port 443) to said specified ("network proxy") server, said server being arranged and configured to retransmit said data onward to the intended destination. Thus, irrespective of the provider node from which the data originates, the rest of the machines on the Internet with which said grid job communicates see only a constant IP address for the grid job.
The virtual machine defined in respect of a processing task will have a defined network connection (again isolated from the physical machine on which it is being executed), and all network connections made by the grid job are "tunnelled" through the network proxy. The main advantages of this include:
(a) all network activity appears to be coming from the same IP address for any one processing task, irrespective of which (or how many) physical machines actually perform the execution of the task (and in whatever order);
(b) the server has control (which cannot be overridden or bypassed by any grid job executing on said provider node) as to which external network resources can be accessed by each grid job. Any attempt by a grid job to connect to an unauthorised external resource can be prevented; and
(c) patterns of behaviour in network traffic can be monitored so that problems, such as Denial of Service attacks, can be identified.
Although not an essential aspect of this tunnelling feature, this does make it simple to ensure all network traffic between the grid job on the provider node and the server is transmitted via a single TCP port (443 in the above-mentioned example). If enabled, this additional feature makes it relatively simple to configure any firewalls between the provider nodes and the server to allow the network data traffic through.
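A minimal sketch of such tunnelling follows. The proxy address, the JSON framing and the single-read relay loop are illustrative simplifications (a real implementation would loop on partial reads and keep a reply path open); none of this is the protocol actually used by the system.

```python
import json
import socket
import struct

PROXY_ADDR = ("proxy.example.org", 443)  # hypothetical network-proxy address


def frame(job_id: str, host: str, port: int, payload: bytes) -> bytes:
    """Wrap outbound data with its real destination; only the proxy reads this."""
    header = json.dumps({"job": job_id, "host": host, "port": port}).encode()
    return struct.pack(">I", len(header)) + header + payload


def send_via_proxy(job_id: str, host: str, port: int, payload: bytes) -> None:
    """All traffic leaves the provider node on one TCP connection to port 443."""
    with socket.create_connection(PROXY_ADDR) as sock:
        sock.sendall(frame(job_id, host, port, payload))


def relay(conn: socket.socket) -> None:
    """On the proxy: unwrap a frame and retransmit it from the proxy's own
    address, so the destination sees one constant IP for the grid job."""
    header_len = struct.unpack(">I", conn.recv(4))[0]
    header = json.loads(conn.recv(header_len))
    payload = conn.recv(65536)
    with socket.create_connection((header["host"], header["port"])) as out:
        out.sendall(payload)
```

Because the destination appears only in the frame header, the proxy can also refuse frames addressed to unauthorised resources, which is the control described in point (b) above.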
Yet another issue that can arise within a grid computing system is that the virtual machine required to run a particular processing task may require multiple CPUs using MPI (Message Passing Interface) communications between the processes on each CPU. When the virtual machine is distributed to multiple remote resources (i.e. each CPU defined within the virtual machine is actually realised by respective remote machines), the required MPI communications cannot occur directly between each remote resource because of firewall restrictions.
In accordance with a fourth aspect of the present invention, there is provided a distributed computing system comprising one or more user nodes and a plurality of provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of hardware required to complete said processing task, said server being arranged and configured to realise said virtual machine by distributing said processing task among a plurality of provider nodes, and to facilitate interconnect communications between hardware of respective provider nodes across a dedicated virtual network by configuring tunnelling of said communications through a specified server. Once again, the specified server, hereinafter referred to as the "network proxy" server, may be the same machine as the earlier-mentioned server, but not necessarily.
Thus, because network traffic is tunnelled/routed to the network proxy server through the communications channel that is allowed between the server and the provider nodes, MPI communications can be performed on an internal virtual network (the server acts as an MPI switch in this case). Other multi-processor architectures are also envisaged using this type of virtual machine tunnelling technique, including shared-memory multi-processors and other virtual interconnect hardware implemented in the virtual machine clients, where the interconnect communications are tunnelled to the server.
One of the advantages of an exemplary embodiment of the fourth aspect of the invention, whereby the interconnect communication facilitated by the server is MPI communication between respective processes running on multiple central processors, is that the MPI code defined for execution of a multi-processor processing task does not need to be modified to enable the task to be executed within a distributed computing system.
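By way of illustration, the server-as-MPI-switch arrangement might be sketched as follows, with in-process queues standing in for the tunnelled channels between the server and the provider nodes; the class and method names are illustrative assumptions, not real MPI APIs.

```python
import queue
from collections import defaultdict


class MPISwitch:
    """The network-proxy server as a message switch between virtual CPUs
    (ranks) that cannot reach each other directly through their firewalls."""

    def __init__(self):
        self.inboxes = defaultdict(queue.Queue)  # rank -> pending messages

    def send(self, src: int, dst: int, payload: bytes) -> None:
        # Each provider node tunnels this call to the server over the
        # permitted channel; the server enqueues it for the destination rank.
        self.inboxes[dst].put((src, payload))

    def recv(self, rank: int):
        # The destination node collects its messages over the same tunnel.
        return self.inboxes[rank].get()


switch = MPISwitch()
switch.send(src=0, dst=1, payload=b"partial result")
print(switch.recv(rank=1))  # rank 1, on a different provider node, receives it
```

Since every rank talks only to the switch, the point-to-point semantics the MPI code expects are preserved without any modification to that code.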
Still further, in standard solutions to running a virtual machine on a remote resource, there is an initial delay before useful computation can be done, while the "image" of the virtual machine (disk, memory, etc.) is transferred to the remote resource. It will be appreciated, for example, that a virtual machine for executing a processing task is defined in software in terms of the hardware required, such as the type and number of CPUs, the type and size of memory (e.g. RAM), which memory is associated with each item of hardware, etc., and it is these details that form the above-mentioned "image" of the virtual machine. For some virtual machines, the size of the image could be large, such that transfer of the image is prohibitive with respect to the computation job being performed (i.e. it is expensive in file transfer resources and takes too long relative to total computation time).
In accordance with a fifth aspect of the present invention, there is provided a distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, and allocate said processing task for execution to one or more provider nodes having idle computational capacity, said server being further arranged to transfer data representing said resources defining said virtual machine to a provider node executing said processing task on a piecemeal basis as it is required for execution of said processing task.
As a result, the virtual machine clients can begin useful work without having the complete virtual machine images at the outset. Virtual machine image data is supplied to the client on demand (for example, in response to a request from a provider node on behalf of a grid job running thereon) over a network connection between the configuring machine (server) and the remote resource(s). In one exemplary embodiment, pre-emptive caching techniques may also be employed so that relevant elements of the virtual machine image data are supplied just in time for use in executing the processing task. Thus, the provider node(s) may be arranged and configured to pre-empt the next request for virtual machine image data based on a previous request. Alternatively, means may be provided to recognise a particular process so that virtual machine image data requirements in respect thereof can be pre-empted. The resources defining the virtual machine preferably include remote access to random access memory (RAM), i.e. the virtual machine is able to simulate computer hardware with greater RAM than is physically available on a particular provider node by using memory (RAM) on other machines, such access being facilitated by the server via the network.
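A minimal sketch of such piecemeal, demand-driven image transfer with one block of sequential read-ahead follows; the block size, class name and fetch callback are illustrative assumptions of the sketch.

```python
class LazyImage:
    """Fetch virtual-machine image blocks from the server only when the guest
    touches them, with simple sequential read-ahead standing in for the
    OS-style caching heuristics mentioned above."""

    BLOCK = 1 << 20  # 1 MiB blocks; an assumption

    def __init__(self, fetch_block):
        self.fetch_block = fetch_block  # callable: block index -> bytes (via server)
        self.cache = {}                 # blocks already transferred

    def read(self, offset: int, length: int) -> bytes:
        first = offset // self.BLOCK
        last = (offset + length - 1) // self.BLOCK
        for n in range(first, last + 2):   # +1 block of pre-emptive read-ahead
            if n not in self.cache:
                self.cache[n] = self.fetch_block(n)
        data = b"".join(self.cache[n] for n in range(first, last + 1))
        start = offset - first * self.BLOCK
        return data[start:start + length]
```

The guest can thus boot and begin computing as soon as the first blocks arrive, with the remainder of the image trickling in only as (and if) it is actually needed.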
Finally, it may be desirable to facilitate conditional supply of spare or idle resources and use thereof. For example, a resource owner may offer spare resources to the broker on condition that none of the owner's competitors may use the resource, although the general public would be acceptable; or a company may offer resources on condition that they are not used for a specific set of activities.
In accordance with a sixth aspect of the present invention, there is provided a distributed computing system, comprising one or more user nodes and a plurality of provider nodes each having spare resource(s) in the form of unused data storage resources and/or idle computational capacity for use by one or more user nodes, the system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, wherein said server maintains data representative of each provider node and any predefined conditions in respect thereof relating to acceptable or unacceptable uses thereof, and is arranged to receive a data storage or processing task from a user node and allocate said task for execution by one or more provider nodes having spare resources, said one or more provider nodes being selected to execute said task only if said task comprises an acceptable use thereof as specified by said respective conditions.
Thus, the server or broker maintains a list of machines and their acceptable uses, and ensures that no inappropriate jobs are sent to machines. The same idea applies to unused storage.
In one exemplary embodiment, the server may be arranged and configured to maintain data representative of different payment scales for different respective uses of one or more of the provider nodes, such that the owner of a provider node can receive differing payments depending on the type of resource being used or the type of job being run (or type of data being stored). When tasks are allocated to a provider node, the server determines the type of resource and/or use, determines the payment required to be made to the owner of the provider node to which a task is allocated, and ensures appropriate accounting is kept to pay the owner the price agreed.
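By way of illustration, the broker's matching of jobs against per-node conditions and payment scales might be sketched as follows; all field names and the cheapest-node pricing rule are illustrative assumptions of this sketch.

```python
class Allocator:
    """Match jobs to provider nodes honouring per-node use conditions
    and per-use payment scales."""

    def __init__(self, nodes: dict):
        # nodes: node_id -> {"banned": set of disallowed owners/activities,
        #                    "rates": {job_type: price per hour}}
        self.nodes = nodes

    def eligible(self, job: dict) -> list:
        """Only nodes whose conditions permit this job's owner and activity."""
        return [nid for nid, n in self.nodes.items()
                if job["owner"] not in n["banned"]
                and job["type"] not in n["banned"]
                and job["type"] in n["rates"]]

    def allocate(self, job: dict):
        candidates = self.eligible(job)
        if not candidates:
            raise RuntimeError("no provider node accepts this use")
        nid = min(candidates, key=lambda n: self.nodes[n]["rates"][job["type"]])
        owed = self.nodes[nid]["rates"][job["type"]] * job["hours"]
        return nid, owed  # the broker credits the node's owner at the agreed rate
```

The `eligible` filter enforces the acceptable-use conditions, while the returned `owed` figure feeds the accounting that pays each owner at their agreed price.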
In all cases, the provider nodes are preferably provided with a virtual machine platform (in the form of, for example, a daemon process) for controlling execution of a submitted processing task and monitoring various parameters, including idle time/spare storage, policies regarding, for example, times of availability of the respective resource(s), credits/payments accrued, etc.
These and other aspects of the present invention will be apparent from, and elucidated with reference to, the embodiment described herein. For simplicity, the central "broker" server, the "network proxy" server and the "backup" server referred to herein are considered to be the same physical machine and therefore these terms will be used interchangeably herein. However, as noted previously, there may be situations in which different machines are actually used to provide one or more of these server functions.
An embodiment of the present invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention;
Figure 2 is a schematic flow diagram of a data storage process employed in a computer grid system according to an exemplary embodiment of the present invention;
Figure 3 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention;
Figure 4 is a schematic block diagram of a computer grid system according to an exemplary embodiment of the present invention, illustrating data transfer therein; and
Figure 5 is a schematic diagram illustrating the principle of the Huffman coding lossless compression technique.
Referring to Figure 1 of the drawings, a computer grid system 10 according to an exemplary embodiment of the present invention comprises a plurality of grid computers 12 (or "providers") connected, via a data communications network 14, to a server 16 (hereinafter referred to as a "network proxy"). The network proxy 16 is connected, again via a data communications network 14, to a plurality of client computers 18. The network proxy 16 is arranged and configured to accept jobs from the client computer(s) 18, assign and communicate the job to one or more of the grid computers 12 and, where appropriate, communicate final results back to the respective client computers 18. It will be appreciated that a "job" may comprise a processing task or a data storage task.
When the network proxy 16 receives a job request from a client computer 18, it determines a number of parameters, including the computational power or quantity of storage resource required to perform the job, and the availability of such computational power or storage resource among the grid computers 12 within a specified service level, which service level determines the required level of reliability of resources to which a respective job is allocated. This will be discussed in more detail later.
The network proxy 16 may be arranged and configured to determine, on a dynamic basis and/or in response to a job request from a client computer 18, available resources among the grid computers 12 and the respective reliability thereof, and then allocate the job accordingly. Alternatively, or in addition, availability of a resource may be dictated by the owner of that resource. For example, the resource may be made continuously available between certain hours of the day, and not outside of those hours. A level of reliability of the provided resource is also set by the owner of that resource. For example, in the case of data storage, the length of notice required to be given to the network proxy or "broker" administrator before data stored on behalf of a client on a grid computer 12 can be deleted by the resource provider may be selected from, say, none, 24 hours, 3 days, 1 week, etc., and the reliability of a respective resource is set accordingly. Any payment to the resource providers in return for the provision of respective resources may be dependent on the level of reliability the provider is prepared to guarantee.
It will be appreciated that, for example, if a commercial company wishes to offer unused storage resources to the broker so that they can be offered for use by third parties, the company is likely to require some reassurance that it will not be held liable (or deemed complicit) if a third party stores illegal (e.g. pornographic) material on its resources. Referring additionally to Figure 2 of the drawings, when the network proxy 16 receives a data set from a client computer 18 for compression, sequences of data in the data set which are contained in the global dictionary are identified and replaced in the data set by unique identification numbers. The global dictionary is then optionally updated and/or expanded with new entries as appropriate, and each such entry (possibly encrypted and/or compressed in itself) may be stored on a physically different, arbitrarily selected resource within the grid. The location of each encrypted/compressed global dictionary entry, and the encryption key, are known only to the broker 16, such that no individual resource can be deemed to be storing a complete data set, or to have access to the stored data or to any dictionary entry representing a fragment of the data sets thus stored using the global compression system. Therefore, the resource provider cannot be liable for the data stored.
Critical dictionary entries may be stored on highly reliable resources or at data centres, whereas less critical entries may be stored on less reliable (and therefore less expensive) remote resources. In fact, to increase reliability of data storage when less reliable resources are being used, entries may be replicated and stored a number of times at different physical resources. Another option would be to adopt the RAID format. RAID, which is short for Redundant Array of Independent Disks, is a category of disk storage that employs two or more drives in combination for fault tolerance and performance, which enables reliability of data storage to be increased in respect of relatively unreliable disks, without necessarily requiring all data to be replicated. The RAID format provides so-called data striping (spreading out portions of each file, at block or byte level, across multiple disk drives), preferably with an additional parity disk, which is created by generating checksum data in respect of the data stored on the above-mentioned multiple disk drives. Thus, if one of the disks fails, the parity data can be used to create a replacement disk. For example, and by analogy to a standard RAID system in the context of this invention, digital data required to be stored could be split up into, say, 8 pieces, which are then encrypted and stored on respective remote unreliable resources (i.e. provider nodes 12). Checksum data in respect of the data stored on each of the 8 unreliable resources is also generated and stored on a 9th unreliable resource. If data is lost from any one of the 9 unreliable resources, then the remaining data stored on the other 8 resources can be used to recreate the lost data. Several different levels of the RAID format are known and can be employed, depending on the reliability of data storage specified by the respective client.
In the case where a provider offers unused storage space to the broker for use by a client, but specifies that, say, unannounced (i.e. zero notice) access will be required by the provider so that space might become unavailable in the event that their own immediate requirements change, then the provider could potentially delete a third party's data from their machine, corrupting the file and causing the data (dictionary entry) to be unrecoverable. Obviously, critical entries would be stored on more reliable resources (e.g. requiring at least 1 hour's notice before storage space can be released for use by the provider). However, less critical entries can either be stored multiple times on multiple respective relatively unreliable resources (wherein the number of copies of each piece of data will be determined by the importance of the data and/or the reliability of the resource used to store it) and/or the RAID format described above can be used.
Equally, in order to effectively manage a computer processing job with respect to less reliable remote resources, the processing job may be allocated for performance to several computational resources, possibly with staggered start times for processing, thereby increasing the probability that at least one of the resources will complete the job effectively and return a complete result set for transmission by the broker 16 to the respective client computer 18.
Referring to Figure 3 of the drawings, the network proxy 16 provided by this exemplary embodiment of the invention has as its principal components an allocation module 20 and a VM (Virtual Machine) configuration module 22, running on the server, and a daemon process 24 running on each grid computer 12. The daemon process comprises software loaded on each grid computer 12 that communicates with the VM module 22 and also monitors various parameters locally, including idle time, policies regarding time/resources, credits/payments, etc.
When a job request is received from a client machine 18, the VM module 22 creates a virtual machine to run the job, based on the type, quantity and reliability of resources required to perform the job. It is well known to a person skilled in the art that a virtual machine is an abstract specification of a computing device that can be implemented in different ways so that the requested job can be performed effectively. The job is run by the respective virtual machine, using available resources from one or more of the grid computers 12, according to availability and suitability within defined parameters. Thus, a job is received, a virtual machine to run the job is configured by the VM module 22, and the job is allocated to one or more available resources by the allocation module 20.
In respect of any particular processing job submitted for execution over the grid, the quantity of processing power required to complete the job will be specified and, based on the available resources, the virtual machine will then submit the job to a selected one or more grid computers 12 for execution. However, if the processing job is relatively large, it will take a relatively long time (e.g. 12 hours) to complete, whereas individual grid computers 12 may only be available for use for shorter periods of time. This makes it difficult to schedule a long job for execution in conventional arrangements and, even if a sufficiently available resource starts executing the job, it may have to be interrupted suddenly in the event that the machine's owner wishes to use the resource (owners always have priority). However, the present invention deals with this issue: because each job is run by a virtual machine, the job and its VM can be check-pointed (i.e. a "snapshot" of the current state of the job can be taken) periodically at the resource on which it is currently being executed, and the resulting data returned to the backup server/broker 16. Thus, if a resource suddenly ceases to be available, the job can be restarted by the broker 16 on another available resource using the state of the grid job and VM as at the last recorded check-point. In this way, only a small proportion of the computational power is wasted by unpredictable resource availability. In other words, the broker 16 is arranged to monitor the resource(s) on which any job is currently being performed, and re-allocate the job to another available resource in the event that a current resource becomes unavailable.
Furthermore, the facility may be provided whereby the broker 16 starts a job in multiple places (not necessarily concurrently), so that multiple checkpoints are available at any given time, and the probability of significant loss due to sudden unavailability of a resource is minimised. It will be appreciated that the daemon process 24 provided in respect of each grid computer 12 is arranged and configured to perform the periodic checkpointing and return the results to the broker 16 across the network 14.
The virtualisation functionality provided by the arrangement described above effectively guarantees isolation of each job from the machine on which it is being performed at any specified time. So-called "freezing" (check-pointing) of a job is effected automatically, either pre-emptively (because the broker knows that a resource is scheduled to become unavailable) or dynamically, in response to unpredicted notice that a resource is (or will soon be) unavailable.
Thus, as explained above, this exemplary embodiment of the invention provides a virtual software environment in which a client's job can be run (and isolated from the actual resource on which it is being executed at any one time), and the job may be moved around the network during execution as a consequence of the varying availability of the hardware resources, in the sense that jobs can only run on idle resources, and idle time on any given resource may not be sufficient to cover the entire execution time of a job. Under normal circumstances, this would result in all network transfers being broken because the IP address of the machine to which a job is being moved will not be the same as the one from which the job is being transferred.
In accordance with this exemplary embodiment of the present invention, this problem is avoided because all network traffic is "tunnelled" through the single, reliable network proxy 16. Referring to Figure 4 of the drawings, each grid machine 12 has a network card with a respective MAC address MAC01, MAC03, MAC05, ..., and each virtual machine (VM), run by the daemon process 24 provided on each grid machine, also has a unique respective IP address 02, 04, 06, .... All networking, however, is performed by the network proxy server 16 on behalf of all running jobs, so that when a job moves to a new machine (having a different IP address), the server 16 simply redirects network traffic accordingly. The "outside world" communicates with the grid job via the server and only sees (via the network 14) the IP address(es) that the server 16 has assigned to the grid job. Such IP address(es) for each grid job are thus assigned and controlled by the server; if the server moves the grid job to free a provider's resource, the server can ensure continuity of network connectivity by redirecting all subsequent network traffic to the grid job (now running on the new provider's resource).
Network data 26 generated at the daemon process 24 of a grid machine 12 is transmitted via a standard network protocol (e.g. TCP/IP) 28, over the network 14, to the server 16. The server 16 is arranged and configured to "unwrap" the data (so as to effectively remove the associated IP address of the virtual network card from which the data originates) and access the raw data. Thus, the server or "network proxy" opens data received from each virtual network card and then resends the data to the rest of the world via the network 14. It therefore appears that the data actually originates from address 30, which the server 16 has previously allocated to the grid job. The network proxy server handles network data at the lowest level, thereby being able to proxy all network protocols at the same time. Thus, each virtual machine will have a network connection via a virtual network card having a unique, respective IP address. However, because this network connection is effectively "tunnelled" through the server 16 to the rest of the world, i.e. the network proxy provides a so-called "bottleneck" for all network connections, the following principal advantages are attained:
1. Any network activity originating from the grid computers 12 always appears (to "the rest of the world") to be coming from a specified IP address (or possibly one of a predefined set of IP addresses) allocated by the server. Thus, even if a processing job has been executed by several different grid computers, the client computer 18 (and indeed every other computer on the internal and external network) sees only the single IP address for the grid job's VM.
2. The administrator of the server has complete control, which cannot be overridden or bypassed by the grid job, as to the types of network resources that can be connected to. Any attempt by a grid job to connect to unauthorised network resources can be prevented.
3. Patterns of behaviour can be monitored such that potential problems (e.g. denial of service attacks) can be identified accordingly. In this case, a record of network activity is maintained and the number and nature of network connections determined; lots of jobs accessing the same network address over and over again is a main characteristic of a Denial of Service attack. This is possible because of the virtual layer provided by the network proxy, i.e. because all jobs are routed through the virtual layer, all network activity can be seen.
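A minimal sketch of such monitoring at the proxy follows; the fixed threshold is an arbitrary illustration of how repeated access to one destination from many jobs might be flagged, and the names are assumptions of this sketch.

```python
from collections import Counter, defaultdict


class TrafficMonitor:
    """Every connection crosses the proxy, so per-destination statistics are
    cheap to keep; many jobs hammering one address is the DoS signature
    described above."""

    def __init__(self, threshold: int = 1000):
        self.hits = Counter()              # destination -> total connections
        self.jobs = defaultdict(set)       # destination -> distinct jobs seen
        self.threshold = threshold

    def record(self, job_id: str, dest: str) -> bool:
        """Return True if this destination now looks like it is under attack."""
        self.hits[dest] += 1
        self.jobs[dest].add(job_id)
        return self.hits[dest] > self.threshold and len(self.jobs[dest]) > 1
```

Flagged destinations can then be throttled or blocked at the proxy, exercising the control described in point 2 above.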
In some cases, a computational request or job may require the use of multiple CPUs using MPI (Message Passing Interface) communications between the processes on each CPU. When such a computational job is distributed to remote resources in a computer grid system, these MPI communications cannot, under normal circumstances, occur directly between each remote resource because of firewall restrictions. However, in this exemplary embodiment of the present invention, all network traffic is tunnelled/routed through the network proxy server 16 through the communications channel that is permitted between the server and the remote resources 12. The network proxy 16 effectively becomes the MPI switch in this case, in the sense that it directs data between the virtual CPUs (i.e. CPUs within the virtual machines 24) of the various remote resources. As will be well known to a person skilled in the art, an MPI switch is a high-speed interconnect switch, and the MPI code used to effect MPI communications is a very low-level code (almost at machine code). In a preferred embodiment, it is unnecessary for this MPI code (defined by the computational job) to be modified. Thus, referring back to Figure 3, the virtual machine (VM) module 22 provides software that defines the number and type of CPUs required to execute a computational request, the quantity and type of memory, and the relationship between the memory and respective CPUs. Thus, a virtual machine may be defined that comprises a number of CPUs, a certain quantity of various types of memory, and the respective relationships therebetween, and the computational request requiring these resources for execution thereof runs on this VM. In fact, known "cluster" software may be provided which can be used to cluster VMs configured to customer requirements, so as to imitate clustered hardware machines. Various multi-processor architectures are envisaged using the virtual machine tunnelling techniques proposed herein, including shared-memory multi-processors and other virtual interconnect hardware implemented in the virtual machine clients, where the interconnect communications are tunnelled to the server.
This network "tunnelling", as described above, comprises a 2-way point-to-point connection protocol (low level network access) which wraps network data up in a protective layer (to "hide" them from the firewall) at the server 16 and then sends them to the specified destination machine via a dedicated network.
As explained above, the underlying concept of at least one aspect of the present invention is the provision of a computer grid with a server or "network proxy" acting as a broker between users of computer resources and providers of spare/idle computer resources. A virtual machine module 22 is used to define, in software, a hardware specification designed to meet the requirements of a computational request. This virtual machine is then realised using the remote resources provided by the grid computers 12. Under normal circumstances, if a virtual machine is run on a remote resource, there is an initial delay before any useful computation can be done, while the "image" of the virtual machine (disk, memory, etc), as defined by the VM module 22, is transferred to the daemon process 24 of the selected remote resource (allocated by the allocation module 20). For some virtual machines, the size of this image could be relatively large, such that the transfer of the image is prohibitive with regard to the computation job being done (i.e. it is expensive in file transfer resources and takes too long with respect to total computation time).
A solution to this problem is provided in accordance with this exemplary embodiment of the present invention by allowing the grid computers 12 allocated to realise a virtual machine to commence useful work without having the complete virtual machine images. Instead, virtual machine image data is supplied to the allocated remote resource 12 over the network 14 by the server 16 in an incremental fashion. Thus, whatever is required by the remote resource 12 allocated to execute a computational job may be transferred thereto in response to a request, or this may occur pre-emptively (for example, using algorithms currently employed in conventional Operating System (OS) virtual memory or hard disk caching systems, which send data in anticipation of the next request based on the previous data requested). Alternatively, or in addition, the system (server 16 or daemon process 24, or a combination thereof) may be enabled to record how a specific class of grid job typically behaves, and thus be able to pre-empt the requirements of the remote resource from knowledge of previous times the grid job was run.
It is well known to provide virtual memory in respect of operating systems on desktop computers. With virtual memory, when available RAM has been filled, the computer can look at RAM for areas that have not been used recently, and copy them onto the local hard disk. This frees up space in RAM to load another application, say. In this exemplary embodiment of the invention, it is proposed to provide virtual RAM across the network 14, as required (in response to a request from a remote resource 12 or pre-emptively as described above).
This may be achieved by effectively allowing the remote resource (via communication between the VM module 22 and the daemon process 24 running the virtual machine locally) to provide a virtual machine 24 that reports more RAM than is actually available to grid jobs running on the remote resource 12. This is achieved at the hardware level, such that the remote resource 12 may provide a virtual machine 24 which claims to have any amount of RAM required, but access to any required RAM in excess of the actual amount available on the resource 12 is provided by the server 16, or by another remote resource 12 via the server 16, across the network 14.
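By way of illustration, such network-backed virtual RAM might be sketched as follows, with a dictionary standing in for the server-side page store and simple least-recently-used eviction; the page size and all names are illustrative assumptions of this sketch.

```python
class RemotePagedRAM:
    """Present more RAM to the guest than the provider node physically has:
    least-recently-used pages are evicted to the server over the network
    and fetched back on access."""

    PAGE = 4096

    def __init__(self, local_pages: int, server_store: dict):
        self.limit = local_pages   # pages that fit in the node's physical RAM
        self.local = {}            # page number -> bytes, in recency order
        self.remote = server_store # pages held by the server (or another node)

    def _touch(self, n: int) -> bytes:
        if n not in self.local:
            self.local[n] = self.remote.pop(n, bytes(self.PAGE))  # fetch on demand
        page = self.local.pop(n)
        self.local[n] = page       # re-insert to mark as most recently used
        while len(self.local) > self.limit:
            old = next(iter(self.local))          # least recently used page
            self.remote[old] = self.local.pop(old)  # evict over the network
        return page

    def read(self, n: int) -> bytes:
        return self._touch(n)

    def write(self, n: int, data: bytes) -> None:
        self._touch(n)
        self.local[n] = data[:self.PAGE].ljust(self.PAGE, b"\0")
```

From the grid job's point of view, the virtual machine simply has the larger amount of RAM; the paging across the network is invisible to it, exactly as OS virtual memory is invisible to a desktop application.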
Finally, it is envisaged that a resource owner may offer spare/idle resources to the broker 16 on condition that, for example, none of the owner's competitors can use the resource, although use by the general public would be acceptable, or the offer of use of the resource may be conditional in that it cannot be used for one or more specific activities. In order to address this issue, it is proposed that the broker maintains a list of machines (available resources) and their acceptable uses, and the allocation module 20 ensures that no inappropriate jobs or data (for storage) are sent to respective selected machines. As an extension of this concept, provision may be made for the owner of a resource to receive different payment, depending on the resource being used and/or the type of job being run (or type of data being stored). In this case, the broker is arranged and configured to provide appropriate accounting facilities so that the owner of a resource is recompensed correctly at the price structure agreed.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The words "comprising" and "comprises", and the like, do not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. The singular reference of an element does not exclude the plural reference of such elements and vice versa. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

CLAIMS:
1. A method of compressing a data set comprising a plurality of data files, comprising identifying recurring sections of data between said plurality of data files, storing a single copy of each identified recurring section of data in a global dictionary and generating, in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
2. A method according to claim 1, wherein the global dictionary is stored separately from the index streams.
3. A method according to claim 1 or claim 2, wherein said global dictionary is made available for use across a communications network.
4. A method according to any one of claims 1 to 3, wherein when a data file is required to be compressed, the global dictionary is updated so as to provide an optimal resource built from statistics of all data files stored by any user using a system employing the method.
5. A system for compressing a data set comprising a plurality of data files, comprising means for identifying recurring sections of data between said plurality of data files, means for storing a single copy of each identified recurring section of data in a global dictionary and means for generating in respect of each data file, an index stream including, instead of an identified recurring section of data, a reference to the respective recurring section of data in said global dictionary.
6. A global dictionary for use in the method according to any one of claims 1 to 4, said global dictionary comprising a plurality of fragments of data, representative of recurring sections of data identified between said data files, each fragment being identifiable with a unique identification code.
7. A distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having unused data storage resources for use by one or more of said user nodes, said system being arranged to receive a data set for storage, said data set comprising a plurality of data files, means for compressing said data set using the method according to any one of claims 1 to 4, so as to generate or update a global dictionary and a plurality of respective index streams, and means for storing said global dictionary on one or more of said provider nodes.
8. A system according to claim 7, further comprising a server for facilitating remote access by a user node to said global dictionary to enable an index file to be decompressed.
9. A distributed computing system comprising one or more user nodes and a server, said server being arranged and configured to receive a data set for compression, access a global dictionary as defined above, identify, within said data set, fragments of data which are in said global dictionary and replace said identified data fragments in said data set with the respective unique identification codes so as to generate a compressed data set, said server being further arranged and configured to forward said compressed data on to one or more remote nodes for storage or processing.
10. A system according to claim 9, wherein said server is further arranged and configured to receive a compressed data set, look up said identification codes in said global dictionary so as to identify the respective data fragments corresponding thereto, replace said identification codes in said compressed data with said identified corresponding data fragments so as to reconstruct said data set, and forward said reconstructed data set on to one or more remote nodes for storage or processing.
11. A system according to claim 9 or claim 10, wherein means are provided, whereby, when a data set is received by the server for compression, additional data fragments identified within said data set are assigned a respective identification code and entered into said global dictionary so as to extend or update it.
12. A system according to any one of claims 1 to 11, comprising means for transmitting one or more data fragments of said global dictionary to one or more remote provider nodes for storage, and means for recording the identity of a provider node to which said one or more data fragments have been transmitted for storage so as to facilitate subsequent access thereto.
13. A system according to any one of claims 9 to 12, wherein said data fragments of the global dictionary may be compressed and/or encrypted prior to storage thereof.
14. A system according to any one of claims 9 to 13, wherein the provider nodes selected to store one or more data fragments of said global dictionary may be selected arbitrarily from a set of resources available for use by said server.
15. A system according to claim 14, wherein the set of available resources is selected depending on the level of reliability of data storage offered thereby.
16. A system according to claim 14 or claim 15, wherein the service level defining the level of reliability of data storage is selected by the originator of said data set.
17. A system according to any one of claims 9 to 16, wherein one or more of said data fragments of said global dictionary are replicated and each replicated data fragment is transmitted to different respective provider nodes for storage thereby.
18. A system according to any one of claims 9 to 17, wherein checksum data is generated in respect of said data fragments of said global dictionary stored on separate respective resources and that checksum data is stored on another remote resource.
19. A system according to claim 18, wherein said checksum data is encrypted and/or compressed.
20. A system according to any one of claims 14 to 16, comprising means for mapping user requirements defined by said selected service level onto the reliability of the storage resources available for use.
21. A distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, allocate said processing task for execution to a first provider node having idle computational capacity, monitor said provider node executing said task for continuing availability of said idle computational capacity, and interrupt and then move said processing task for execution to a second provider node having idle computational capacity if the idle computational capacity of the first provider node becomes unavailable.
22. A system according to claim 21, wherein said processing task is periodically "check pointed" during execution thereof by the provider node currently executing the task, and the state of the task at that time is transmitted to a backup server.
23. A system according to claim 22, wherein a check-point is used by said server to restart a processing task on another provider node if the idle computational capacity of the first provider node becomes unavailable.
24. A system according to any one of claims 21 to 23, wherein the server is arranged and configured to commence execution of a processing task on a plurality of provider nodes.
25. A system according to claim 24, wherein execution of a processing task is commenced on a plurality of provider nodes, either simultaneously or with different start times.
26. A system according to any one of claims 21 to 25, arranged and configured such that all communication between said at least one user node and said provider nodes is tunnelled through a network proxy server.
27. A system according to claim 26, comprising means in the provider nodes for transmitting network data using a predetermined transmission protocol to said server, said server being arranged and configured to unwrap said network data before retransmitting said data onward to its intended destination.
28. A distributed computing system comprising one or more user nodes and a plurality of provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of hardware required to complete said processing task, said server being arranged and configured to realise said virtual machine by distributing said processing task among a plurality of provider nodes, and to facilitate interconnect communications between hardware of respective provider nodes across a dedicated virtual network by tunnelling said communications through a specified server.
29. A distributed computing system comprising one or more user nodes and a plurality of remote provider nodes each having idle computational capacity for use by said one or more user nodes, said system further comprising a server for facilitating two-way communication between said one or more user nodes and said provider nodes, said server being arranged and configured to receive from a user node a processing task for execution by said distributed computing system, define a virtual machine for execution of said processing task in terms of resources required to complete said processing task, and allocate said processing task for execution to one or more provider nodes having idle computational capacity, said server being further arranged to transfer data representing said resources defining said virtual machine to a provider node executing said processing task on a piecemeal basis as it is required for execution of said processing task.
30. A system according to claim 29, wherein said data representing resources defining said virtual machine is transferred to said provider node in an incremental fashion over a network connection between the server and said provider nodes.
31. A system according to claim 30, wherein said data representing resources defining said virtual machine is transferred to said provider node on demand over said network connection.
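One plausible reading of the piecemeal transfer of claims 29 to 31 is demand paging of the virtual machine image: the provider node starts with none of the image locally and pulls fixed-size blocks from the server only when the task first touches them. The sketch below assumes this reading; `fetch_block` stands in for a network call and all names are illustrative.

```python
BLOCK_SIZE = 4096


class DemandPagedImage:
    """A VM image materialised block-by-block, on demand (claims 30-31)."""

    def __init__(self, fetch_block):
        self._fetch = fetch_block     # callable(block_no) -> bytes, over the network
        self._cache = {}              # blocks transferred so far

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block_no in range(first, last + 1):
            if block_no not in self._cache:       # first touch: pull from server
                self._cache[block_no] = self._fetch(block_no)
            out += self._cache[block_no]
        start = offset % BLOCK_SIZE
        return bytes(out[start:start + length])


# Usage: a fake server serving blocks of a zeroed image.
image = DemandPagedImage(lambda n: bytes(BLOCK_SIZE))
data = image.read(8100, 300)          # touches only blocks 1 and 2
print(len(data), "bytes read;", len(image._cache), "blocks transferred")
```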
32. A system according to any one of claims 29 to 31, wherein pre-emptive caching techniques are employed so that data representing resources defining said virtual machine is transferred to said provider node just in time for use in executing the processing task.
33. A system according to any one of claims 29 to 32, wherein means are provided to recognise a particular processing task such that requirements for data representing resources defining said processing task within said virtual machine in respect thereof can be pre-empted.
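The pre-emptive caching of claims 32 and 33 could, for instance, work by recording the block-access order of a recognised task on its first run and replaying that trace on later runs, so each block is fetched a few positions ahead of demand. The sketch below is one such scheme under that assumption; all names are illustrative.

```python
from collections import defaultdict


class PrefetchingServer:
    def __init__(self):
        self.traces = defaultdict(list)   # task signature -> ordered block list

    def record(self, task_sig, block_no):
        """First execution of a recognised task: learn its access pattern."""
        self.traces[task_sig].append(block_no)

    def prefetch_plan(self, task_sig, lookahead=2):
        """Yield blocks a few positions ahead of the current access, so each
        arrives 'just in time' (claim 32) for a recognised task (claim 33)."""
        trace = self.traces.get(task_sig, [])
        for i, _ in enumerate(trace):
            yield trace[min(i + lookahead, len(trace) - 1)]


server = PrefetchingServer()
for block in [0, 1, 5, 6, 2, 7]:              # first run: record the pattern
    server.record("task-A", block)
print(list(server.prefetch_plan("task-A")))   # later runs fetch ahead of demand
```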
34. A system according to any one of claims 29 to 33, wherein said resources defining said virtual machine include remote access to random access memory (RAM).
PCT/GB2006/002124 2005-06-10 2006-06-09 Compressing data for distributed storage across several computers in a computional grid and distributing tasks between grid nodes WO2006131753A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0511797.3 2005-06-10
GBGB0511797.3A GB0511797D0 (en) 2005-06-10 2005-06-10 Computer grid

Publications (2)

Publication Number Publication Date
WO2006131753A2 true WO2006131753A2 (en) 2006-12-14
WO2006131753A3 WO2006131753A3 (en) 2007-04-19

Family

ID=34855303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2006/002124 WO2006131753A2 (en) 2005-06-10 2006-06-09 Compressing data for distributed storage across several computers in a computional grid and distributing tasks between grid nodes

Country Status (2)

Country Link
GB (1) GB0511797D0 (en)
WO (1) WO2006131753A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112583671A (en) * 2020-12-14 2021-03-30 上海英方软件股份有限公司 Method and system for practicing virtual machine through proxy gateway

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6247015B1 (en) * 1998-09-08 2001-06-12 International Business Machines Corporation Method and system for compressing files utilizing a dictionary array
US20050030208A1 (en) * 2001-02-13 2005-02-10 Mosaid Technologies, Inc. Method and apparatus for adaptive data compression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JEANNOT, E. et al., "Adaptive online data compression", Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Piscataway, NJ, USA, 23-26 July 2002, pages 379-388, XP010601195, ISBN 0-7695-1686-6 *
MANO, Y. et al., "A data compression scheme which achieves good compression for practical use", Proceedings of the Fifteenth Annual International Computer Software and Applications Conference (COMPSAC '91), Tokyo, Japan, 11-13 September 1991, IEEE Computer Society, Los Alamitos, CA, USA, pages 442-449, XP010023403, ISBN 0-8186-2152-4 *

Also Published As

Publication number Publication date
WO2006131753A3 (en) 2007-04-19
GB0511797D0 (en) 2005-07-20

Similar Documents

Publication Publication Date Title
US10014881B2 (en) Multiple erasure codes for distributed storage
US11099891B2 (en) Scheduling requests based on resource information
US8055937B2 (en) High availability and disaster recovery using virtualization
US8171101B2 (en) Smart access to a dispersed data storage network
US20130275390A1 (en) Erasure coded storage aggregation in data centers
US20190347046A1 (en) Dynamic retention policies and optional deletes
US20160357451A1 (en) Storage system having node with light weight container
JP7429086B2 (en) Secure data storage based on distributed obfuscation
US10949129B2 (en) Adjusting data ingest based on compaction rate in a dispersed storage network
KR20060123310A (en) Apparatus, system, and method for grid based data storage
US11169973B2 (en) Atomically tracking transactions for auditability and security
US7328303B1 (en) Method and system for remote execution of code on a distributed data storage system
US20190065766A1 (en) Controlling access when processing intents in a dispersed storage network
US9866595B2 (en) Policy based slice deletion in a dispersed storage network
US10831378B2 (en) Optimizing data access in a DSN memory for wear leveling
US10834194B2 (en) Batching updates in a dispersed storage network
WO2006131753A2 (en) Compressing data for distributed storage across several computers in a computional grid and distributing tasks between grid nodes
US10536328B2 (en) Methods and systems that implement an application-level tunnel through which long syslog messages are transmitted
US11321015B2 (en) Aggressive intent write request cancellation
US10884648B2 (en) Temporary relocation of data within local storage of a dispersed storage network
US11237829B2 (en) Performing composable transactions in a dispersed storage network
TW201643755A (en) Storage system having node with light weight container
WO2006121448A1 (en) A variable architecture distributed data processing and management system
CN114265551A (en) Data processing method in storage cluster, storage node and equipment
Chinagulm et al. Development of an Enhanced Cloud Deployment Model for Resilient Internet Disaster Recovery and Management

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in: Ref country code: DE
WWW Wipo information: withdrawn in national office; Country of ref document: DE
122 Ep: pct application non-entry in european phase; Ref document number: 06744173; Country of ref document: EP; Kind code of ref document: A2