GB2443442A

GB2443442A - Automated redundancy control and recovery mechanisms in a clustered computing system

Info

Publication number: GB2443442A
Application number: GB0622010A
Authority: GB
Inventors: Jonathan Morgan; Nicholas Pearce-Tomenius
Original assignee: OBJECT MATRIX Ltd
Current assignee: OBJECT MATRIX Ltd
Priority date: 2006-11-04
Filing date: 2006-11-04
Publication date: 2008-05-07
Also published as: GB0622010D0

Abstract

The invention relates to method for providing data redundancy across a clustered computing system according to a replication policy. Each cluster of the system comprises at least one computer node. Cluster control software running on each cluster can control the number of locations where data objects are stored such that a required level of data redundancy is reached and maintained in both normal operation and failover scenarios without undue storage overhead. The method comprises the steps of: selecting a replication scheme that describes the level of data redundancy required in the clustered system; applying the scheme to data stored on a computer node within a cluster of the system; if required by the scheme, replicating the data onto an additional computer node within the clustered system; and deleting any data replications surplus to that required by the scheme. Steps may be repeated until the level of data redundancy required by the scheme is attained.

Description

Automated Data Redundancy Within a Cluster Set This invention relates

to a method for providing data redundancy across a cluster set, where the cluster set comprises a plurality of clusters connected for communication with each other and wherein each cluster comprises at least one computer node, the or each node of each cluster being interconnected for communication therebetween, and to a cluster set capable of performing such a method.

A cluster is a set of one or more computers or computer nodes that work together to run a common set of applications and appear as a single system to a client or application. The computer nodes may be connected either wirelessly, or physically by cables, and controlled by suitable cluster software. This arrangement allows the nodes to be geographically separate. The clustered organisation permits "failover" and "load balancing", which is not possible with a stand-alone computer, as will be described in more detail below. A cluster set is a connected set of such clusters.

Clusters may be designed to act as stores of "data objects", which are defined as objects consisting of any or all of these parts: data, related metadata, and related policies. For example, data objects'may be created to describe data elements such as emails, backup binaries, completed documents or any other static data type.

Data objects that encapsulate data useful to an organisation such as backup data, or reference information, can be stored in a cluster of computer nodes that has access to a set of storage devices. The cluster may be designed so as to avoid a single point-of-failure, enabling high availability of the data objects. Data objects can be stored on more than one cluster storage location (for example by splitting the object using algorithms such as RAID-5 or by mirroring the data on to several storage locations) to achieve a degree of parallelism and failure recovery, and providing more availability. Multiple nodes in a cluster remain in constant communication, which enables the functionality of the cluster to be determined.

Should one or more of the nodes or storage devices in a cluster become unavailable, for example as a result of failure or maintenance, another node or storage device is selected by the cluster software to take over the task of providing the data object. This process is known as failover. If the system has high availability, users who access the service would be able to continue to access the service, and would be unaware that the service is now being provided by another node or storage location. The advantages described of clustering computers make it a highly desirable platform to store data to.

A known cluster set system is shown in Fig. I. A cluster set 105 includes two clusters 110, 150. Cluster 110 contains a set of one or more computer nodes 115, here two such nodes are shown. A client 149 can communicate with the cluster 110 via a communications network that talks with interfaces published by. the node's software modules 120. Each node 115 either comprises, or has access to, one or more storage devices 135. Each storage device 135 can be accessed by one or more nodes 115. As shown, each node 11 5 accesses exactly one storage device 135 and each storage device 135 is accessed by only one node 115. Each node 115 comprises memory 125, a processor unit 130, and an input/output unit 140.

Cluster 1 50 meanwhile comprises nodes 155, with each node 155 comprising memory 165, a processor unit 170, and an input/output unit 180. Each node 155 either comprises, or has access to, a storage device 175.

Each processor unit 1 30, 1 70 comprises several elements such as a data queue, arithmetic logic unit, memory read register, memory write register, etc. Cluster software 121 runs on each node 115 within the cluster 110, so that the nodes in the cluster 110 can recognise each other and work together as a team, providing the set of data object storage services supported by other software modules 122, and 123. Likewise, cluster software 161 runs on each node 155 within cluster 150, so that the nodes in cluster 150 can recognise each other and work together as a team, providing the set of data object services supported by other software modules 162 and 163.

Clusters 115 and 155 can talk with one another via a communication link, sharing or transferring data objects and configuration information. Typically, clusters 115 and 155 are geographically separated, whereas the nodes within a cluster are directly connected by cabling. Other clusters may also be in the cluster set 105, aware of each other through automated detection or manual configuration, adding their computation, storage and software module resources to the set.

A typical failover sequence would take place in a typical cluster such as cluster 110.

Suppose a hardware failure occurs in a node 115 such that it is unable to communicate with client 149 via its input/output 140 or with other nodes in the cluster 110 or with other clusters in the cluster set 105. Since client 149 is unable to communicate with the node to retrieve, store, or perform functionality on the data objects that the failed node 110 has in its storage device 135, the cluster 110 provides another node 115 where a mirrored copy of that data object resides. In some instances of a cluster, the cluster is capable of returning the minimum set of information the client 149 needs to build the data object from a number of storage locations 135 accessed by a number of nodes 115. In other instances of a cluster it is the responsibility of the cluster to construct a data object in a temporary location, from a number f storage locations accessed by a number of nodes 115, before returning the data object to the client 149. The cluster can be capable of building a data object from a number of locations containing fragments of that data object when it has previously employed a data redundancy algorithm to store the data.

In the system shown in Fig. I, replication of data objects is carried out by a replication software module 122 held within software module 120 and communicating with a replication software module 162 held within software module 160. Data objects that are to be written to storage device 135 and are requested to be replicated either due to an automated procedure or due to a specific client request, are listed and transferred to cluster 150. A compression module may be used to reduce the amount of data transferred to cluster 150. The replication module may support both synchronous and asynchronous replication types.

The clusters 115 and 150 can replicate data objects over a direct or long distance communication link. However, in the system described thus far, although there is a replication link, there is no automated functionality that maintains the required levels of data redundancy with a set of clusters, nor that provides automated recovery of required redundancy levels in the case of a failover scenario.

A cluster system which shows various of these features is known from In computer storage, "data redundancy" (also referred to as data reliability) is a property of some disk arrays which provides fault tolerance such that if some disks fail, all or part of the data stored by the array is not lost. The cost of providing this feature is most typically increased disk space; implementations require either a duplication of the entire data set or an error-correcting code to be stored on the array. When data is stored in a cluster, on multiple nodes, the concept of data redundancy can be extended such that data is not just stored on multiple disks but instead is stored on multiple nodes, either mirrored (with an exact copy) or striped (using error-correcting code).

Currently, there exist systems that replicate or migrate data from one cluster to another but currently clustering software is not capable of guaranteeing that data is consistently highly available whilst being able to migrate data to the appropriate location(s) within the network of clusters.

It is an object of the present invention to provide a system that allows the overhead of storing extra copies of the data at a location without compromising data availability or data redundancy. In such a system, at no point will the data redundancy defined by the policy of a * -4-iata object (or set of data objects) be compromised and furthermore neither will the data availability be compromised. Furthermore, recovery from failure scenarios (e.g. a node going down) can be automated in such a manner that minimises storage redundancy overhead.

With the present invention, a data redundancy level is defined for a set of objects to indicate how data should be stored, e.g., mirrored three times.

It is a further object of the present invention to provide a method for performing automated redundancy control of data objects and recovery mechanisms between two or more clusters of nodes. The cluster control software running on each cluster can control the number of locations where data objects are stored in such a manner that the required level of data redundancy is reached and maintained both in normal operation and in failover scenarios, without undue storage overhead and with flexibility to be independent of the hardware types used in each cluster.

in accordance with a first aspect of the present invention, there is provided a method for providing data redundancy across a cluster set, the cluster set comprising a plurality of clusters connected for communication with each other and wherein each cluster comprises at least one computer node, the or each node of each cluster being interconnected for communication therebetween, the method comprising the steps of: a) selecting a replication scheme that describes the level of data redundancy required in the cluster set, b) applying the scheme to data stored on a computer node within a cluster of the cluster set, c) if required by the scheme, replicating the data onto an additional computer node within the cluster set, d) deleting any data replications surplus to that required by the scheme, e) repeating steps c) and d) until the level of data redundancy required by the scheme is attained.

Steps a) to e) may automatically implemented by controlling software.

In this case, after step a), the replication scheme may be passed to the controlling software by a client.

referably, the replication scheme is included with the data, for example as metadata or policy information.

Advantageously, in step b), the replication scheme is applied to a set of data objects according to their physical location or data type.

Preferably, the method may include the step of providing a default replication scheme.

The scheme may define transient, minimum and / or final levels of data redundancy on individual nodes, clusters and / or the cluster set. This provides a full mechanism for ensuring the required redundancy.

The method may further comprise the steps of: f) detecting changes to the functionality of the cluster set, g) if changes are detected, repeating steps b) to e).

This feature ensures that failure of a node for example will not lead to an insufficient data redundancy level.

Step a) may include checking the scheme, and if the scheme is found to be incomplete or incorrect, carrying out the remaining steps using default values of data redundancy. This feature addresses the problem of errors in the replication scheme.

The method may further comprise the step of dynamic reallocation of the data onto at least one selected computer node to provide required redundancy levels. This feature provides additional flexibility within the system.

The method may include the step of storing a data tag on a computer node of a cluster, the tag comprising information relating to the identity and location of the data. The provision of such a data tag enables the node to request and obtain a replication of the data from a different node. A node on which the tag is stored may be responsible for ensuring the data exists on the cluster.

In accordance with a second aspect of the present invention there is provided a computer program for enabling the method, when implemented on a cluster set.

accordance with a third aspect of the present invention there is provided a computer-readable medium carrying the program.

In accordance with a fourth aspect of the present invention there is provided a cluster set when programmed with the program.

Tn accordance with a fifth aspect of the present invention there is provided a cluster set comprising a plurality of clusters connected for communication with each other, each cluster comprising at least one computer node, with the or each node of each cluster being interconnected for communication therebetween and comprising means for storing data, wherein the cluster set is programmed with a replication scheme that describes the level of data redundancy required in the cluster set, the cluster set being provided with means for replicating data stored on a computer node for storage onto a different node in accordance with the scheme and deleting any data replications surplus to that required by the scheme.

The invention will now be described with reference to the accompanying figures, in which: Fig. I diagrammatically shows a known cluster set; Fig. 2a schematically shows a data object; Fig. 2b shows a redundancy record information set in accordance with an embodiment of the present invention; Fig. 2c shows a policy setting priority chart in accordance with the embodiment; Figs. 3a to c schematically show stages for achieving automated redundancy levels in accordance with the embodiment; and Figs. 4a to c schematically show stages for providing automated failover and data recovery in accordance with the embodiment.

Figs. 2a to c show how data object redundancy levels can be decided. Fig. 2a schematically shows a structure of a data object 210 which can be stored in accordance with the present invention. The data object 210 shown comprises data 211, policy information 212 and metadata 213. A data object 210 need not contain all three of those elements but must contain at least one. The present invention requires a replication scheme to be established, which sets out the required redundancy levels, an example of this scheme or "redundancy required information set" (RRIS) 220 being shown in Fig. 2b. The RRIS 220 may optionally e included within the policy information 212, but in any case the RRIS can always be derived by examining at least one of the policy 212, the physical or logical location of the data object, the metadata 213 of the data object, the data 211 of the object or alternatively by using a default RRIS.

Fig. 2b shows an RRIS 220 in accordance with an embodiment of the present invention, which includes information about the level of data redundancy required at each cluster location. The "primary" location is the first location where the data is stored. This RRIS could be described in a number of different ways: a scheme could be devised wherein the primary cluster is an actual location or cluster type, for example a cluster type could be "MP3Archive" and if the cluster storing the data object exhibits the property "MP3Archive" then it would apply that level of data redundancy. Information from the RRIS 220, combined with information from the cluster, such as cluster 110 as shown in Fig. 1, and information from the data object 210 is capable of describing the level of data redundancy to be applied upon a node, the local cluster and I or upon the cluster set as a whole. A cluster can be identified by a "dynamic quality", by a "fixed attribute", or by both dynamic qualities and fixed attributes. As shown in Fig. 2b, the "Primary" and "2'" locations are examples of dynamic qualities. A primary cluster is the first cluster where an individual client stores data, where a "2'" is any other cluster that the data stored is replicated to. A cluster that has been defined within its configuration as an "Arc" cluster is an example of a fixed attribute. Here "Arc" indicates for example that that particular cluster is responsible for a secure archive of data. The RRIS 220 also defines a minimum level of data redundancy that must be held by the cluster set as a whole. As shown, it is defined that the first location where the data object is stored must contain at least one copy of the data, and that the second location where the data object is stored must contain at least one copy of the data, but that any cluster exhibiting the "Arc" attribute must store at least two copies of the data. Furthermore, the cluster set as a whole must store at least two copies of the data.

Fig. 2c shows a "Policy selecting priority chart" 230 according to this embodiment, illustrating the choice of RRIS taken from various sources in increasing order of preference.

Here the preference is to use an RRIS indicated in the "data object policy" setting 234. If that is unavailable, an RRIS discovered by examining the data objects metadata 233 is used. As an example, "MP3" may be a data object metadata value. If that is also unavailable, an RRIS discovered by examining the data object's user group 232 is used. In this embodiment the user group is also stored in the data object's metadata. If all of those are unavailable, the default RRIS 231 for the cluster set 105 is used.

In general, configuration information can be set either by default, or by a user, into the set of clusters to describe the type of redundancy and type of automated failover required within each cluster and upon the set of clusters as a whole. Furthermore, the configuration information can be set to apply to a set of related data objects, thus it could be the case that a user sets no redundancy to occur on one set of data, whilst another set of data would be required to be at another level of redundancy. Configuration information for a set of data objects is set upon one node from where it is distributed, as described below, to the other nodes within that cluster as well as to other connected clusters that hold or may hold that set of data objects.

If manually entered configuration information is incorrect, then the configuration information is rejected by the cluster control software and default configuration information is used instead.

Figs. 3a to c schematically illustrate an embodiment wherein a cluster set containing two clusters, 310 and 350, is requested by a client 349 to store a data object 399, using the RRIS as shown in Fig. 2b. Figs. 3a to c respectively show three strategic life stages that the data object 399 goes through as it is stored to the cluster set and as the cluster set achieves the required automated cluster set data redundancy level settings. However, intermediate stages have not been shown for clarity of the invention, but can easily be deduced by someone of ordinary skill in the art.

Looking at Fig. 3a initially, cluster 310 contains a set of four identical nodes 315 to 318 and cluster 350 contains a set of four identical nodes 355 to 358. The nodes in the cluster set described contain the components as described with reference to Fig. 1. The clusters 310, 350 are connected via an asynchronous internet link. The four nodes 31531 8 are cabled together and can synchronously perform data storage between any two of those nodes.

Similarly, the four nodes 355-358 are cabled together and can synchronously perform data storage between any two of those nodes.

In Fig. 3a, the client 349 has elected to store a data object 399 to the cluster set 305 via access point cluster 310. The client data transfers data object 399 to the cluster set by transferring the data object 399 to node 315 in cluster 310, i.e. by dynamic reallocation of the data object. Node 315 ensures that the data is synchronously stored to a second node 316 before informing the client 349 that the data object 399 is stored. In such a way, the RRIS requirement shown for this embodiment in Fig. 2b, i.e. that the cluster set must keep at least two mirrored copies of a data object, has been fulfilled. * -9-

In Fig. 3b, the cluster 310 asynchronously copies the data object to cluster 350. The cluster 350 stores the data to two nodes 355, 356 to ensure that the RRIS requirement shown in Fig. 2b, i.e. that the cluster set must keep at least two mirrored copies of a data object is fulfilled. In fact there are now four copies of the data but each cluster does not necessarily know this yet.

In Fig. 3c, which for clarity only shows two nodes in each cluster, the clusters 310 and 350 communicate with each other to discover that both clusters have two copies of the data object 399. Looking at the RRIS table in Fig. 2b the clusters agree that they both have two copies, that neither cluster exhibits the "Arc" type, that both clusters are required to store at least one copy of the data object, and that the cluster set as a whole need store only two copies of the data object. Thus, it is logically concluded that they may both delete one copy of the data object 399, and for the purposes of automated failover, described below, one copy of the data object 399 in cluster 310 (on node 316) is replaced with a tag 391 and one copy of the data object 399 in cluster 350 (on node 356) is replaced with a tag 392. As a result, the data object 399 exists in its entirety both on node 315 of cluster 310 and on node 355 of cluster 350. Tag 391 on node 316 describes the data object to a sufficient level to identify it and its expected location within cluster 310. Similarly, tag 392 on node 356 describes the data object to a sufficient level to identify it and its expected location within cluster 350.

Although a relatively simple embodiment has been described, the method of providing automated data redundancy levels can be used to provide a level of automated data redundancy that can be guaranteed across a cluster set in a manner that at no point breaks a minimum required level of data redundancy, i.e. by setting transient redundancy levels for nodes, clusters or the cluster set, but is capable of reducing the level of data redundancy within a specific cluster, within the set RRIS parameters, in order to save storage space.

Furthermore, this method can be used to transfer data to a separate location, for example by specifying in the RRIS that the minimum number of copies of a data object within one cluster is zero, whilst the minimum number of copies within the cluster set as a whole is one or more.

The RRIS may be extended to cope with concepts such as creating extra copies of data objects due to dynamic events, such as large numbers of read requests for the object.

It can be seen that this method enables the locating of data within a cluster set in a manner that is invisible or "virtualised" from the client using the data such that the client is unaware of the physical logicality of the data.

Figs. 4a-c schematically illustrate how the invention may provide automated failover nd data recovery within a cluster set storing data objects, for example as shown in Fig. 3c.

They respectively show three strategic life stages that the cluster set goes through during automated data failover and data recovery. Intermediate stages have not been shown for clarity of the invention, but can easily be deduced by someone of ordinary skill in the art.

Fig. 4a shows a situation in which node 315 of cluster 310 has experienced an outage which could have been caused by a number of reasons, for example hardware failure.

Fig. 4b shows node 316 reacting to this situation. Node 316 may have realised that this situation exists for one of many reasons, for example because it has been unable to contact node 315 for a certain amount of time, or because a client has requested to read a data object on that node. The node holding the data tag that corresponds to a data object that cannot be reached is responsible for ensuring the object exists on the local cluster, and the node that holds the data object is responsible for ensuring that the data tag exists on the cluster. Realising that the RRIS requirements for the data object 399 have been broken, the node 316 makes a request to other clusters in the cluster set, in this case cluster 350, for a copy of the data object 399.

Fig. 4c shows that the node 316 has selected a new location to store the data object 399 in the cluster 310. A copy of the data object is read from node 355 and stored to node 317. The tag 391 is updated to indicate the new location of the data object 399 in the cluster 310. Thus, automated failover and data recovery is performed such that the required level of data redundancy described within the RRIS is restored via automated mechanisms. Whilst in this embodiment data tags (e.g., 391) are used to perform the operations described, it is also possible to achieve and maintain the levels of data redundancy required without the use of such "one-to-one" tags, for example, by using lists of data objects, or by adding metadata about the data object within the data object at the remote location etc. It can be seen that the method may aid detection of failures at a data object level and implementation of automated failover recovery scenarios such that the node, cluster or set of clusters returns to its required level of data redundancy as appropriate. The level of required data redundancy should be re-reached as soon as possible. In alternative embodiments, the method may be implemented at set points, so that for example the level of required data redundancy is re-reached at a defined time of day or at a defined speed of data transfer.

Although one embodiment of the invention has been described in detail above, many other possibilities will be possible within the scope of the claims. The inventive method may be implemented by hardware, firmware, software, or any combination thereof. The software / rmware may include the actual code to carry out the operations described, or code that emulates or simulates the operations.

The program or code segments can be stored in a readable medium or transmitted by a computer data signal embodied in a carrier wave, or a signal modulated by a carrier over a transmission medium. The computer readable medium may include any medium that can store, transmit or transfer information. Examples of the processor readable or machine accessible medium include an electronic circuit, a semiconductor memory device, a read only memory (RUM), a flash memory, an erasable RUM (EROM), a floppy diskette, a compact disk (CD) RUM, an optical disk, a hard disk, a fibre optic medium, a radio frequency link (RF), etc. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibres, air, electromagnetic, RF links, etc. Code segments may be downloaded through Internet, Intranet, etc. The machine readable medium may be embodied in an article of manufacture, and may include program code embedded therein. The program code may include machine-readable code to perform the operations described above. The term "data" as used herein refers to any type of information that is encoded for machine-readable purposes. Therefore, it may include program, code, data, file, etc.

Claims

LALMS

1. A method for providing data redundancy across a cluster set, the cluster set comprising a plurality of clusters connected for communication with each other and wherein each cluster comprises at least one computer node, the or each node of each cluster being interconnected for communication therebetween, the method comprising the steps of: a) selecting a replication scheme that describes the level of data redundancy required in the cluster set, b) applying the scheme to data stored on a computer node within a cluster of the cluster set, c) if required by the scheme, replicating the data onto an additional computer node within the cluster set, d) deleting any data replications surplus to that required by the scheme, e) repeating steps c) and d) until the level of data redundancy required by the scheme is attained.

2. A method according claim I, wherein steps a) to e) are automatically implemented by controlling software.

3. A method according to either of claims I and 2, wherein the replication scheme is included with the data.

4. A method according to claim 2, wherein, after step a), the replication scheme is passed to the controlling software by a client.

5. A method according to any preceding claim, wherein in step b), the replication scheme is applied to a set of data objects according to their physical location or data type.

6. A method according to any preceding claim, including the step of providing a default replication scheme.

7. A method according any preceding claim, wherein the scheme defines transient, minimum and / or final levels of data redundancy on individual nodes, clusters and / or the cluster set.

A method according to any preceding claim, further comprising the steps of: f) detecting changes to the functionality of the cluster set, g) if changes are detected, repeating steps b) to e).

9. A method according to any preceding claim, wherein step a) includes checking the scheme, and if the scheme is found to be incomplete or incorrect, carrying out the remaining steps using default values of data redundancy.

10. A method according to any preceding claim, further comprising the step of dynamic reallocation of the data onto at least one selected computer node to provide required redundancy levels.

11. A method according to any preceding claim, including the step of storing a data tag on a computer node of a cluster, the tag comprising information relating to the identity and location of the data.

12. A method according to claim 11, wherein a node on which the tag is stored is responsible for ensuring the data exists on the cluster.

13. A computer program for enabling the method according to any preceding claim, when implemented on a cluster set.

14. A computer-readable medium carrying the program of claim 13.

15. A cluster set when programmed with the program of claim 13.

16. A cluster set comprising a plurality of clusters connected for communication with each other, each cluster comprising at least one computer node, with the or each node of each cluster being interconnected for communication therebetween and comprising means for storing data, wherein the cluster set is programmed with a replication scheme that describes the level of data redundancy required in the cluster set, the cluster set being provided with means for replicating data stored on a computer node for storage onto a different node in accordance with the scheme and deleting any data replications surplus to that required by the scheme.

7. A method substantially as herein described with reference to the accompanying Figs. 2 to4.

18. A cluster set substantially as herein described with reference to the accompanying Figs.2to4.