GB2529403A - A Method of operating a shared nothing cluster system - Google Patents

A Method of operating a shared nothing cluster system

Info

Publication number
GB2529403A
Authority
GB
United Kingdom
Prior art keywords
backup
node
block
storage
block sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1414592.4A
Other versions
GB201414592D0 (en)
Inventor
Dominic Mueller-Wicke
Nils Haustein
Thomas Schreiber
Christian Bolik
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to GB1414592.4A
Publication of GB201414592D0
Priority to US14/827,867 (published as US9952940B2)
Publication of GB2529403A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1402 - Saving, restoring, recovering or retrying
    • G06F11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F11/1448 - Management of the data involved in backup or backup restore
    • G06F11/1456 - Hardware arrangements for backup
    • G06F11/1458 - Management of the backup or restore process
    • G06F11/1464 - Management of the backup or restore process for networked environments
    • G06F11/1469 - Backup restoration techniques
    • G06F11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053 - Error detection or correction by redundancy in hardware where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056 - Redundant persistent mass storage by mirroring
    • G06F11/2071 - Mirroring using a plurality of controllers
    • G06F11/2074 - Asynchronous techniques

Abstract

A shared nothing cluster system (SNCS) 100 comprises a backup server 126 for backing up data elements 135. The SNCS 100 includes a common namespace component 101 (e.g. a file system mount point), nodes 110-112 with storage devices 120-122, and a local table 128. A data element 135 is partitioned into block sequences 137A-137E distributed across storage devices 120-122 as sets of blocks BS1-BS3; for example, BS1 comprises sequences 137A and 137C. Placement information is stored in table 128. After a request to back up data element 135, each node identifies its corresponding block sequences and sends them to the backup server along with information on the ordering of the block sequences, the node, and the data element (e.g. the filename). The server organises the information for each block sequence as an entry in backup information table 300. It adds a first flag after complete reception and storage of each sequence and a second flag at completion of the backup of the entire data element.

Description

DESCRIPTION
A method of operating a shared nothing cluster system
Field of the invention
The invention relates to computing systems, and more particularly to a method of operating a shared nothing cluster system.
Background
The significant growth of data confronts system administrators with new and challenging situations in terms of data protection. In large storage systems, data is not merely stored for occasional future use. Analytics and stream computing require immediate access, in different manners, to all the data stored in such a system. These and other aspects of data collection and data usage are described by the term Big Data. To allow immediate access to all the data, new system architectures were defined that implement physically short distances between the data stored on disk and the nodes that process the data. The most common architecture in this area is the so-called Shared Nothing Cluster. However, there is a continuous need to improve the storage performance of such shared nothing clusters.
Summary of the invention
It is an objective of embodiments of the invention to provide an improved method of operating a shared nothing cluster system, a shared nothing cluster system, a backup server, a node and a computer program product. Said objective is solved by the subject matter of the independent claims. Advantageous embodiments are described in the dependent claims.
In one aspect, the invention relates to a method of operating a shared nothing cluster system, SNCS, the SNCS comprising at least a first and a second storage node connected via a first network of the SNCS, the first and second storage nodes being configured to store a first set and a second set of blocks respectively, wherein the first and second sets of blocks form a single data element. The method comprises: - providing a backup server connected to the first and second storage nodes, the backup server comprising a backup information table; - configuring the first and second storage nodes to act as backup clients in a client-server configuration involving the backup server; - upon receiving at the first and second storage nodes a request to back up the data element, performing for each node of the first and second storage nodes: * identifying by the node one or more block sequences of consecutive blocks in the set of blocks of the data element stored in the node; * sending by the node the identified one or more block sequences to the backup server; * generating by the node backup information indicating at least: o the order of each block in each of the one or more block sequences, o the order of each of the one or more block sequences within the one or more block sequences, and o the node and the data element; * sending by the node the backup information to the backup server; * storing by the backup server each received block sequence of the one or more block sequences in a respective storage location in the backup server; * creating by the backup server, for each of the received one or more block sequences, an entry in the backup information table, wherein the entry comprises at least the storage location of the block sequence in the backup server and the associated backup information; * adding and setting by the backup server a first flag in at least one of the created entries in case of a complete reception and storage of the one or more block sequences;
- in response to a determination by the backup server that the first and second sets of blocks are associated with the first flag, adding and setting a second flag in at least one of the created entries for the first and second sets of blocks, indicating the completion of the backup of the data element.
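As a rough illustration of this two-flag scheme, the following Python fragment models a backup server that sets the first flag once all of a node's sequences for an element have arrived, and the second flag once every node has completed. This is a minimal sketch, not the patented implementation; all class, field and path names (BackupEntry, action_complete, the /pool path scheme) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BackupEntry:
    node: str                 # storage node that sent the block sequence
    element: str              # data element name, e.g. a filename
    seq_order: int            # order of this sequence among the node's sequences
    total_seqs: int           # how many sequences this node stores for the element
    location: str             # storage location assigned by the backup server
    final_seq: bool = False   # first flag: node's sequences completely received
    action_complete: str = "" # second flag: "B" once the whole element is backed up

class BackupServer:
    def __init__(self):
        self.table = []       # the backup information table

    def receive(self, node, element, seq_order, total_seqs, blocks):
        location = f"/pool/{element}/{node}/{seq_order}"   # hypothetical path scheme
        self.table.append(BackupEntry(node, element, seq_order, total_seqs, location))
        mine = [e for e in self.table if e.node == node and e.element == element]
        if len(mine) == total_seqs:        # all of this node's sequences arrived
            for e in mine:
                e.final_seq = True         # set the first flag

    def complete_if_done(self, element, all_nodes):
        entries = [e for e in self.table if e.element == element]
        if {e.node for e in entries if e.final_seq} == set(all_nodes):
            for e in entries:
                e.action_complete = "B"    # set the second flag: backup complete
```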
In another aspect, the invention relates to a computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therewith, said program code being executable by a computer to perform the method steps of the method of any one of the preceding embodiments.
In another aspect, the invention relates to a shared nothing cluster system, SNCS, the SNCS comprising at least a first and a second storage node connected via a first network of the SNCS, the first and second storage nodes being configured to store a first set and a second set of blocks respectively, wherein the first and second sets of blocks form a single data element, the SNCS further comprising a backup server, wherein the first and second storage nodes are configured to act as backup clients in a client-server configuration involving the backup server, the SNCS being configured to perform at least part of the method steps of the method of any one of the preceding embodiments.
In another aspect, the invention relates to a storage node for a SNCS. The storage node is configured to act as a backup client in a client-server configuration involving a backup server connected to the storage node; store a set of blocks of a data element in the storage node; receive a request to back up the data element; identify one or more block sequences of consecutive blocks in the set of blocks of the data element stored in the node; send the identified one or more block sequences to the backup server; generate backup information indicating at least the order of each block in each of the one or more block sequences, the order of each of the one or more block sequences within the one or more block sequences, and the node and the data element; and send the backup information to the backup server.
In another aspect, the invention relates to a backup server for a SNCS. The backup server is configured to receive from each node of a first and second storage nodes of the SNCS one or more block sequences and associated backup information, the first and second storage nodes being configured to store a first set and a second set of blocks respectively, wherein the first and second sets of blocks form a single data element; store each received block sequence of the one or more block sequences in a respective storage location in the backup server; create for each of the received one or more block sequences an entry in a backup information table of the backup server, wherein the entry comprises at least the storage location of the block sequence in the backup server and the associated backup information; add and set a first flag in at least one of the created entries in case of a complete reception and storage of the one or more block sequences; and, in response to a determination that the first and second sets of blocks are associated with the first flag, add and set a second flag in at least one of the created entries for the first and second sets of blocks, indicating the completion of the backup of the data element.
Brief description of the drawings
In the following, preferred embodiments of the invention will be described in greater detail by way of example only, making reference to the drawings in which:
Fig. 1 illustrates the architecture of a shared nothing cluster system;
Fig. 2 is a flowchart of a method of operating a shared nothing cluster system;
Fig. 3 illustrates a backup information table;
Fig. 4 illustrates an exemplary backup processing on the backup client side;
Fig. 5 illustrates an exemplary backup processing on the backup server side;
Fig. 6 is a flowchart of a method for restoring a data element in a shared nothing cluster system;
Fig. 7 illustrates an exemplary restore processing on the backup client side; and
Fig. 8 illustrates the data query as part of the restore processing of the backup server.
Detailed description
In the following, like numbered elements in the figures either designate similar elements or designate elements that perform an equivalent function. Elements which have been discussed previously will not necessarily be discussed in later figures if the function is equivalent.
As used herein, the SNCS refers to a distributed computing architecture where each node is independent and self-sufficient, without the need to directly share memory or disk access with other nodes of the SNCS. For example, the first (second) storage node has independent access and/or control of the first (second) set of blocks.
A client-server configuration is a system where one or more computers (e.g. first and second storage nodes) called clients connect to a central computer named a server to use resources such as storage and/or processing resources i.e. the server is used to "serve" client computers.
As used herein, the term "node" refers to any communication entity such as a processor, a computer (for instance a client computer), which can be operated within a communication network (such as the public Internet, an intranet, a telecommunication network, etc., or a combinaticn of such networks) . A node may have processing and storage capability for fulfilling assigned tasks in the framework of a shared nothing cluster system.
The first (second) storage node is configured to store the first (second) set of blocks locally in the first (second) storage node, or in a respective storage device that is directly connected to the first (second) node.
The request to back up the data element may be received, for example, from a user of the SNCS or from a computer system connected to the SNCS.
The basic idea of the invention is the usage of the shared nothing cluster architecture for backup. This may be accomplished by introducing a novel backup client on each storage node of the shared nothing cluster, and a file entry table provided by the backup server which provides the ability to store and identify file block metadata. If a backup is initiated, the initiator informs the other storage nodes via a restricted channel; then each storage node acts independently and backs up the file blocks accessible on the given node. Each backup client sends its file blocks to the backup server, which updates the file entry table and stores the data.
According to one embodiment, the request to back up the data element may be received via the first network. The first network may have a limited or restricted bandwidth. For example, one node of the first and second storage nodes may be the initiator of the backup request (i.e. the first to receive or generate it) and may send or forward the backup request via the first network to the other storage node of the first and second storage nodes. In an alternative example, the first and second storage nodes may both receive the backup request via the first network.
The term "data element" or "file system object" as used herein re±ers to information that is stored in a data field. The data element niay have a size of multiple data blocks. The data element may be split in blocks of data, each of them may be stored in a unit of data storage on a device. Examples 0± data elements comprise document file, image file, audio file, or video file, data folder or container etc. For example, the data element may refer to an information, and may comprise multiple blocks or data bloclKs e.g. of fixed size or variable size depending on the storage node where a blocIK is stored. A block sequence of the data element nay comprise consecutive blocIKs 01 the data element that refers or contain a portion of the inThrmat ion.
The above features may have the advantage of reducing the data traffic in the SNCS compared to conventional methods. With conventional methods in a shared nothing cluster, when a backup client - running on one storage node - reads a file system object (e.g. the data element), several blocks of the file system object come from storage nodes other than the one the backup client runs on. This causes extra data on the network (e.g. the first network, which may have a limited bandwidth) and is therefore inefficient.
The above features may enable the restore or recovery of data (when needed) stored in the SNCS.
Another advantage may reside in the fact that the present method may be seamlessly integrated in the existing shared nothing cluster systems as it makes use of the infrastructure of shared nothing systems and introduces changes with reduced burden into the shared nothing cluster system.
According to one embodiment, the SNCS comprises a local table for storing placement information of the one or more block sequences of the data element. The method further comprises receiving at the first and second storage nodes a restore request for restoring the data element. The method further comprises, for each node of the first and second storage nodes: accessing the backup information table for identifying entries associated with the one or more block sequences of the data element that are associated with the node; determining that the second flag is set for the data element; retrieving the one or more block sequences and the associated backup information; storing the one or more block sequences in the node and updating the local table using the backup information to indicate the stored one or more block sequences; determining that the one or more block sequences of the first and second sets of blocks have been restored and, responsive to this, marking the identified entries as restored.
According to one embodiment, adding and setting the first flag by the backup server is performed after a waiting time has elapsed, starting from the time at which a block sequence of the one or more block sequences is first received. If after the waiting time at least part of the one or more block sequences has not been received, the first flag may not be set, e.g. it may be equal to "FALSE" or an empty value. This may be advantageous as it may prevent the association of incomplete information with the data element in the backup server. For example, if a restore request is received for restoring the data element after the waiting time, and at least part of the one or more block sequences has not been received, the backup server may read the first flag and send a meaningful error message stating that the data element is not fully backed up. If such a waiting time constraint is not introduced, however, the backup server may not be able to read the first flag, as it is not set or has no value, and may thus send a non-meaningful error message, if any.
According to one embodiment, the waiting time is determined by: upon receiving a first block sequence of the one or more block sequences, determining a transmission time delay for transmitting a block sequence of the one or more block sequences from the node to the backup server; and using the transmission time delay and the number of the one or more block sequences for determining the waiting time.
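A minimal sketch of this heuristic, assuming the remaining sequences arrive at roughly the rate of the first one; the safety factor is an added assumption, not specified in the text:

```python
def waiting_time(first_seq_delay_s: float, num_sequences: int, safety: float = 2.0) -> float:
    # if the remaining sequences arrive at a similar rate, the full set should
    # arrive within delay * count; the safety factor absorbs network jitter
    return first_seq_delay_s * num_sequences * safety

print(waiting_time(0.5, 4))  # 4.0 seconds before deciding the backup is incomplete
```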
According to one embodiment, the sending of the one or more block sequences is performed in parallel or in sequence.
Depending on the available resources, the SNCS may choose one of the submission methods.
According to one embodiment, the method further comprises deleting the identified one or more block sequences from the node. This may have the advantage of using the SNCS for a distributed archiving of data.
According to one embodiment, the method further comprises generating by the node links pointing to the identified one or more block sequences in the backup server and storing the links in the node. The links may comprise stub files. A stub file refers to a file that replaces the original file on a local file system, e.g. in the storage node, when the file is migrated to storage, e.g. to the backup server. A stub file contains the information that is necessary to recall a migrated file from server storage. It also contains additional information that can be used to eliminate the need to recall a migrated file. This may have the advantage of using the SNCS for a distributed migration of data.
According to one embodiment, the method further comprises, in response to a determination by the backup server that at least one of the first and second sets of blocks is not associated with the first flag, sending a backup failure message to the sender of the request.
According to one embodiment, the first and second storage nodes are connected to the backup server via a second network different from the first network. For example, the first and the second network are able to communicate at a first and a second bandwidth respectively, where the first bandwidth is smaller than the second bandwidth. The second network may comprise a wired or a wireless LAN. The second network may be a public network, such as the Internet, a private network, such as a wide area network (WAN), a storage area network (SAN), or a combination thereof.
According to one embodiment, the SNCS further comprises a third storage node. The method further comprises: at least one node of the first and second storage nodes sending information indicating the first, second and third storage nodes; the backup server, upon setting the second flag, redistributing the blocks of the block sequences of the data element into restore block sequences; assigning the restore block sequences to the first, second and third storage nodes; and updating the backup information table for associating with each of the first, second and third storage nodes its assigned restore block sequences and information indicative of the restore block sequences.
For example, when restoring or retrieving the data element the same method of the restore embodiment may be used. However, the results may be different in the sense that the data element when retrieved may be stored in the first, second and third storage nodes following the assignment that has been used by the backup server.
The information indicating the first, second and third storage nodes may comprise the number of storage nodes in the at least first, second and third storage nodes.
The information indicative of the restore block sequences may comprise, for example, at least: the order of each block in each of the restore block sequences; the order of each of the restore block sequences in the block sequences that are assigned to a given node of the first, second and third storage nodes; which block sequence belongs to each node of the first, second and third storage nodes; and an indication of the data element, e.g. its filename.
In an alternative embodiment, the method of the above embodiment may be performed for a reduced number of storage nodes of the at least first and second storage nodes, e.g. the block sequences may be assigned to only the first storage node.
According to one embodiment, the information indicating the first, second and third storage node comprises the system load in each of the first, second and third storage nodes, wherein the assigning is performed based on the system load.
The system load may comprise the CPU load and/or I/O load. The CPU load refers to a measure of the load on a central processing unit of a computer or a storage node, and may be expressed as the amount of CPU time used by applications and the operating system of the storage node per unit of time.
The system load may be a measure of the amount of computational work that a storage node performs, e.g. for I/O activities (e.g. CPU load for I/O). For example, the higher the system load on a node, the smaller the number of blocks or block sequences to be stored on that node.
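One way such load-aware assignment could look (purely illustrative; the inverse-load weighting is an assumption, as the text only requires that more heavily loaded nodes receive fewer sequences):

```python
def assign_counts(total_sequences: int, loads: dict) -> dict:
    # weight each node inversely to its reported load (load in [0, 1]);
    # a fully loaded node still gets a tiny share via the floor of 0.01
    weights = {n: 1.0 / max(load, 0.01) for n, load in loads.items()}
    scale = total_sequences / sum(weights.values())
    # rounding may leave a small remainder that a real placer would redistribute
    return {n: round(w * scale) for n, w in weights.items()}

print(assign_counts(10, {"node110": 0.2, "node111": 0.5, "node112": 0.8}))
# the least-loaded node receives the largest share of block sequences
```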
Fig. 1 describes the architecture of a shared nothing cluster system 100 in accordance with the present disclosure.
Component 101 describes the common namespace (such as a file system). The common namespace may be, for example, a file system mount point that allows read and write activities on the storage nodes, but is not limited to this.
Components 110, 111 and 112 describe the storage or compute nodes. Storage nodes 110-112 are interconnected via compute node network 105. The connection via network 105 may have a limited bandwidth and may be used to exchange metadata and/or small amounts of data, such as an indication of a backup request.
Components 120, 121 and 122 describe the local storage devices.
Each storage device 120-122 is directly connected to one corresponding storage node 110-112 and can be directly accessed from this node. In an alternative example, the storage devices 120-122 may be part of the storage nodes 110-112 respectively.
The combination of storage nodes and storage devices such as 110+120, 111+121 and 112+122 represent three independent compute-storage nodes that provide the storage for the common namespace 101. The common namespace 101 logically represents the storage provided by all compute-storage nodes 110+120, 111+121 and 112+122.
File system objects or data elements are stored via the common namespace 101. File system objects, like normal files with a size of multiple file system blocks, may be split into block sequences. Each block sequence will be stored on one of the storage devices 120-122. The different block sequences of a single file may be distributed over the storage devices 120-122.
For example, when a data element or a data file 135 is received at the SNCS 100 for storage in the SNCS 100, the data element may be split by the common namespace 101 into block sequences 137A-E (data block sequences). For example, the blocks forming the data element 135 may be stored in the order "137A-137B-137C-137D-137E". For simplicity of the description, only five block sequences forming the data element 135 are shown in Fig. 1. For example, blocks 0 to N-5 are collocated to block sequence 137A. Blocks N-4 to N-1 are collocated to block sequence 137B. Blocks N to M-10 are collocated to block sequence 137C.
Blocks M-9 to M-1 are collocated to block sequence 137D. Blocks M to EOF are collocated to block sequence 137E.
The common namespace 101 may calculate (e.g. taking into account the available resources in the storage nodes 110-112) the optimal placement for the block sequences 137A-E and distribute them over the different storage devices 120-122. For example, a first set of blocks BS1 may comprise the block sequences 137A and 137C, a second set of blocks BS2 may comprise the block sequences 137B and 137E, and a third set of blocks BS3 may comprise the block sequence 137D, wherein the first set of blocks BS1 may be stored in the storage device 120, the second set of blocks BS2 may be stored in the storage device 121, and the third set of blocks BS3 may be stored in the storage device 122. In another example, a random placement of the block sequences 137A-E may be performed.
The placement information, such as which block of the data element 135 is on which storage device, is defined in a local table 128. Table 128 might be part of the metadata of the file system provided by the common namespace 101. Table 128 may be accessed from the common namespace 101 internally and provides an API for external applications.
In an alternative example, the table 128 may be stored in each storage node 110-112.
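One plausible in-memory shape for table 128 is sketched below. This is an assumption for illustration; the text only requires that placement information be stored and queryable via an API. The block ranges mirror the Fig. 1 example, kept as symbolic strings.

```python
# hypothetical representation of local table 128: element -> ordered sequences
local_table = {
    "/fs1/file1": [
        {"sequence": "137A", "blocks": ("0", "N-5"),    "device": 120},
        {"sequence": "137B", "blocks": ("N-4", "N-1"),  "device": 121},
        {"sequence": "137C", "blocks": ("N", "M-10"),   "device": 120},
        {"sequence": "137D", "blocks": ("M-9", "M-1"),  "device": 122},
        {"sequence": "137E", "blocks": ("M", "EOF"),    "device": 121},
    ]
}

def sequences_on_device(element: str, device: int):
    """The kind of query a backup client issues in step 201: which sequences are local."""
    return [s for s in local_table[element] if s["device"] == device]

print(sequences_on_device("/fs1/file1", 120))  # -> 137A and 137C, i.e. the set BS1
```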
In terms of a read of the normal file, e.g. data element 135, the storage nodes may read the information stored in table 128 and coordinate the read of the block sequences 137A-E. The actual read of a single block sequence will be performed only on the storage node, e.g. 110, that has the block sequence, e.g. 137A, stored in its storage.
The SNCS 100 may further comprise a backup server 126. The backup server 126 comprises a backup storage (like disk or tape) to store file system objects (or data elements) . Storage nodes 110-112 may be configured to act as backup clients in a client-server configuration involving the backup server 126. For example, components 131, 132 and 133 represent the backup client components that run on the storage nodes 110-112 respectively.
The backup clients have access to local table 128 that stores the placement information of the block sequences stored on the storage devices 120-122. Furthermore, the backup clients are connected to the backup server 126.
The backup server 126 comprises a backup information table 300 (cf. Fig. 3) that holds backup version information and block sequence information of each file received and stored in the backup server 126.
The backup server 126 manages and maintains the backup information table 300, which holds the information required to ensure the completeness of a data file backup. Furthermore, the table 300 is used to gather the information required for a data file restore and to ensure the completeness of the restore of data files.
The operation of the data processing system 100 will be described in detail with reference to Figs. 2-8.
Fig. 2 is a flowchart of a method of operating a SNCS e.g. 100.
A request to back up a data element or data file, e.g. 135, stored in the SNCS 100 may be received at the storage nodes 110-112. For example, storage node 110 may receive the request and then forward it to the other storage nodes 111-112 via network 105. In an alternative example, the request may be received by each one of the storage nodes 110-112. For example, the request may be received via component 101, either automatically or triggered by a request of a user of the SNCS 100. The method steps 201-213 may be executed for each node of the storage nodes 110-112. For example, the steps 201-213 may be executed in parallel on the storage nodes 110-112.
In step 201, the node, e.g. 110, may identify one or more block sequences of consecutive blocks in the set of blocks of the data element 135 stored in the node. For example, storage node 110 may identify block sequences 137A and 137C of the set of blocks BS1. For example, the storage node 110 may read the local table 128 of the common namespace 101 to identify the one or more block sequences. In another example, the storage node 110 may itself identify the one or more block sequences that belong to the data element 135, e.g. by reading the table 128 stored locally in the storage node 110.
In step 203, the node 110 may send the identified one or more block sequences 137A and 137C to the backup server 126. The sending of the one or more block sequences is performed in parallel or in sequence, e.g. 137A is sent first and then 137C is sent. The parallel submission may be performed using distributed parallel processing resources in the node 110, which may speed up the backup process.
In step 205, the node 110 may generate backup information indicating at least the order of each block in each of the one or more block sequences, the order of each of the one or more block sequences within the one or more block sequences, and the node and the data element. For example, the generation of the backup information may be performed in parallel to, or after, the identification of the one or more block sequences.
For example, the local table 128 may be read for the identification and for the generation of the backup information. For example, the order of block N in the block sequence 137C may be 0, since it is the first block of the block sequence 137C, and the order of block M-10 in the block sequence 137C may be M-10-N.
The order of block sequence 137A may be, for example, 1/2 (indicating its order within two block sequences) in the set of blocks BS1, and block sequence 137C may have an order 2/2 in the set of blocks BS1. An indication of the node may comprise, for example, at least one of an IP address, a port number, a Fibre Channel address or any data able to identify the node 110. The indication of the data element may be, for example, its name, e.g. the filename.
For example, in case the backup information table 300 is created before step 205, the backup information may be generated in accordance with at least part of the backup information table 300, such that it can be used efficiently to fill the fields of the backup information table 300. For example, if a field concerns a given parameter X, the backup information may be generated such that it contains values of parameter X or other values from which the value of parameter X can be derived. In another example, the backup information table 300 may be created as soon as the first block sequence of the data element 135, or of any data element stored in the backup server 126, is first received.
For example, the backup information for a given block sequence may indicate values for the fields 301-306 (described below) of the backup information table 300.
In step 207, the node 110 may send the backup information to the backup server 126. The submission of the backup information may be performed together with, or after, the submission of the identified one or more block sequences.
In step 209, the backup server 126 may store each received block sequence of the one or more block sequences in a respective storage location 309 (Fig. 3) in the backup server 126.
In step 211, the backup server 126 may create, for each of the received one or more block sequences, an entry in the backup information table 300, wherein the entry comprises at least the storage location of the block sequence in the backup server and the associated backup information. For example, after receiving block sequence 137A, the backup server 126 may create an entry 311 in table 300 such that the fields 301-309 may be filled. For example, field 302 may be filled with the name of the data element 135, and field 303 with the sequence number of block sequence 137A, which may be 0 as it is the first sequence in the data element 135 (e.g. the sequence number of block sequence 137C is 2).
In step 213, the backup server 126 may add and set a first flag 307 in at least one of the created entries in case of a complete reception and storage of the one or more block sequences. The first flag being set means that it indicates the complete reception of the one or more block sequences. The first flag may comprise a Boolean or a number, etc.; it is set, for example, when its Boolean value is equal to "TRUE". For example, as soon as the last block sequence of the block sequences 137A and 137C is received, the backup server 126 may fill the field 307 with "TRUE" for the entry 313 and/or entry 314 to indicate that all the block sequences of the data element that are stored in node 110 have been received at the backup server 126. For determining that the reception (e.g. of 137A and 137C) is complete, the backup server 126 may use the backup information to determine the order of the received block sequence 137A, e.g. 1/2, in the block sequences 137A and 137C of BS1. Assuming that the block sequence 137A is received first, the backup server 126 may determine by reading the denominator that there are two block sequences of the data element 135 to be received from the node 110 and that the one having order 1 has been received. In this case, as soon as another block sequence of the data element 135 is received from the node 110 with an order that is different from the order of the previously received block sequence 137A, the backup server 126 may set the first flag to "TRUE" in entry 314 and/or 313.
In an alternative example, the backup server 126 may determine that the reception (e.g. of 137A and 137C) is complete using the backup information, wherein the backup information indicates the final sequence, e.g. of 137A and 137C, to be received by the backup server 126. For example, the block sequences 137A and 137C may be sent in sequence one after the other, and may also be received one after the other, i.e. 137A then 137C. In this case, the backup information associated with each block sequence may comprise an indication of whether the block sequence is the final (e.g. last submitted) one or not (e.g. 137A may be associated with an indication that it is not final, as it is the first one submitted, and 137C may be associated with an indication that it is final). The backup server 126 may use such an indication associated with each block sequence, and if it determines that it is final, it sets the first flag to "TRUE" in entry 314. The first flag may be "FALSE" or empty in entry 313. In another example, it may set the first flag to "TRUE" in entries 313 and 314.
In step 215, in response to a determination by the backup server 126 that the sets of blocks BS1, BS2 and BS3 are associated with the first flag (the first flag being set), a second flag (e.g. 308 "ACTION COMPLETE") is added and set in at least one of the created entries for the first and second sets of blocks, indicating the completion of the backup of the data element (i.e. the data element has been backed up). For example, the backup server 126 may determine that the first flag is set (e.g. to "TRUE") for at least one block sequence of each set of blocks BS1-BS3, i.e. in other terms the sets of blocks BS1, BS2 and BS3 are associated with the first flag. The second flag may thus be set, e.g. to "B", for the block sequences of the sets of blocks BS1-BS3.
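A small sketch of the completeness check just described, assuming the order indication is encoded as a "k/n" string (the encoding is taken from the 1/2 and 2/2 examples above; the function itself is hypothetical):

```python
def node_complete(received_orders: set, order_str: str) -> bool:
    """Record one received sequence's "k/n" order; True once all n have arrived."""
    k, n = map(int, order_str.split("/"))
    received_orders.add(k)
    return received_orders == set(range(1, n + 1))

seen = set()
print(node_complete(seen, "1/2"))  # False: 137A received, 137C still outstanding
print(node_complete(seen, "2/2"))  # True: the first flag may now be set for node 110
```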
Fig. 3 shows an exemplary structure of the backup information table 300 of the backup server 126. The table 300 comprises entries 313, 314, 315, 316 and 317 created for the respective block sequences of the data element 135. The backup information table shows an example of a complete backup of the data element 135.
The columns of table 300 as shown in Fig. 3 are defined as follows.
- Backup Run 301: The number of a backup run of a complete file system backup including multiple single file system object backups (e.g. 0001). Backup run 0001 indicates that all block sequences of the file (e.g. data element 135) are part of the same file version. The file system object or data element 135 that was backed up has the name /fs1/file1 (column object name).
- Object Name 302: The fully qualifying name of a file system object (e.g. /mountpoint/path/filename).
- Block Sequence Number 303: Logical location of the block sequence inside the data element 135. For example, the block sequence number of the block sequence 137A may be 0, since it contains the first blocks that form the data element 135.
- Block Sequence Begin 304: First logical block in the complete block sequence of the file system object (e.g. 0).
- Block Sequence End 305: Last logical block in the complete block sequence of the file system object (e.g. 4).
- Node Name 306: The name of the backup client node (i.e. storage node 110-112) that sends the block sequence for backup (e.g. NODE_110).
- Final Sequence (the first flag) 307: A flag that indicates that the block sequence sent from the storage node has been received completely, together with the other block sequences of the data element that are stored with it in the storage node. In another example, the flag may indicate that the corresponding block sequence is the last block sequence of the data element 135 that is received from the specified node. In this case, for example, the flag may be received from the specified node in association with the block sequence.
For example, the flag may be set based on an indication of the order of a block sequence of a data file within the one or more block sequences of the data file stored in the storage node that has sent the block sequence. For example, the indication of block sequence 137A may be "1/2", where the numerator may be the order of the block sequence 137A and the denominator is the number of block sequences, i.e. 137A and 137C. In this example, the block sequence 137C may be indicated by "2/2". Using the final sequence 307, the backup server 126 may be able to determine whether it has received all the block sequences of a given data element that are stored in a given storage node.
- Action Complete (the second flag) 308: A flag that indicates the completeness of the backup or restore of a block sequence (e.g. "B" may indicate Backup Complete, "BR" may indicate Restore Complete, "A" may indicate Archive Complete, "AR" may indicate Retrieve Complete, "M" may indicate Migration Complete, and "U" may indicate Recall Complete).
- Storage Location 309: Path to the storage location where the block sequence is stored in the backup server (e.g. /pool2/004/00123).
For example, if a storage node (not shown) of the SNCS 100 does not contain any block of the data element, it may send empty action complete information for the data element to indicate that it is finished. An entry may thus be added to the backup information table that indicates the storage node and the data element, and in which the final sequence (i.e. the first flag) is set to "TRUE". After all affected storage nodes have sent their final block sequence, the backup run is finished.
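Putting the column definitions together, a single row of table 300 might look as follows. The values are the illustrative ones used above; the dict representation itself is an assumption, not the patented layout.

```python
entry_313 = {
    "backup_run": "0001",                     # 301: backup run number
    "object_name": "/fs1/file1",              # 302: fully qualifying object name
    "block_sequence_number": 0,               # 303: position inside the data element
    "block_sequence_begin": 0,                # 304: first logical block
    "block_sequence_end": 4,                  # 305: last logical block
    "node_name": "NODE_110",                  # 306: sending backup client node
    "final_sequence": True,                   # 307: the first flag
    "action_complete": "B",                   # 308: the second flag, backup complete
    "storage_location": "/pool2/004/00123",   # 309: path in the backup server
}
```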
Fig. 4 illustrates an exemplary backup processing on the backup client side for a backup of a data element, e.g. 135. After the backup has been initiated and the file system object or data element to be protected is known to the backup client, i.e. storage node, e.g. 110, the backup client queries table 128 to get block sequence information for the block sequences of the data element that are stored on the storage node. If no result is received, the storage node sends metadata (e.g. the empty action complete information) only to the backup server 126 to indicate that the backup on the local node has finished. The backup then ends.
For each query result received, the storage node may read the item in the query result list and collect data and metadata of the corresponding block sequence. If the currently processed list item is the last one, the storage node may set the final sequence 307 metadata to "TRUE". After the collection of the data and metadata has finished, the backup client sends the information on the block sequences to the backup server 126.
Until the backup server 126 commits the information (e.g. informs the storage node that the backup has ended, is complete or successful), the storage node may retry the send. The retry of the send may be triggered, for example, by the reception of an error from the backup server 126. If the information actually sent stands for the final block sequence, the backup ends. If it does not stand for the final block sequence, the novel backup client will proceed with the next item in the list. A sketch of such a send-until-committed loop follows.
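A minimal sketch of the retry loop (the retry count and backoff are assumptions; the text only states that the node retries until the server commits):

```python
import time

def send_with_retry(send_fn, payload, max_retries: int = 5, backoff_s: float = 1.0) -> bool:
    """Retry sending a block sequence until the server commits or retries run out."""
    for attempt in range(max_retries):
        try:
            if send_fn(payload):      # truthy return = commit information received
                return True
        except ConnectionError:
            pass                      # treat transport errors like a missing commit
        time.sleep(backoff_s * (attempt + 1))   # simple linear backoff (assumption)
    return False                      # caller may surface a backup failure message
```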
Fig. 5 illustrates an exemplary backup processing on the backup server side. The backup server 126 receives backup data and metadata information from the storage node. The backup server 126 extracts the block sequence and stores it in a storage pool.
After the backup data has been stored successfully, the metadata will be analyzed. The backup server 126 sets the field action complete 308 to "B" and updates table 300. After the table 300 has been updated successfully, the backup server 126 sends commit information to the storage node.
Fig. 6 is a flowchart of a method for restoring a data element, e.g. 135, stored in the SNCS 100. A restore request for restoring the data element 135 may be received at the storage nodes 110-112. For example, storage node 110 may receive the restore request and then forward it to the other storage nodes 111-112 via network 105. In an alternative example, the restore request may be received by each one of the storage nodes 110-112. For example, the restore request may be received via component 101, either automatically or triggered by a request of a user of the SNCS 100. The method comprises, for each node of the storage nodes 110-112: In step 601, the backup information table 300 may be accessed for identifying entries associated with the block sequences 137A-E of the data element 135 that are associated with the node, e.g. 110. The backup information table 300 may be accessed by the node 110, e.g. by the node requesting its content from the backup server 126. In another example, the backup information table 300 may be accessed by the backup server 126, wherein the restore request is forwarded to the backup server 126 by the node 110.
In step 603, it is determined (by the node 110 or the backup server 126, whichever accessed the backup information table 300) that the second flag, e.g. Action Complete 308, is set for the data element. In other terms, at least part of the values of field 308 associated with the block sequences of the data element 135 may be read, and if the second flag is set, e.g. the value "B" appears as a field 308 value, it is determined that the second flag is set for the data element 135.
In step 605, the node 110 may retrieve the block sequences 137A-E and the associated backup information e.g. the values stored in the fields 301-309 for entries 313-317. In an alternative example, the backup server 126 (if it has access to the backup information table 300 in step 601) may send the block sequences 137A-E and the associated backup information to the node 110.
In step 607, the block sequences 137A-E may be stored in the node 110 and the local table 128 may be updated using the backup information to indicate the stored block sequences 137A-E.
In step 609, the backup server 126 may determine that the block sequences 137A-E of the data element have been restored and, responsive to this, it marks the identified entries 313-317 as restored. For example, it may set the value of field 308 to "BR" to indicate that the restore is complete.
Fig. 7 illustrates an exemplary restore processing on the backup client side. After the restore processing has been initiated and the storage node knows the file system object or data element, e.g. by name, to be restored, the storage node queries table 300 via the backup server 126 to receive a metadata list of all block sequences of the data element 135 that belong to the storage node (the information found for the storage node that sent the restore request is written to the metadata list by the backup server 126 and sent to the storage node). If the query result is empty, the storage node sends metadata to the backup server 126 indicating that the restore processing has finished on the storage node. For each item in the query result list, the storage node will query the backup server 126 for the actual data of the corresponding block sequence. Next, the node may write the block sequence to disk. If the data was written to disk successfully, the storage node may update table 128 with the appropriate metadata of the recently written block sequence. If the actually restored block sequence belongs to the last item in the query result list, the storage node may send metadata to the backup server 126 indicating that the restore processing has finished on the storage node, and ends the restore processing.
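Condensed into Python, the client-side loop could look as follows. All helper names (query_metadata, fetch_sequence, notify_done, write_to_disk) are hypothetical stand-ins for the interfaces described above, so this is a structural sketch rather than a runnable client.

```python
def restore_on_node(node, element, backup_server, update_local_table):
    """Sketch of the Fig. 7 flow for one storage node."""
    items = backup_server.query_metadata(element, node)   # entries for this node
    if not items:
        backup_server.notify_done(element, node)          # nothing stored here
        return
    for i, item in enumerate(items):
        blocks = backup_server.fetch_sequence(item)       # actual sequence data
        node.write_to_disk(blocks)                        # restore the sequence locally
        update_local_table(element, item)                 # refresh table 128 metadata
        if i == len(items) - 1:
            backup_server.notify_done(element, node)      # final item: report finished
```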
Figure 8 illustrates the data query (i.e. the access to the backup information table 300 for processing the restore request) as part of the restore processing of the backup server 126.
After the backup server 126 has received a data query from the storage node, the backup server 126 may query table 300 to extract the required information. If no data exists for the querying storage node, the backup server 126 sets the restore state to complete in table 300 for the querying storage node and sends the empty data query result to the storage node. For each query result in the list, the backup server 126 verifies whether the current list item is the last item in the list. If the item actually processed is the last item in the list, the backup server 126 sets the action complete information for the item to "BR". The backup server 126 sends the block sequence and the corresponding metadata to the storage node. If the storage node indicates that the information was received, the backup server 126 may process the next item in the list. If no more items exist, the backup server 126 ends the restore processing.
In the following, other data processing methods that may be derived from the above method are described.
ARCHIVE Method: The archive processing leans against the backup processing. The difference is that the action complete 308 indicator in table 300 is set to "A". Furthermore, the storage node 110 may optionally remove the block sequences 137A and 137C of the file system object, i.e. data element 135, from the file system on the storage node 110 after the archive operation has been committed by the backup server 126.
RETRIEVE Method: The retrieve processing leans against the restore processing.
The difference is that the action complete 308 indicator in table 300 is set to "AR".
MIGRATE Method:
The migrate processing leans against the backup processing. The difference is that the action complete 308 indicator in table 300 is set to "M". Furthermore, the storage node 110 may remove the block sequences 137A and 137C of the file system object 135 from the file system on all storage nodes in the SNCS 100 after the migrate operation has been committed by the backup server 126 (e.g. the backup server sends a message that the migrate operation on the server side is complete or successful), except for the node owning the file object's metadata (inode). On this node, as many blocks may be kept on the local storage of the node, starting from the beginning of the file or data element 135, as is configured as the "stub size" of the file object 135. Reads within this stub size may not cause a recall operation to be triggered, but read operations beyond the stub size, as well as write operations to the migrated file object, will cause the recall method to be triggered.
RECALL Method: The recall processing leans against the restore processing.
The difference is that the action complete 308 indicator in table 300 is set to "U".
Cluster-optimized data transfer method from backup server to backup clients: This cluster-optimized data transfer method does not assume that the number of backup clients involved in a restore, retrieve, or recall operation is the same as when the file object or data element, e.g. 135, was backed up, archived, or migrated. Instead, it enables full utilization of a larger or smaller number of storage nodes in the cluster for transferring file object data as part of a restore, retrieve, or recall operation. For example, the storage nodes 110-112 have been used to store the data element 135 and were thus used for the backup operation when the backup information was generated.
However, the SNCS 100 may further comprise at least one more storage node (not shown). This method may make use of the storage nodes 110-112 and the additional storage node for, e.g., the restore operation, such that when the data element 135 is restored or retrieved it may be stored not only on storage nodes 110-112 as before, but may be distributed over storage nodes 110-112 and the additional storage node. In another example, this method may make use of a smaller number of storage nodes than was used for the backup operation, e.g. it may make use of two storage nodes, or one storage node, of the storage nodes 110-112 in order to redistribute the data element 135 on the reduced number of nodes.
This cluster-optimized data transfer method involves the following steps, before data is transferred in parallel by all of the available storage nodes in the manner described in the sections above: 1. The initial storage node 110 (i.e. the storage node that first received the request to back up the data element 135), or at least one of the storage nodes 110-112, communicates to the backup server 126 the number and identity (e.g. host name, IP address, etc.) of the storage nodes available in the SNCS 100 which can be used for the data transfer required to complete the restore, retrieve, or recall operation.
2. Using this information, the backup server 126 creates a new logical set of block sequences for the data element 135 being the subject of the operation, by dividing the total size of the data element 135 by the number of nodes available for the data transfer. This includes storing a new set of entries in the backup information table 300, along with new block sequence number, begin, end, and "final sequence" specifications.
3. When the storage nodes interrogate the backup server 126 for the sequences to transfer, the backup server 126 consults these new entries for distributing the data transfer of the data element 135 (e.g. equally) across the available storage nodes. A sketch of such a redistribution follows.
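A sketch of step 2's re-cut under the simplest assumption of an equal split; load-aware sizing, as in the variants below, would adjust the per-node ranges.

```python
import math

def redistribute(total_blocks: int, nodes: list) -> dict:
    """Cut the element's blocks into one restore block sequence per available node."""
    per_node = math.ceil(total_blocks / len(nodes))   # round up so nothing is dropped
    plan, start = {}, 0
    for node in nodes:
        end = min(start + per_node, total_blocks)
        plan[node] = (start, end - 1)   # new begin/end block range for this node
        start = end
    return plan

print(redistribute(100, ["node110", "node111", "node112", "node113"]))
# four restore sequences, although the backup originally used three nodes
```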
In another implementation of steps 1 and 2, the storage node includes an indication of the current CPU load of each of the storage nodes available for data transfer in the set of information communicated to the backup server 126 in step 1. In step 2, the backup server 126 then considers the current CPU load of the storage nodes in sizing the sequences passed to each of the nodes for being transferred back to node-local storage.
In yet another implementation, the current I/O load of each storage node is considered in the same way.
A 'computer-readable storage medium' as used herein encompasses any tangible storage medium which may store instructions which are executable by a processor of a computing device. The computer-readable storage medium may be referred to as a computer-readable non-transitory storage medium. The computer-readable storage medium may also be referred to as a tangible computer-readable medium. In some embodiments, a computer-readable storage medium may also be able to store data which is able to be accessed by the processor of the computing device.
Examples of computer-readable storage media include, but are not limited to: a floppy disk, a magnetic hard disk drive, a solid state hard disk, flash memory, a USB thumb drive, Random Access Memory (RAM), Read Only Memory (ROM), an optical disk, a magneto-optical disk, and the register file of the processor.
Examples of optical disks include Compact Disks (CD) and Digital Versatile Disks (DVD), for example CD-ROM, CD-RW, CD-R, DVD-ROM, DVD-RW, or DVD-R disks. The term computer-readable storage medium also refers to various types of recording media capable of being accessed by the computer device via a network or communication link. For example, data may be retrieved over a modem, over the internet, or over a local area network.
Computer executable code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with computer executable code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
'Computer memory' or 'memory' is an example of a computer-readable storage medium. Computer memory is any memory which is directly accessible to a processor. 'Computer storage' or 'storage' is a further example of a computer-readable storage medium. Computer storage is any non-volatile computer-readable storage medium. In some embodiments computer storage may also be computer memory or vice versa.
A 'processor' as used herein encompasses an electronic component which is able to execute a program or machine executable instruction or computer executable code. References to the computing device comprising "a processor" should be interpreted as possibly containing more than one processor or processing core. The processor may for instance be a multi-core processor.
A processor may also refer to a collection of processors within a single computer system or distributed amongst multiple computer systems. The term computing device should also be interpreted to possibly refer to a collection or network of computing devices each comprising a processor or processors. The computer executable code may be executed by multiple processors that may be within the same computing device or which may even be distributed across multiple computing devices.
Computer executable code may comprise machine executable instructions or a program which causes a processor to perform an aspect of the present invention. Computer executable code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages, and compiled into machine executable instructions. In some instances the computer executable code may be in the form of a high level language or in a pre-compiled form and be used in conjunction with an interpreter which generates the machine executable instructions on the fly.
The computer executable code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block, or a portion of the blocks, of the flowchart illustrations and/or block diagrams can be implemented by computer program instructions in the form of computer executable code where applicable. It is further understood that, when not mutually exclusive, combinations of blocks in different flowcharts, illustrations, and/or block diagrams may be combined. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as an apparatus, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer executable code embodied thereon.
It is understood that one or more of the aforementioned embodiments may be combined as long as the combined embodiments are not mutually exclusive.
List of Reference Numerals
100 SNCS,
101 namespace,
110-112 storage node,
120-122 storage device,
126 backup server,
128 local table,
131-133 backup client,
135 data element,
137A-E block sequence,
300 backup information table,
301-309 fields,
313-317 entries.

Claims (15)

1. A method of operating a shared nothing cluster system, SNCS, the SNCS (100) comprising at least a first and a second storage node (110-112) connected via a first network (105) of the SNCS, the first and second storage nodes (110-112) being configured to store a first set (BS1) and second set (BS2) of blocks respectively, wherein the first and second set of blocks form a single data element (135), the method comprising:
-providing a backup server (126) being connected to the first and second storage nodes, the backup server comprising a backup information table (300);
-configuring the first and second storage nodes (110-112) to act as backup clients in a client-server configuration involving the backup server (126);
-upon receiving at the first and second storage nodes (110-112) a request to backup the data element (135), the method further comprising for each node of the first and second storage nodes (110-112):
* identifying by the node (110) one or more block sequences (137A, 137C) of consecutive blocks in the set of blocks (BS1) of the data element (135) stored in the node (110);
* sending by the node (110) the identified one or more block sequences (137A, 137C) to the backup server (126);
* generating by the node (110) backup information indicating at least:
o the order of each block in each of the one or more block sequences (137A, 137C),
o the order of each of the one or more block sequences in the one or more block sequences (137A, 137C), and
o the node (110) and the data element (135),
* sending by the node (110) the backup information to the backup server (126);
* storing by the backup server (126) each received block sequence of the one or more block sequences (137A, 137C) in a respective storage location in the backup server (126);
* creating by the backup server (126) for each of the received one or more block sequences an entry (313, 314) in the backup information table (300), wherein the entry comprises at least the storage location (309) of the block sequence and the associated backup information in the backup server (126);
* adding and setting by the backup server (126) a first flag (307) into at least one of the created entries in case of a complete reception and storage of the one or more block sequences (137A, 137C);
-in response to a determination by the backup server that the first and second sets of blocks (BS1, BS2) are associated with the first flag (307), adding and setting a second flag (309) to at least one of the created entries for the first and second sets of blocks indicating the completion of the backup of the data element.
2. The method of claim 1, wherein the SNCS (100) comprises a local table (128) for storing placement information of the one or more block sequences of the data element (135), the method further comprising:
-receiving at the first and second storage nodes (110-112) a restore request for restoring the data element (135);
the method further comprising for each node of the first and second storage nodes:
* accessing the backup information table for identifying entries associated with the one or more block sequences of the data element that are associated with the node;
* determining that the second flag is set for the data element;
* retrieving the one or more block sequences and the associated backup information;
* storing the one or more block sequences in the node and updating the local table using the backup information to indicate the stored one or more block sequences;
-determining that the one or more block sequences of the first and second set of blocks have been restored and, responsive to this, marking the identified entries as restored.
3. The method of claim 1, wherein adding and setting by the backup server (126) the first flag is performed after a waiting time has elapsed, starting from the time at which a block sequence of the one or more block sequences is first received.
4. The method of claim 3, wherein the waiting time is determined by:
-upon receiving a first block sequence of the one or more block sequences,
* determining a transmission time delay for transmitting a block sequence of the one or more block sequences from the node to the backup server;
* using the transmission time delay and the number of one or more block sequences for determining the waiting time.
5. The method of claim 1, wherein the sending of the one or more block sequences is performed in parallel or in sequence.
6. The method of claim 1, further comprising deleting the identified one or more block sequences from the node.
7. The method of claim 6, further comprising generating by the node links pointing to the identified one or more block sequences in the backup server and storing the links in the node.
8. The method of claim 1, further comprising, in response to a determination by the backup server that at least one of the first and second sets of blocks is not associated with the first flag, sending a backup failure message to the sender of the request.
9. The method of claim 1, wherein the SNCS (100) further comprises a third storage node, the method further comprising:
-at least one node of the first and second storage nodes (110-112) sending information indicating the first, second and third storage nodes;
-the backup server (126), upon setting the second flag, redistributing the blocks of the block sequences (137A-E) of the data element into restore block sequences;
-assigning the restore block sequences to the first, second and third storage nodes;
-updating the backup information table for associating to each of the first, second and third storage nodes its assigned restore block sequences and information indicative of the restore block sequences.
10. The method of claim 9, wherein the information indicating the first, second and third storage nodes comprises the system load in each of the first, second and third storage nodes, wherein the assigning is performed based on the system load.
11. The method of claim 1, wherein the first and second storage nodes are connected to the backup server (126) via a second network different from the first network (105).
12. A computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therewith, said program code being executable by a computer to perform the method steps of the method of any one of the preceding claims.
13. A shared nothing cluster system, SNCS, the SNCS comprising at least a first and a second storage node connected via a first network of the SNCS, the first and second storage nodes being configured to store a first set and second set of blocks respectively, wherein the first and second set of blocks form a single data element, the SNCS further comprising a backup server, wherein the first and second storage nodes are configured to act as backup clients in a client-server configuration involving the backup server, the SNCS being configured to perform the method steps of claim 1.
14. A storage node (110) for a shared nothing cluster system, SNCS (100), the storage node (110) being configured to
-act as a backup client in a client-server configuration involving a backup server (126) connected to the storage node (110);
-store a set of blocks (BS1) of a data element (135) in the storage node (110);
-receive a request to backup the data element;
-identify one or more block sequences (137A, 137C) of consecutive blocks in the set of blocks (BS1) of the data element (135) stored in the node (110);
-send the identified one or more block sequences (137A, 137C) to the backup server (126);
-generate backup information indicating at least:
* the order of each block in each of the one or more block sequences (137A, 137C),
* the order of each of the one or more block sequences in the one or more block sequences (137A, 137C), and
* the node (110) and the data element (135);
-send the backup information to the backup server (126).
15. A backup server (126) for a shared nothing cluster system, SNCS (100), the backup server (126) being configured to
-receive from each node of first and second storage nodes (110-112) of the SNCS (100) one or more block sequences (137A, 137C) and associated backup information, the first and second storage nodes (110-112) being configured to store a first set (BS1) and second set (BS2) of blocks respectively, wherein the first and second set of blocks form a single data element (135);
-store each received block sequence of the one or more block sequences (137A, 137C) in a respective storage location in the backup server (126);
-create for each of the received one or more block sequences an entry (313, 314) in a backup information table (300) of the backup server (126), wherein the entry comprises at least the storage location (309) of the block sequence and the associated backup information in the backup server (126);
-add and set a first flag (307) into at least one of the created entries in case of a complete reception and storage of the one or more block sequences (137A, 137C);
-in response to a determination that the first and second sets of blocks (BS1, BS2) are associated with the first flag (307), add and set a second flag (309) to at least one of the created entries for the first and second sets of blocks indicating the completion of the backup of the data element.
GB1414592.4A 2014-08-18 2014-08-18 A Method of operating a shared nothing cluster system Withdrawn GB2529403A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1414592.4A GB2529403A (en) 2014-08-18 2014-08-18 A Method of operating a shared nothing cluster system
US14/827,867 US9952940B2 (en) 2014-08-18 2015-08-17 Method of operating a shared nothing cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1414592.4A GB2529403A (en) 2014-08-18 2014-08-18 A Method of operating a shared nothing cluster system

Publications (2)

Publication Number Publication Date
GB201414592D0 GB201414592D0 (en) 2014-10-01
GB2529403A true GB2529403A (en) 2016-02-24

Family

ID=51662550

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1414592.4A Withdrawn GB2529403A (en) 2014-08-18 2014-08-18 A Method of operating a shared nothing cluster system

Country Status (2)

Country Link
US (1) US9952940B2 (en)
GB (1) GB2529403A (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9971528B2 (en) * 2016-03-01 2018-05-15 International Business Machines Corporation Cold storage aware object replication
US10452490B2 (en) * 2016-03-09 2019-10-22 Commvault Systems, Inc. Data management and backup of distributed storage environment
CN106776145B (en) * 2016-12-29 2020-04-03 上海爱数信息技术股份有限公司 Method, system and server for controlling database backup performance
US10834190B2 (en) 2018-01-18 2020-11-10 Portworx, Inc. Provisioning of clustered containerized applications
CN110609764B (en) * 2018-06-15 2023-07-25 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for data backup
US11563856B2 (en) * 2018-07-27 2023-01-24 Sony Corporation Voice communication terminal, information processing method for voice communication terminal, distribution server, and information processing method for distribution server
US10747465B2 (en) * 2018-10-30 2020-08-18 EMC IP Holding Company LLC Preserving replication to a storage object on a storage node
US11360856B2 (en) 2019-09-27 2022-06-14 Amazon Technologies, Inc. Manifest index for block-level snapshots
US11403185B2 (en) 2019-09-27 2022-08-02 Amazon Technologies, Inc. Network-accessible block-level snapshots
US11086542B1 (en) * 2020-02-24 2021-08-10 Amazon Technologies, Inc. Network-configurable snapshot load order properties
US11550816B1 (en) 2020-03-02 2023-01-10 Amazon Technologies, Inc. Variable replication levels for an object of a snapshot of a block storage volume
US20240028463A1 (en) * 2022-07-25 2024-01-25 Dell Products L.P. Applications and file system cluster protection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991016677A1 (en) * 1990-04-25 1991-10-31 Unisys Corporation Recoverability in mass storage data base systems
US6397308B1 (en) * 1998-12-31 2002-05-28 Emc Corporation Apparatus and method for differential backup and restoration of data in a computer storage system
WO2005045606A2 (en) * 2003-10-28 2005-05-19 Pillar Data Systems, Inc. Data replication in data storage systems
US20090063589A1 (en) * 2007-08-29 2009-03-05 International Business Machines Corporation Apparatus and method to decouple large object data processing from main-line data processing in a shared-nothing architecture
WO2012047503A2 (en) * 2010-09-27 2012-04-12 Imerj, Llc High speed parallel data exchange
US20130151494A1 (en) * 2011-12-09 2013-06-13 Microsoft Corporation Consistent Database Recovery Across Constituent Segments

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7093005B2 (en) * 2000-02-11 2006-08-15 Terraspring, Inc. Graphical editor for defining and creating a computer system
US8949395B2 (en) * 2004-06-01 2015-02-03 Inmage Systems, Inc. Systems and methods of event driven recovery management
US7890555B2 (en) 2007-07-10 2011-02-15 International Business Machines Corporation File system mounting in a clustered file system
JP5311491B2 (en) * 2009-11-17 2013-10-09 Necシステムテクノロジー株式会社 Graphics vertex processing apparatus and graphics vertex processing method
US8219769B1 (en) * 2010-05-04 2012-07-10 Symantec Corporation Discovering cluster resources to efficiently perform cluster backups and restores
US8701113B2 (en) 2010-05-27 2014-04-15 International Business Machines Corporation Switch-aware parallel file system
US8667033B1 (en) 2011-05-14 2014-03-04 Gopivotal, Inc. Persistent file system objects for management of databases
US8977703B2 (en) 2011-08-08 2015-03-10 Adobe Systems Incorporated Clustering without shared storage
US8805789B2 (en) 2012-09-12 2014-08-12 International Business Machines Corporation Using a metadata image of a file system and archive instance to backup data objects in the file system

Also Published As

Publication number Publication date
GB201414592D0 (en) 2014-10-01
US9952940B2 (en) 2018-04-24
US20160048430A1 (en) 2016-02-18

Similar Documents

Publication Publication Date Title
US9952940B2 (en) Method of operating a shared nothing cluster system
US11740808B2 (en) Making more active use of a secondary storage system
US10860457B1 (en) Globally ordered event stream logging
US9811424B2 (en) Optimizing restoration of deduplicated data
KR101503202B1 (en) Data synchronization
CN102708165B (en) Document handling method in distributed file system and device
US10534769B2 (en) Maintaining consistency between a transactional database system and a non-transactional content repository for document objects
US9747168B2 (en) Data block based backup
US20150134796A1 (en) Dynamic partitioning techniques for data streams
US11188423B2 (en) Data processing apparatus and method
US9471586B2 (en) Intelligent selection of replication node for file data blocks in GPFS-SNC
JP2009259007A (en) Distributed storage method, distributed storage system and distributed storage device
US10515228B2 (en) Commit and rollback of data streams provided by partially trusted entities
CN108540510B (en) Cloud host creation method and device and cloud service system
CN109726037B (en) Method, apparatus and computer program product for backing up data
US20190179807A1 (en) Table and index communications channels
US10102228B1 (en) Table and index communications channels
US9241046B2 (en) Methods and systems for speeding up data recovery
US20180007130A1 (en) Peer-to-Peer Assisted Personal Synchronization
US11734230B2 (en) Traffic redundancy deduplication for blockchain recovery
CN112559485A (en) Method, apparatus and computer program product for managing a storage system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)