GB2558517A - Automatic and customisable checkpointing - Google Patents

Automatic and customisable checkpointing

Info

Publication number
GB2558517A
GB2558517A GB1609530.9A GB201609530A
Authority
GB
United Kingdom
Prior art keywords
node
memory
staging
computation
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1609530.9A
Other versions
GB2558517B (en)
GB201609530D0 (en)
Inventor
Aldea Lopez Sergio
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to GB1609530.9A priority Critical patent/GB2558517B/en
Publication of GB201609530D0 publication Critical patent/GB201609530D0/en
Priority to US15/454,651 priority patent/US10949378B2/en
Publication of GB2558517A publication Critical patent/GB2558517A/en
Application granted granted Critical
Publication of GB2558517B publication Critical patent/GB2558517B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1456 Hardware arrangements for backup
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1458 Management of the backup or restore process
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/2866 Architectures; Arrangements
    • H04L 67/2885 Hierarchically arranged intermediate devices, e.g. for hierarchical caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/56 Provisioning of proxy services
    • H04L 67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/84 Using snapshots, i.e. a logical point-in-time copy of the data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Retry When Errors Occur (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Big Data ecosystem such as the Internet of Things (IoT) has a checkpointing mechanism whereby objects corresponding to in-memory data structures are copied from memories 240 in computation nodes 200 to staging nodes 700 by using Remote Direct Memory Access (RDMA). Staging nodes are intermediate nodes facilitating communications between computing nodes and file systems for backup. Checkpoints are kept in RAM memory 740 of staging node 700, and then asynchronously copied to non-volatile storage disks, such as parallel file system 150. Differently to previous approaches, checkpoints remain in volatile memory 740 within the checkpointing mechanism, making recovery faster by exploiting the speed advantage of RAM over disk accesses. Alternatively to copying, objects in memory are updated incrementally to newer versions by applying the chain of changes (lineage) made to them in the corresponding computation nodes 200. An automatic and customisable mechanism controls when the checkpointing process is triggered.

Description

(71) Applicant(s): Fujitsu Limited (Incorporated in Japan), 1-1 Kamikodanaka 4-chome, Nakahara-ku, Kawasaki-shi, Kanagawa 211-8588, Japan
(72) Inventor(s): Sergio Aldea Lopez
(74) Agent and/or Address for Service: Haseltine Lake LLP, 5th Floor Lincoln House, 300 High Holborn, LONDON, WC1V 7JH, United Kingdom
(51) INT CL: G06F 11/14 (2006.01); H04L 29/08 (2006.01)
(56) Documents Cited: WO 2013/101142 A1; WO 2012/130348 A1; US 20080046644 A1; Prabhakar et al., "Provisioning a Multi-Tiered Data Staging Area for Extreme-Scale Machines", 31st International Conference on Distributed Computing Systems (ICDCS), IEEE, 2011, ISBN 978-1-61284-384-1, ISBN 1-61284-384-0; Sato et al., "Design and Modeling of a Non-blocking Checkpointing System", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, 2012, ISBN 978-1-4673-0805-2, ISBN 1-4673-0805-6
(58) Field of Search: INT CL G06F, H04L; Other: EPODOC, WPI, TXTE, TXTCN, TXTKR, TXPJPEA, TXTJPEB, INSPEC, XPESP, XPIEE, XPIPCOM, XPI3E, XPMISC, XPLNCS, XPRD, XPSPRNG, TDB, Internet
(54) Title of the Invention: Automatic and customisable checkpointing
Abstract Title: Using the memory of staging nodes as part of the checkpointing mechanism
Computation node 200; Staging node 700 [front-page figure, FIG. 7]

At least one drawing originally filed was informal and the print reproduced here is taken from a later filed formal copy.

[Drawings, sheets 1/11 to 11/11, FIGs. 1 to 12: only fragmentary labels survive extraction, including "Transformation #1", "Dataset", "Object #1" to "Object #4", "Increases safety / Bottleneck (write on disk)" (FIG. 1); "CPU 26", "CPU 27", "Bus", "Network" (FIG. 2); "Without RDMA" (FIG. 4); "Legend: Disk, Memory, CPU" (FIG. 5); "Usage frequency" (FIGs. 8 and 10); "1000" (FIG. 12). The figures are described in the Brief Description of the Drawings and the detailed description below.]

Automatic and Customisable Checkpointing
Automatic and Customisable Checkpointing
Field of the Invention
The present invention relates to checkpointing, which is a technique employed to improve the fault-tolerance of applications executed by computer systems.
Background of the Invention
Computer systems are not exempt from unexpected failures, and in any case require periodic shutdowns for maintenance. This has led to a proliferation of different fault-tolerance techniques with the aim of avoiding either the loss of data, or the need to recompute complex and long processes. One of the most common techniques consists of making a checkpoint of the state of said processes, or of the data structures used by those processes, by saving them to reliable storage (conventionally, disk-based). This allows later restarting of the execution of the processes or restoring of the values of those data structures. Although necessary to avoid loss of data, checkpointing mechanisms typically incur bottlenecks because they usually involve I/O operations to disk. For this reason, multiple approaches have been put forward in order to improve these mechanisms, from diskless to multi-level checkpointing.
For many years, the main efforts in developing checkpointing techniques were focused on scientific applications, and therefore on how to checkpoint and restart processes, often executed in parallel and running for hours or days at a time. Although there is still room for improvement, years of research and innovation have crafted efficient and reliable mechanisms in this area.
On the other hand, as we come close to a world dominated by sensors, wearable devices, IoT, etc., all of them contributors to a Big Data ecosystem, the amount of data generated and handled by many applications is enormous. As a consequence of its volume, movements of data are very costly. Fault-tolerant mechanisms specially designed to deal with these requirements have been developed. Among those, Spark, with its in-memory Resilient Distributed Datasets (RDDs), has had a major impact.
A Spark application consists of a driver program which executes various parallel operations on a cluster of nodes. The RDD is a collection of elements partitioned across the nodes of the cluster, which can be operated on in parallel. A scheduling component of Spark (the task scheduler) divides tasks into stages which can be executed by the available resources, taking into account the needs of other users of the cluster. RDDs can only be created through deterministic operations called "transformations" on either data in stable storage or other RDDs. These transformations (e.g. map, filter and join) apply the same operation to many data items. This allows RDDs to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication. Despite being limited to coarse-grained transformations, RDDs are a good fit for many parallel applications, where the same operations are applied to multiple data items.
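By way of illustration only, the following minimal Scala sketch captures the idea of lineage-based recovery described above (it is not Spark's internal code; the Lineage type and its recompute logic are invented for this example): each derived dataset records the transformation that produced it, so a lost result is rebuilt by replaying the logged chain rather than by restoring a replica.

    // Each derived dataset logs its parent and the transformation applied,
    // so it can be recomputed on demand instead of being replicated.
    case class Lineage[A, B](parent: () => Seq[A], transform: A => B) {
      def recompute(): Seq[B] = parent().map(transform) // replay the chain
    }

    object LineageDemo extends App {
      val input = Seq(1, 2, 3, 4)
      val step1 = Lineage(() => input, (x: Int) => x * 10)            // transformation #1
      val step2 = Lineage(() => step1.recompute(), (x: Int) => x + 1) // transformation #2
      // If step2's result is lost, it is rebuilt by replaying the chain:
      println(step2.recompute()) // List(11, 21, 31, 41)
    }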
Figure 1 shows a typical checkpointing mechanism used to save the state of an in-memory object or data structure (henceforth referred to simply as an object). The basic arrangement is that, to execute an application, an input dataset is processed in a sequence of operations or "transformations" in accordance with code of the application, producing an output dataset as the final result. The transformations together form a "lineage computation chain".
As shown in Figure 1, the input dataset for use in executing an application is read from a file store 11 such as a Parallel File System (PFS) and, as part of the application, a first transformation (Transformation #1 or t1) is applied to produce a first object 31 (Object #1) which exists in a volatile memory 21 (part of a computation node of the computer system on which the application is executed). Next, a transformation t2 is applied which changes the object to a new or modified object 32 (Object #2) in volatile memory 22, which memory may belong to a different computation node and thus may be different from memory 21. The process is repeated by applying subsequent transformations t3 and t4. The final object 34 (Object #4), which is created in volatile memory 23, is written to a file store 12 (which may of course be the same as file store 11) to produce the output dataset as the final result of executing the application.
The checkpointing comes in upon creation of object 33 (Object #3) following transformation t3. As indicated by the downward arrow, the object is not merely retained in a computation node's volatile memory but instead stored in a file store 15 (which may be the same as file store 11 or 12). Then, even if there is a fault, for example loss of data from memory 23, the stored object can be retrieved from the file store and processing can resume starting with transformation t4, without having to start all over again with t1. As noted in Figure 1, each checkpoint increases safety in terms of the ability to recover from faults, but at the cost of causing bottlenecks due to relatively slow disk writing times.
Spark, and other in-memory approaches, can reduce such bottlenecks thanks to the above-mentioned technique of re-computation of objects based on their lineage computation chain. In this way, as an alternative to writing objects to disk for later retrieval, if a fault happens lost objects can be re-computed by applying the same operations as were recorded in a log. However, when long lineage computation chains are required to produce a certain object, it is worth keeping a copy of the object itself in order to avoid long computation times. This approach is followed by Spark, which increases the safety of data, but incurs bottlenecks due to the writes to disk.
In order to eliminate these bottlenecks, diskless approaches have been proposed. Instead of checkpointing to stable storage, these approaches rely on memory and processor redundancy. Although this technique is faster and allows faster recoveries, it does not scale well to large numbers of processors, and is less secure than disk-based approaches.
This issue is addressed by hybrid mechanisms, which make the checkpoints over different levels, combining the speed of volatile storage with the security of non-volatile, stable storage. These approaches generally follow a dual local-global approach, making local and global copies in a distributed environment. However, these techniques cause three bottlenecks: (1) copies in distributed nodes are written into non-volatile, slower storage; (2) communicating checkpoints through the network is costly; and (3) a global server or storage may be saturated by many simultaneous local nodes transferring copies.
In order to solve the last bottleneck, staging nodes are often used as an intermediate layer to coordinate the writes to the distributed, parallel file system (PFS) used to back up the checkpoints. Generally, these techniques use Remote Direct Memory Access (RDMA) to speed up communications between client and server nodes.
Figure 2 shows the general principle of RDMA communication (not specifically in the context of checkpointing), in which data is transferred from a memory 24 in one node to a memory 25 in another node without involving a Central Processing Unit (CPU) 26 or 27 of either node, or caches or context switches. RDMA instead relies on respective network adapters 51 and 52 of the nodes to read and write data directly. Effectively, the network adapter 52 in the second node pulls the data from the memory 24 over a network 60 with the assistance of network adapter 51, and places the data directly in memory 25 without the need for caching, and without the respective CPUs even being aware that this has occurred. This reduces latency and increases the speed of data transfer, which is obviously beneficial in high performance computing.
Consequently, references in this specification to data being transferred from one computer or node to another should be understood to mean that the respective network adapters (or equivalent) transfer data, without necessarily involving the CPU of each computer or node.
The current amount of data that is being handled in many fields, from science to finance, together with its importance and significance, makes the implementation of secure and fast fault recovery systems absolutely crucial. In particular, current Big Data technologies, which help us to deal with the plethora of data we are generating, have to consider the implementation of these fault recovery systems.
Although the above-mentioned approaches improve the performance of checkpointing processes, they do not contemplate the use of volatile storage in staging nodes as a way of speeding up the recovery process; neither do they offer enough flexibility for using different approaches to checkpointing, such as lineage-based recovery; nor do they include mechanisms to make required checkpoints automatically when necessary. The present invention has been devised to address these problems.
According to a first aspect of the present invention, there is provided a method of checkpointing a data object in a computer system having a computation node, a staging node and access to a file store, the method comprising:
duplicating, in a memory of the staging node, an object in a memory of the computation node;
copying the object from the memory of the computation node to the file store; and retaining the object in the memory of the staging node after copying the object to the file store.
In the above, "data object" or "object" denotes a data structure created or modified by the computation node, usually as part of an application being executed by the computer system. "Checkpointing" refers to the process of saving data objects during the course of computations performed by the computer system, forming a checkpoint from which computations can be restarted in the event of a failure. The term "memory" denotes some form of fast storage, typically but not necessarily exclusively a solid-state memory such as random-access memory (RAM). The memory will usually be volatile memory (and may be part of the computation node or staging node itself, or an assigned area of a global memory in the computer system). The "file store" denotes a non-volatile memory such as a set of hard disks (and may be remote from the computer system itself, and/or distributed). It is referred to elsewhere also as "disk" or "Parallel File System (PFS)".
In contrast to previously-proposed approaches, the present invention uses the memory of a staging node not only as temporary storage merely for transferring objects to the file store, but also as a fast-accessible checkpoint in its own right, thus exploiting the speed advantage of memory (e.g. RAM) over disk accesses. This apparently small difference leads to significant changes in how the checkpointing process is handled.
The "duplicating" referred to above may comprise copying the object from the memory of the computation node to the memory of the staging node. Preferably, such copying is performed using Remote Direct Memory Access (RDMA).
Alternatively, duplicating the object is performed by updating the object already present in the staging node. This is done by, in the staging node, applying one or more transformations to the object retained in the memory of the staging node to replicate changes made to the object in the computation node; in other words the lineage computation chain is applied instead of copying the whole object over again.
In view of these alternative possibilities for duplicating the object, there is preferably added a step of, prior to duplicating the object, selecting whether to duplicate the object by copying it from the computation node or by updating the object in the staging node.
Such selecting (which can be decided in the staging node itself without the need for manual intervention) may be performed by calculating whether the staging node applying said one or more transformations to the object is quicker than reading the object from the memory of the computation node.
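A minimal Scala sketch of this selection is given below. The cost comparison (estimated lineage-replay time versus estimated transfer time) is an assumption made for illustration; the text states only that the staging node calculates which option is quicker.

    object CheckpointStrategy {
      sealed trait Choice
      case object CopyObject   extends Choice // read the whole object again
      case object ApplyLineage extends Choice // update the retained copy

      // Decide whether replaying the lineage on the staging node beats
      // re-reading the whole object from the computation node's memory.
      def choose(objectSizeBytes: Long,
                 linkBytesPerSecond: Long,
                 lineageReplaySeconds: Double): Choice = {
        val transferSeconds = objectSizeBytes.toDouble / linkBytesPerSecond
        if (lineageReplaySeconds < transferSeconds) ApplyLineage else CopyObject
      }
    }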
In any case, checkpointing needs to be triggered in some way. Preferably, checkpointing is performed on a per-computation-node basis and triggered by the computation node without the need for manual intervention. Thus, the method preferably includes an initial step of the computation node sending object attributes to the staging node. Receipt of object attributes can be taken as an implicit request to checkpoint the object (or alternatively may be accompanied by an explicit checkpointing request).
Automatic checkpointing in the above manner can be customised by a user.
Accordingly, the method may further comprise the user setting conditions under which the computation node decides to checkpoint an object, including any one or more of: computation time of the object, priority of the object, and usage frequency of the object.
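As an illustration only, such user-set conditions might be expressed as thresholds. The names, the default values, and the choice of combining the conditions with a logical OR (any one satisfied condition triggers a checkpoint) are all assumptions; the text leaves the exact combination to the implementation.

    // User-customisable trigger thresholds; names and defaults are invented.
    case class CheckpointThresholds(minComputationSeconds: Double = 60.0,
                                    minPriority: Int = 5,
                                    minUsageFrequency: Long = 100L)

    object CheckpointTrigger {
      // An object qualifies for checkpointing when any one condition is met.
      def shouldCheckpoint(computationSeconds: Double,
                           priority: Int,
                           usageFrequency: Long,
                           t: CheckpointThresholds): Boolean =
        computationSeconds >= t.minComputationSeconds ||
          priority >= t.minPriority ||
          usageFrequency >= t.minUsageFrequency
    }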
Preferably, the method further comprises the staging node receiving the object attributes from the computation node and, based on the object attributes, selecting whether to copy the object from the computation node or to update the object in the staging node.
Although only one computation node and one staging node were mentioned above, in practice the computer system will have many computation nodes and a plurality of staging nodes. For reasons of economy, the number of staging nodes may be much smaller than the number of computation nodes, leading to a potential problem of memory capacity in the staging node. For this reason, preferably, the method further includes the staging node judging, prior to the duplicating, whether sufficient space for the object exists in the memory of the staging node and, if not, creating space in the memory.
This can be done by releasing one or more objects previously duplicated in the memory of the staging node but which have already been transferred to the file store.
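A sketch of this release step is given below. The factors combined (priority, usage frequency, computation time) are those named later in the description; the scoring formula and its weights are invented for illustration.

    final case class StagedObject(id: String,
                                  sizeBytes: Long,
                                  priority: Int,
                                  usageFrequency: Long,
                                  computationSeconds: Double,
                                  copiedToPfs: Boolean)

    object StagingMemory {
      // Higher score = more worth keeping in memory. The weighting is an
      // assumption; the text only names the factors that are combined.
      private def keepScore(s: StagedObject): Double =
        s.priority * 10.0 + s.usageFrequency + s.computationSeconds

      // Release already-flushed objects, lowest keep-score first, until at
      // least `neededBytes` of staging memory would be freed.
      def planEviction(staged: List[StagedObject],
                       neededBytes: Long): List[StagedObject] = {
        val candidates  = staged.filter(_.copiedToPfs).sortBy(keepScore)
        val freedBefore = candidates.scanLeft(0L)(_ + _.sizeBytes)
        candidates.zip(freedBefore)
          .takeWhile { case (_, freed) => freed < neededBytes }
          .map(_._1)
      }
    }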
The copying from the computation node to the staging node can be carried out synchronously or asynchronously. Synchronous copying is preferable since the copying takes place at the time of checkpointing, and must be completed quickly since computation in the computation node is interrupted until the copying has finished. Likewise, copying of the object from the staging node to the file store can be performed either synchronously or asynchronously. Asynchronous copying is preferable in this case, as writing to the file store is relatively slow, but the computation node can continue with its computation without interruption.
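The asynchronous staging-node-to-file-store copy might be sketched as follows; the path and the use of a global execution context are placeholders, not details from the text.

    import java.nio.file.{Files, Paths}
    import scala.concurrent.{ExecutionContext, Future}

    object AsynchronousFlush {
      implicit val ec: ExecutionContext = ExecutionContext.global

      // Write a checkpointed object to the file store in the background, so
      // the computation node is never blocked waiting on disk I/O.
      def flushToFileStore(objectId: String, bytes: Array[Byte]): Future[Unit] =
        Future {
          Files.write(Paths.get("/pfs/checkpoints", objectId), bytes) // assumed mount point
          ()
        }
    }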
Holding objects in staging nodes and the file store is, of course, only one stage of checkpointing. Another is to make use of checkpointed objects to recover from a fault, or to re-start after a planned shutdown. Accordingly, the method may further comprise, upon occurrence of a fault in the computation node, restoring the object from the memory of the staging node to the memory of the computation node. If the object is not still retained in the staging node (for example due to being released to make space for another object), the object can be retrieved from the file store.
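The recovery preference just described reduces to a simple lookup order, sketched here with placeholder types:

    object Recovery {
      // Prefer the staging node's fast in-memory copy; fall back to the
      // file store if the object was released from staging memory.
      def recover(objectId: String,
                  stagingMemory: Map[String, Array[Byte]],
                  readFromFileStore: String => Array[Byte]): Array[Byte] =
        stagingMemory.getOrElse(objectId, readFromFileStore(objectId))
    }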
According to a second aspect of the present invention, there is provided a computer system comprising:
a plurality of computation nodes for processing objects; a plurality of staging nodes, each staging node assigned to one or more of the computation nodes;
a network for exchanging data including objects between the computation nodes and the staging nodes and accessing a file store; wherein a said staging node is arranged to:
duplicate in a memory of the staging node an object which exists in a memory of a computation node to which the staging node is assigned;
copy the object from the memory of the computation node to the file store;
and retain the object in the memory of the staging node after copying the object to the file store.
The above system may, of course, provide any of the features mentioned above with respect to the method of the invention.
According to a third aspect of the present invention, there is provided a computer program containing computer-readable instructions which, when executed by a computation node and/or a staging node in a computer system, perform any method as defined above.
Such a computer program may be provided in the form of one or more non-transitory computer-readable recording media, or as a downloadable signal, or in any other form.
Thus, embodiments of the present invention provide an automatic and customisable RDMA-based checkpointing mechanism for in-memory data structures (objects) that combines memory and disk, whilst reducing the load of the computation nodes with the use of staging nodes. The proposed mechanism is able, under certain customisable conditions, to trigger the process to make a checkpoint of a certain in-memory data structure. Moreover, embodiments of the invention keep in-memory checkpoints in the staging nodes to allow faster recovery when faults happen, whilst checkpoints are also distributed over a parallel file system (PFS) to increase data safety. This hybrid approach combines memory and disk checkpointing, while communicating data through RDMA connections to reduce bottlenecks. Finally, embodiments of the invention are able to decide when it is not worth copying all the data to make a new checkpoint, and instead apply an incremental checkpointing based on the lineage computation chain of the in-memory data structure, hence reducing communication between nodes.
Embodiments of the present invention can address the issues identified in the above section as follows.
(a) Bottlenecks occur when checkpointing data from memory to disk. In embodiments, data is not copied from memory to disk, but from memory to memory through an RDMA mechanism.

(b) Checkpointing copies in distributed nodes are written into non-volatile, slower storage. In embodiments, staging nodes are used not only as an intermediate layer to the file store (PFS), in which checkpoints are stored in non-volatile storage, but also as nodes whose memories (fast, usually volatile, storage) are used as an in-memory checkpointing area.
(c) Communicating checkpoints through the network is costly. In embodiments, communications over the network are done by using RDMA, which is a faster way of communicating between nodes, since transfers go from memory to memory directly, without involving the operating system or the CPU. Moreover, embodiments of the invention can reduce the amount of communication required, by only transferring the lineage of an object to update the corresponding checkpoint.
(d) A global server or storage in a computer system may be saturated by many simultaneous local nodes transferring copies. Embodiments of the invention make use of staging nodes to coordinate the copies from the computation nodes to the PFS, while using the same staging nodes as in-memory checkpointing areas.

(e) Previous checkpointing techniques do not contemplate the use of volatile storage in the staging nodes as a way of speeding up the recovery process. As mentioned above, the present invention employs staging nodes in a novel manner by retaining objects in memory even after transfer to the file store, to support a hybrid checkpointing mechanism.
(f) Previous checkpointing techniques do not offer enough flexibility for using different approaches to checkpointing, such as lineage-based recovery. Embodiments of the invention decide when it is not worth copying all the data to make a new checkpoint, and instead apply an incremental checkpointing based on the lineage computation chain of the in-memory data structure (object), hence reducing communication between nodes.

(g) Previous checkpointing techniques do not include mechanisms to make required checkpoints automatically when necessary. Embodiments of the invention can, under certain conditions, automatically trigger the process to make a checkpoint of a certain in-memory data structure. Moreover, those conditions are customisable by the user, so it is possible to change the behaviour of the described automatic mechanism.
Brief Description of the Drawings
Reference is made, by way of example only, to the accompanying drawings in which:
Figure 1 shows a conventional checkpointing mechanism;
Figure 2 shows conventional Remote Direct Memory Access (RDMA);
Figure 3 compares the conventional checkpointing mechanism of Figure 1 with a checkpointing mechanism employed in embodiments of the present invention;
Figures 4(a) to (c) compare (a) a conventional checkpointing mechanism without RDMA; (b) a proposed checkpointing mechanism with RDMA and (c) a checkpointing mechanism as used by embodiments of the present invention;
Figure 5 shows a multi-level system architecture used in an embodiment of the present invention;
Figure 6 shows a sequence of steps in checkpointing an object by copying the object;
Figure 7 shows a sequence of steps in lineage-based updating of an object;
Figure 8 shows a process workflow performed by a computation node;
Figure 9 shows a process workflow performed by a staging node;
Figure 10 compares manual and automatic checkpointing;
Figure 11 compares a conventional Spark-based checkpointing mechanism with checkpointing proposed in the present invention; and
Figure 12 shows a hardware configuration of a computer system capable of being applied to a computation node and/or a staging node in an embodiment of the present invention.
Before describing embodiments of the present invention it may be helpful to list some technical terms to be used in the subsequent description, as follows.
HPC: High Performance Computing.
RDMA: Remote Direct Memory Access. This allows data to be transferred from one memory in one node to the memory in another node without involving a CPU, caches or context switches, because data is copied by the nodes' network adapters.
Node: A hardware resource in a computer system including at least a processor, or possibly a complete computer which is part of a networked computer system. The node may have its own local memory, or may have an assigned area in a global memory of the computer system.
Computation node: Node in which the computation is performed.
Staging node: An intermediate node that facilitates communications between the computation nodes and the file system used for backup, or checkpointing storage.
Checkpointing: A technique to add fault tolerance into computing systems, which consists of saving a snapshot of the application's state. In data-driven applications, a checkpoint can be made by only saving the state of a certain data structure (object) at a certain moment in time.
Hybrid checkpointing: The combination of non-volatile (such as HDD) and volatile storage (such as RAM) to provide the checkpointing mechanism.
Object: A data structure which is used in, or is generated by, an application to be checkpointed.
Lineage computation chain: A technique by which, when a fault happens, lost objects can be re-computed by applying the same chain of operations as were recorded in a log.
PFS: Parallel File System: a file store for long-term storage of objects and other data, programs etc. Typically disk-based, and may be remote from the computer system itself (and possibly in the "cloud").
User: A human user, who may influence the decision to checkpoint an object, either by manually coding checkpoints during development of the application, or by adjusting parameters for allowing the system to perform checkpointing automatically.
Embodiments of the present invention provide an automatic mechanism to perform hybrid (using both memory and disk storage) checkpoints of in-memory data structures (hereinafter called objects). Communications of these objects are done by using RDMA, in a manner similar to that shown in Figure 2, which differs from the approach followed by other alternatives, such as Spark, because checkpoints are made from memory to memory, instead of copying the object from memory to a slower, non-volatile storage. Following this approach, embodiments of the invention eliminate the bottlenecks involved in the writes to disk, and hence improve the performance of the checkpointing mechanism. As will be explained later, writes to the non-volatile storage are delayed and performed in the staging nodes, with no impact on the performance of the computation nodes.
Figure 3 shows a comparison between the typical checkpointing mechanism and the proposed RDMA-based technique to eliminate the bottleneck in the writes to disk of in-memory checkpointing mechanisms. The upper half of the Figure depicts the conventional checkpointing mechanism as already described with respect to Figure 1.
The lower part of Figure 3 depicts the approach taken in the present invention. As before, an input dataset 11 includes data structures which are subject to successive transformations to create objects 31, 32 and 33 etc. The difference is at the stage where object 33 has been created and is to be checkpointed. Instead of directly writing this object to file store 15 as in the conventional approach, RDMA is used to pull the object from a memory 24 of the node which created the object to a memory 25 in a staging node (described later in more detail). From there, the object is transferred asynchronously to the file store 15. The result is to increase speed and eliminate the bottleneck, at the cost of the additional hardware (in particular, memory) needed for the staging node. Although such use of staging nodes is known per se as noted above, there is more to their use in the present invention, as will become apparent.
Figure 4 indicates how the use of staging nodes differs between the present invention and previously-proposed approaches. Figure 4(a) shows the Spark-based approach, Figure 4(b) uses staging nodes in a previously-proposed way, and Figure 4(c) shows the novel approach taken in the present invention. In each case, a network 60 or 600 connects computation nodes 20, 20A, 20B or 200 with each other and with staging nodes 70 or 700, e.g. as part of a large-scale computer system.
Spark's approach is shown in Figure 4(a). Derived by processing data from a local disk 14 in a CPU 26, an in-memory object within memory 21 of computation node 20A is transferred to another computation node 20B having a CPU 27 and memory 22 without using RDMA, to be distributed, replicated and stored (usually using HDFS: Hadoop Distributed File System). As writes to disk (PFS 16) are done in the computation nodes, a certain loss of performance is produced. It is not possible to use RDMA communications with Spark, because it is not designed to do that: the semantics of Spark's commands do not include any provision for RDMA communications.
Figure 4(b) shows a different approach, which has the aim of eliminating disk bottlenecks in the computation nodes by introducing staging nodes 70, as well as improving the communication via RDMA connections (as denoted by the curved arrows bypassing CPUs 26 and 72). That is, an object can be directly transferred by RDMA from memory 21 of computation node 20 to memory 74 of staging node 70 having CPU 72, and from there asynchronously transferred to file store 15.
Although this approach effectively improves the performance, it should be noted that in the known approach the memory 74 in the staging node 70 is only used as a buffer to write to the PFS 15: in other words the object is released (lost from memory 74) once the transfer is completed.
By contrast, as indicated in Figure 4(c), the present invention improves on previous approaches by retaining objects in the staging node memory 740 (volatile storage) as part of the checkpointing mechanism. As a result, recovery from a checkpoint is potentially faster, since the required checkpoint may already be in memory in the staging node. That is, in Figure 4(c), RDMA is again used to transfer the object from computation node memory 210 via network 600 to staging node memory 740, but in this case, instead of the memory 740 merely acting as a buffer for writes to the PFS 150, the staging node memory 740 becomes an in-memory checkpointing area.
Whilst Figure 4 shows only individual computation nodes and staging nodes, of course in practice a computer system may have very many computation nodes and many staging nodes. Figure 5 shows, as a simplified example, a multi-level system architecture proposed in embodiments of the invention. Computation nodes 200 are the nodes in which the computation tasks are performed, and the in-memory objects are generated. The nodes may be regarded as forming a "cluster" in the sense used by Spark, for example. Computation tasks refer to both the above-mentioned transformations of in-memory objects, and other computer instructions that build the logic of a program being executed in the computation nodes. Not every computer instruction in a program is involved in a transformation of in-memory objects.
Each node may be regarded as having (at least notionally) its own memory 210 and CPU 260, and access to a local file system 140. Some hardware resources may be shared among computation nodes if desired. The computation nodes are mutually connected by a network 500, for example a high-speed local-area network (LAN) or possibly a wide-area network (WAN) including the Internet.
For every n computation nodes 200, there is a staging node 700, each comprising a CPU 720 and local memory 740 and having access to the PFS 150 for checkpointing purposes. Each staging node 700 is responsible for receiving the checkpoints from the computation nodes that it has assigned, keeping those checkpoints in memory and transferring them to the PFS 150. In contrast to previously-proposed approaches, it is important to note that the data is retained in the memory 740 of the staging node 700, and not merely stored for RDMA purposes. Thus, a difference over previous proposals is how memory is being used, to exploit its speed advantage over disk accesses.
Incidentally, it is assumed that each object is handled by one computation node only, so that each one of them could apply a different transformation to the object. However, each computation node can handle different objects, even concurrently, since current multi-core architectures allow computer instructions to be processed in parallel. Also, the same node may perform different transformations upon the same object at different stages of execution.
As will be understood, the 3:1 relationship shown in Figure 5 is a simplification. The actual ratio (which may be much larger, e.g. 100:1 or more) should be determined for each implementation of this invention, depending on the technologies used, as well as other factors. Thus, the ratio "n" of computation nodes per staging node is not fixed, and can be changed when the architecture is implemented. Depending on the characteristics of each particular computation process, the ratio could be changed with the aim of making the checkpointing process more efficient.
The hardware resources available may differ between computation nodes 200 and staging nodes 700. Generally, computation nodes should have more resources, while the priority for the staging nodes should be memory capabilities, i.e. memory size, low latency, etc.
The staging nodes 700 are preferably located as close as possible to the computation nodes 200 for speed of access. In some implementations, the computation nodes and staging nodes may all be processors, CPUs, cores or system boards of the same kind, some being assigned as computation nodes and some assigned as staging nodes. Such an arrangement allows the number of staging nodes, relative to that of the computation nodes, to be varied as required. Although not preferable from a speed viewpoint, under some circumstances it may be possible for the same hardware to provide both a computation node and a staging node.
The checkpointing mechanism proposed by embodiments of the invention implements RDMA clients 270 in each computation node 200, and RDMA servers 750 in the corresponding staging nodes 700. Figure 6 describes the process followed by the proposed mechanism to achieve the checkpointing of a single object.
When a checkpointing action is initiated (explained later with reference to Figure 10), the object being checkpointed is read by the RDMA client 270 (Step 1) in the computation node 200, which sends the object's attributes to the RDMA server 750 in the staging node (Step 2).
With this information, the RDMA server 750 located in the staging node can check whether there is enough memory for the new object being checkpointed (Step 3); if not, the server 750 moves as many objects as necessary to the PFS 150 (Step 4), freeing enough memory for the new object. As will be understood, the objects being moved from memory to disk are the result of earlier checkpoints, since the staging node does not delete them from memory until necessary. The actual selection of objects being moved is based on a combination of factors, including the priority of the object, how often the object has been used, and the time required for computing that object. This calculation tries to maximize the likelihood of having a certain object in memory if a fault involving that object occurs.
Once the required memory space is available, the RDMA server 750 sends an RDMA read request (Step 5) to the RDMA client 270, which responds at Step 6 with the object to be checkpointed. This object is then checkpointed in memory (Step 7), and the server 750 sends an acknowledgment to the client 270. Finally, objects in memory 740 are asynchronously copied to the PFS 150, as a way of increasing the security of the checkpointing mechanism in case of a failure of the staging node.
Note that if several computation nodes corresponding to a staging node initiate the checkpointing process concurrently, their requests are served following a FIFO (First In First Out) policy: the first node to request the checkpoint is the first to be heard.
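A FIFO service order of this kind maps naturally onto a blocking queue, as in the following sketch (the request type and node identifiers are invented for illustration):

    import java.util.concurrent.LinkedBlockingQueue

    object RequestQueue {
      final case class CheckpointRequest(computationNodeId: Int, objectId: String)

      // LinkedBlockingQueue hands out elements strictly in arrival order.
      private val queue = new LinkedBlockingQueue[CheckpointRequest]()

      def submit(request: CheckpointRequest): Unit = queue.put(request) // computation-node side
      def nextToServe(): CheckpointRequest = queue.take()               // staging-node side
    }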
Regarding Step 2 above, each object has the following attributes that help in the checkpointing process (gathered into a sketch after this list):

• an ID: a unique identifier for each object;
• the size of the object in memory;
• the computation time required to reach the current state of the object in memory (since the last checkpoint);
• priority: a user-defined attribute set by the user to make the object less likely to be removed from the in-memory checkpointing storage;
• usage frequency: how often the object has been used since it was created, where "used" means that some or all of the values encoded in the object are read for the program's purposes. It should be noted that such reads are distinct from transformations, which involve a modification of the object, and hence a write to it;
• lineage computation chain: of more relevance to Figure 7; when a fault happens, lost objects can be re-computed by applying the same chain of operations as were recorded in a log. In the event that the object is not retained in the staging node, the most recent version available can be retrieved from the file store.
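Collected into a single record, the attribute set sent in Step 2 might look as follows; the field types are assumptions, while the fields themselves are those listed above:

    final case class ObjectAttributes(
      id: String,                 // unique identifier for each object
      sizeBytes: Long,            // size of the object in memory
      computationSeconds: Double, // time to reach the current state since the last checkpoint
      priority: Int,              // user-defined; lowers the chance of eviction
      usageFrequency: Long,       // reads of the object since it was created
      lineage: List[String]       // logged transformations, for the Figure 7 update path
    )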
Figure 7 describes an alternative approach to the regular checkpointing process described in Figure 6. Figure 6 may be regarded as a basic embodiment of the invention, while Figure 7 describes another embodiment wherein the checkpointing process can also be done by applying lineage.
The process is very similar to that of Figure 6, with the difference that it allows updating an already checkpointed object in memory by applying its lineage computation chain (in other words, the sequence of transformations performed on the object since the last time it was checkpointed). Thus, Step 1 includes the RDMA client 270 reading the lineage in addition to the other attributes shown in Figure 6. The choice between creating a new checkpoint or updating a previous one is made by using the computation time required to create the object (i.e., the time required to create the object in its current state in the computation node). However, this decision is made by the staging node, which uses the attributes sent by the client about the object, evaluating whether re-applying the computation chain to the old object is more efficient than sending the whole object again. Therefore, the client 270 needs to send this computation time as part of the object attributes (Step 2) to the server 750. Steps 3 to 5 are the same as for Figure 6. If the server decides to update the checkpoint, only the lineage is sent to the server (Step 6), which is able to apply the computation chain that led to the object being checkpointed (Step 7), and therefore avoid the communication costs of transferring the actual object.
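Replaying the lineage on the retained copy reduces to a fold over the logged transformations, as in this sketch (modelling a transformation as a pure function is an assumption):

    object LineageUpdate {
      // Apply the logged chain of transformations, oldest first, to the
      // checkpoint retained in staging-node memory.
      def applyLineage[A](retainedCheckpoint: A, lineage: List[A => A]): A =
        lineage.foldLeft(retainedCheckpoint)((obj, transform) => transform(obj))
    }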
The staging node 700 (in particular the RDMA server 750) is responsible for applying the computation chain. The computation of the transformation being applied to an object requires some computational resources, which would not be available to the node while the transformation is processed. This could affect the performance of the checkpointing process, if the staging node needs to apply many of these transformations, if these transformations are computationally intense, or if there is a flood of objects to be saved in the staging node. The degree to which the performance will be harmed depends on the hardware of the staging node. The mechanism by which the server decides to checkpoint the object following this approach is described later with reference to Figure 9.
To explain the processes of Figures 6 and 7 in more detail, Figure 8 describes the workflow of the client 270 in the computation node 200. The process begins in step S100. In step S102, the user sets any parameters which he or she wishes to be taken into account when the computation node determines whether an object needs to be checkpointed. These parameters can include threshold values for computation time, priority of the object, and/or usage frequency of the object. It will be understood that S102 can be performed in advance of any computation, e.g. during development of the application. Any parameters not set by the user will remain at a default value. In S104 and S106, the client 270 checks whether the conditions for checkpointing an object are met (this can be checked periodically). If not (S106, No), the process waits until the next check. However, once the conditions are satisfied (S106, Yes), the flow proceeds to S108, in which the client 270 sends attributes of the object to the staging node. In S110, in response to the object attributes, the client 270 receives an RDMA read request from the server 750 in the staging node. As previously described, depending on the type of the request (as judged at S112), the client 270 sends the object to be checkpointed (S114), or only the object's lineage (S116). When the client 270 receives the acknowledgment from the server 750 (S118), its process ends (S120).
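The branch structure of this client workflow can be condensed as follows; the callback signatures stand in for the RDMA client's messaging and are invented for illustration:

    object ClientWorkflow {
      sealed trait ReadRequest
      case object WantObject  extends ReadRequest // server asks for the full object (S114)
      case object WantLineage extends ReadRequest // server asks for the lineage only (S116)

      // One pass of the loop: if the conditions are met (S104/S106), send the
      // attributes (S108) and answer the server's read request (S110 to S116).
      // The acknowledgment (S118) is left to the caller.
      def clientStep(conditionsMet: Boolean,
                     sendAttributes: () => ReadRequest,
                     sendObject: () => Unit,
                     sendLineage: () => Unit): Unit =
        if (conditionsMet) {
          sendAttributes() match {
            case WantObject  => sendObject()
            case WantLineage => sendLineage()
          }
        }
    }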
Figure 9 describes the workflow of the server 750 in the staging node 700. The process begins at S200. As a server, it begins listening to the client (S202), until in step S204 the client 270 sends the attributes of the object to be checkpointed, which implies a checkpointing request for that object. Given its ID (S206), the server checks (S208) if the object has already been checkpointed, and if not (S208, No), the server checks (S212) if there is enough memory for the new object.
If a previous state of the same object was checkpointed (S208, Yes), the system decides in step S210 whether it is worth creating a new checkpoint, discarding the previous, old version of the object. In order to come to a decision, the server uses the computation time required to create the object, and evaluates whether re-applying the computation chain to the old object is more efficient than sending the whole object again, such that the cost of communicating the object is not worthwhile. If it is determined (S210, "Update checkpoint") to perform an update, the difference in size between the old and the new object is calculated in step S216, and then used in S212 to check if there is enough memory for the updated version of the object. On the other hand, if it is determined to make a new checkpoint (S210, "New checkpoint"), then the server discards the previously-stored object in S214 and the flow again proceeds to S212.
The checking applied by the server in S212, as shown in more detail by steps S2120, S2122, S2124 and S2126, involves the selection of objects to be discarded and moved to the PFS 150 in the case that there is not enough memory for the new checkpointed object. This selection is based on a combination of factors, including the priority of the object, how often the object has been used, the age of the object and the time required for computing that object. Here, "priority" is a user-assignable variable allowing the user to adjust the selection of objects relative to one another. This calculation tries to maximize the likelihood of having a certain object in memory if a fault affecting that object occurs.
After the server has freed some memory and ensured that there is enough space for the new object or update, it sends an RDMA read request (S218) to the client 270. This requests the client 270 in the computation node 200 to send either the object itself or the lineage computation chain that leads to the current state of the new object (S218, "update path" in Figure 9). If the object itself was requested, the server receives the object (S220) and places it in memory (S222). If the lineage was requested, the server receives the lineage in S224 and applies the chain of logged changes to the already checkpointed object in S226.
Finally, once the server has completed the in-memory checkpointing process either by S222 or S226, it sends an acknowledgment message to the client 270 to confirm the end of the process. A further step (not shown) is to asynchronously write the checkpointed object to the file store 150.
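The server's choice between the two read-request types can be condensed into the following sketch; the boolean inputs summarize the S208 check and the S210 cost evaluation:

    object ServerDecision {
      sealed trait Action
      case object RequestFullObject extends Action // "new checkpoint" path (S214 onwards)
      case object RequestLineage    extends Action // "update checkpoint" path (S216 onwards)

      // An update is only possible if a previous version of the object is held,
      // and is only chosen when replaying the lineage is cheaper than
      // transferring the whole object again.
      def onAttributes(previouslyCheckpointed: Boolean,
                       updateIsCheaper: Boolean): Action =
        if (previouslyCheckpointed && updateIsCheaper) RequestLineage
        else RequestFullObject
    }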
The checkpointing process above is started either manually or automatically. Figure 10 shows the differences between these two options. Manual checkpointing (depicted in the upper part of the Figure) is triggered by the user, who decides when and what object is checkpointed. As already mentioned, this decision can be made during the development stage of the application. At development time, the user can have a grasp of when the checkpointing should be applied, based on his or her experience both as a programmer and with a particular program: there is no need for the user to monitor the progress of executing the application. When the executed program reaches the stage decided in advance for checkpointing, a checkpoint is triggered. Thus, an object 350 in memory 240 of a computation node may be modified by a transformation t(n) to form an object 360, triggering checkpointing in accordance with an earlier user decision. Checkpointing follows to place the object in memory 740 of staging node 700, upon which processing can resume, with object 360 further transformed by t(n+1) to an object 370 in computation node memory 250. Meanwhile, the staging node can perform asynchronous writing to disk 150.
On the other hand, as shown in the lower part of the Figure, checkpointing may also be started automatically, as long as certain conditions are satisfied. These conditions can also be customised by the user via the above-mentioned parameter setting; the user can thereby change how the checkpointing mechanism behaves depending on the application. These conditions consist of thresholds for different parameters, such as computation time, priority, and usage frequency. If an object has taken a long time to be computed, it may be worth checkpointing it to avoid its re-computation. In a similar fashion, if the object is heavily used, making a checkpoint may be worthwhile; and the same can be said of high-priority objects. Thus, an object 350 in computation node memory 240 becomes transformed by t(n) to object 360. It is then checked whether the conditions for checkpointing are fulfilled: if not, the object is simply maintained in memory 240, but if a need for checkpointing is determined based on the conditions, checkpointing to staging node memory 740 and subsequent backup to the PFS 150 is carried out.
Each staging node 700 is responsible for deciding when the automatic checkpointing is applied. There is a trade-off between performance and safety, expressed as how often checkpointing occurs. If a checkpoint were made after each single instruction, the performance of the program would decrease drastically. The fewer checkpoints are made, the greater the risk of losing some objects, but they can be recovered by doing the computation again from the dependent object that was last checkpointed. Of course, if this happens, performance is also affected, but this is compensated for because faults do not occur very often.
To implement the above, an embodiment could be applied as an enhancement of Spark, replacing its mechanism of checkpointing by the one proposed in embodiments of the invention, as shown in Figure 11.
Figure 11 is conceptually similar to Figure 3, with file stores 11, 12, 15 and 110, 120, 150 as before, as well as computation nodes having memories 21, 22, 23 and 210, 220, 230 and 240. A difference now is that each object 31 to 34 or 310 to 340 is an RDD as provided in Spark. As already mentioned, Spark keeps data structures in memory called RDDs (Resilient Distributed Datasets), and thus offers the user the possibility of checkpointing them to disk manually. The upper part of the Figure shows conventional Spark-based checkpointing where writing of RDD3 to file store 15 is performed manually.
The lower part of Figure 11 represents an embodiment of the present invention combining RDMA, manual or automatic checkpointing (as discussed above), and in-memory checkpointing, in which RDD3 is transferred via RDMA from computation node memory 240 to staging node memory 740, where this RDD is retained even after being written to file store 150. As noted in the Figure, not only the checkpointing itself but also the subsequent recovery, if required, are speedier than the conventional Spark-based approach, owing to the superior access speed of memories 240 and 740, particularly when using RDMA. The downside is the need to reserve, or additionally provide, memory resources in the computer system for the use of the staging nodes.
Spark can be modified to implement the checkpointing mechanism of embodiments of the invention. Following the described mechanism, RDDs are not checkpointed directly to disk, which would incur bottlenecks. Instead, RDDs are transferred using RDMA (thereby speeding up the communication) to the corresponding staging node, which keeps a copy of the RDDs in memory and transfers them asynchronously to disk. As a result, the checkpointing and recovery phases are boosted, thanks to the use of RDMA and the in-memory copy of RDDs in the staging node. In order to implement this mechanism within Spark, its API is not required to be modified, and therefore users can keep using the same function to trigger the checkpointing process. For example, if using Scala:
val ssc = new StreamingContext(...)  // new context
ssc.checkpoint(checkpointDirectory)  // set checkpoint

where checkpoint() is the function to make the checkpoint of the corresponding RDD. The only addition needed to the API is a function by which the user could set when the automatic checkpointing mechanism is triggered. This function should be called at the beginning of the application, once the context has been created.
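For instance, the additional call might take the following form; the function name autoCheckpoint and its parameters are illustrative assumptions only and are not part of Spark's existing API:

val ssc = new StreamingContext(...)  // new context, as before
ssc.checkpoint(checkpointDirectory)  // set checkpoint directory, as before

// Hypothetical addition: register, once the context has been created,
// the thresholds under which automatic checkpointing is triggered.
ssc.autoCheckpoint(
  minComputeTimeMs = 5000L, // computation time threshold
  minPriority = 8,          // priority threshold
  minUsageCount = 100L      // usage frequency threshold
)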
What needs to change is the underlying implementation of this function, as well as how Spark is physically deployed in a cluster, because applying embodiments of the invention makes it necessary to change how the nodes are used and structured, setting up the computation and the staging nodes as shown in Figure 5.
The automatic mechanism proposed by embodiments of the invention should be implemented at a different level within Spark. Spark provides a scheduler, as already mentioned. The scheduler is run by the driver, which is a centralised program that is aware of the dependencies between the different objects being created. Because of this, it should be the responsibility of Spark's scheduler to perform automatic checkpointing. That is, the scheduler should have the task of monitoring the RDDs, because it is the scheduler which is aware of the size of each RDD, as well as the time required to compute them. With this information, the scheduler should be able to start the checkpointing process, as shown in Figure 10.
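The following sketch indicates, at a conceptual level only, how such a scheduler-driven trigger could be structured; Spark's actual scheduler internals differ, and the names AutoCheckpointMonitor, StagingClient, onRddComputed and so on are assumptions made for illustration:

// Abstraction of the communication with the assigned staging node.
trait StagingClient {
  def sendAttributes(rddId: Int, sizeBytes: Long, computeTimeMs: Long): Unit
  def requestCheckpoint(rddId: Int): Unit
}

// Driver-side monitor, conceptually attached to the scheduler, which
// already knows the size of each RDD and the time taken to compute it.
class AutoCheckpointMonitor(minComputeTimeMs: Long, staging: StagingClient) {
  def onRddComputed(rddId: Int, sizeBytes: Long, computeTimeMs: Long): Unit =
    if (computeTimeMs >= minComputeTimeMs) {
      // Send the object attributes first, so that the staging node can
      // judge available space and choose how to duplicate the object.
      staging.sendAttributes(rddId, sizeBytes, computeTimeMs)
      staging.requestCheckpoint(rddId)
    }
}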
Figure 12 is a block diagram of a computing device 1000 which may be used as a computation node and/or a staging node as referred to above in order to implement a method of an embodiment. The computing device 1000 comprises a computer processing unit (CPU) 993, memory, such as Random Access Memory (RAM) 995, and storage, such as a hard disk, 996. The computing device also includes a network adapter 999 for communication with other such computing devices of embodiments. For example, an embodiment may be composed of a network of such computing devices. Optionally, the computing device also includes Read Only Memory 994, one or more input mechanisms such as keyboard and mouse 998, and a display unit such as one or more monitors 997. The components are connectable to one another via a bus 992. The CPU 993 is configured to control the computing device and execute processing operations. The RAM 995 stores data being read and written by the CPU 993. The storage unit 996 may be, for example, a non-volatile storage unit, and is configured to store data.
The CPU 993 may include one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. The processor may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In one or more embodiments, a processor is configured to execute instructions for performing the operations and steps discussed herein.
The storage unit 996 may include a computer readable medium, which term may refer to a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) configured to carry computer-executable instructions or have data structures stored thereon. Computer-executable instructions may include, for example, instructions and data accessible by and causing a general purpose computer, special purpose computer, or special purpose processing device (e.g., one or more processors) to perform one or more functions or operations.
Thus, the term "computer-readable storage medium" may also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methods of the present disclosure. The term "computer-readable storage medium" may accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. By way of example, and not limitation, such computer-readable media may include non-transitory computer-readable storage media, including Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, and flash memory devices (e.g., solid state memory devices).
The display unit 997 displays a representation of data stored by the computing device and displays a cursor and dialog boxes and screens enabling interaction between a user and the programs and data stored on the computing device. The input mechanisms 998 enable a user to input data and instructions to the computing device.
The network adapter (network I/F) 999 is connected to a network, such as a high-speed LAN or the Internet, and is connectable to other such computing devices via the network. The network adapter 999 controls data input/output from/to other apparatus via the network. Other peripheral devices such as microphone, speakers, printer, power supply unit, fan, case, scanner, trackball etc. may be included in the computing device 1000.
Methods embodying the present invention may be carried out on a computing device such as that illustrated in Figure 12. Such a computing device need not have every component illustrated in Figure 12, and may be composed of a subset of those components. A computation node or staging node embodying the present invention may be implemented by a single computing device 1000 in communication with one or more other computing devices via a network. The computing device may be a data storage server itself storing at least a portion of the objects. A computation node or staging node embodying the present invention may alternatively be implemented by a plurality of computing devices operating in cooperation with one another. One or more of the plurality of computing devices may be a data storage server storing at least a portion of the objects.
To summarise, embodiments of the present invention provide a checkpointing mechanism by which in-memory data structures are copied from computation to staging nodes using RDMA, checkpoints are made and kept in the staging nodes' memories, and then asynchronously copied to non-volatile storage. In contrast to previous approaches, checkpoints remain in volatile memory as part of the checkpointing mechanism. As a result, recovery from a checkpoint is potentially faster, since the required checkpoint may be already in memory in the staging node. An automatic and customisable mechanism is provided to control when the checkpointing process is triggered. As an alternative to copying an object through the network, the object in memory can be updated to a newer version of the object by applying the chain of changes made to the object in the corresponding computation node.
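A minimal sketch of this update path, assuming each recorded change can be represented as a function from object to object (the names here are illustrative only):

// On the staging node: bring the retained checkpoint up to date by
// replaying the chain of changes recorded on the computation node,
// instead of copying the whole object through the network again.
def updateRetained[A](retained: A, changes: Seq[A => A]): A =
  changes.foldLeft(retained)((obj, t) => t(obj))

For example, an integer object checkpointed as 5 and subsequently transformed on the computation node by t(n) = +1 and t(n+1) = *2 would be brought up to date on the staging node by updateRetained(5, Seq((x: Int) => x + 1, (x: Int) => x * 2)), yielding 12 without any further network copy.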
It should be noted that Spark allows performing checkpoints as a complementary measure to lineage recovery. However, checkpoints in Spark are manual and user-guided, while the method described above is transparent and does not need the user to perform the checkpointing. Other novel features in embodiments include:
A checkpointing mechanism by which in-memory data structures are copied from computation to staging nodes, checkpoints are made and kept in the staging nodes' memories, and then asynchronously copied to non-volatile storage.
The automatic and customisable mechanism that controls when the checkpointing process is triggered.
The mechanism by which a checkpointed object in memory is updated to a newer version of the object by applying the chain of changes made to the object in the corresponding computation node.
The automatic mechanism by which the system decides whether an object is checkpointed either by copying the object through the network, or by applying its lineage computation chain (illustrated by the sketch following this list).
The mechanism by which memory in the staging nodes is freed to create space for new checkpoints, using a combination of criteria.
The combined mechanism that allows manual and automatic triggers for the same checkpointing process.
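To make the copy-versus-lineage decision above concrete, the following sketch compares a rough estimate of the network copy time against the recorded cost of replaying the lineage; this simple cost model, and all names used, are illustrative assumptions rather than a prescribed formula:

sealed trait DuplicationStrategy
case object CopyViaRdma extends DuplicationStrategy
case object ApplyLineage extends DuplicationStrategy

// Choose whichever duplication strategy is estimated to be quicker:
// reading the object over RDMA, or replaying its lineage locally.
def chooseStrategy(objectSizeBytes: Long,
                   rdmaBytesPerMs: Long,
                   replayCostsMs: Seq[Long]): DuplicationStrategy = {
  val copyEstimateMs = objectSizeBytes / rdmaBytesPerMs
  val replayEstimateMs = replayCostsMs.sum
  if (replayEstimateMs < copyEstimateMs) ApplyLineage else CopyViaRdma
}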
Various modifications are possible within the scope of the invention.
Although one sequence of steps has been described in the above embodiment, this is not necessarily the only possible sequence, and operations may be performed in a different order so long as the overall effect is equivalent.
Embodiments can also be applied beyond in-memory data structures: checkpointing techniques involving processes (state of the process, context, memory stack, etc.) can also benefit. However, applying embodiments of the invention in this field involves solving certain technical problems that may arise because of the peculiarities of checkpointing processes, as opposed to strictly data.
The present invention is particularly applicable to applications having "large granularity", where the same operations are applied in parallel to many data items, such as Big Data applications or HPC in general. Embodiments could also be applied in scenarios in which finer-level changes are applied. However, if the granularity is finer, the number of operations, and of 'objects' to which they are applied, would increase, and so would the resources needed.
Although an embodiment was described with reference to Spark and RDDs, the present invention is not limited to such use.
If the present invention is applied as an enhancement of Spark, this can be done in various ways. One way is the addition of more commands to the existing Spark package as an option. However, this requires keeping Spark capable of both the conventional mechanism and one embodying the present invention. Another approach is to modify Spark internals to adopt the new checkpointing mechanism, while modifying existing commands, if necessary, to include the semantics described above.
References above to RDMA are to be understood to include any protocol for interconnecting memories of different nodes without involvement of CPUs and operating systems, including for example RDMA over Converged Ethernet (RoCE) and Internet Wide Area RDMA Protocol (iWARP).
Embodiments of the present invention have application to various fields of computing for improving the efficiency of checkpointing. An improvement is provided in terms of speed for checkpointing and recovery processes, because bottlenecks due to writes to disk are eliminated by copying the in-memory data structures from memory to memory. Communications in checkpointing processes are accelerated thanks to the use of RDMA connections. Communications needed for checkpointing processes are reduced by applying the lineage computation chain for updating existing checkpoints.
There is an improvement in the usability of checkpointing processes. Since checkpoints are made automatically, users do not need to know when it is the best moment to perform the checkpoint, or even bother about how the checkpoint has to be performed. Moreover, the described mechanism by which the system decides whether an object is checkpointed either by copying the object through the network, or by applying its lineage computation chain, encapsulates the complexity and hides it from the user, who can focus on the actual application instead of dealing with the fault-tolerance mechanism. Thanks to customisable checkpointing, flexibility is improved: users can change the conditions under which automatic checkpoints are made.

Claims (15)

1. A method of checkpointing a data object in a computer system having a computation node, a staging node and access to a file store, the method comprising:
duplicating, in a memory of the staging node, an object from a memory of the computation node;
copying the object from the memory of the staging node to the file store; and
retaining the object in the memory of the staging node after copying the object to the file store.

2. The method according to claim 1 wherein duplicating the object comprises copying the object from the memory of the computation node to the memory of the staging node.

3. The method according to claim 2 wherein copying the object from a memory of the computation node to a memory of the staging node is performed using Remote Direct Memory Access, RDMA.

4. The method according to claim 1 wherein duplicating the object comprises updating the retained object by, in the staging node, applying one or more transformations to the object retained in the memory of the staging node to replicate changes made to the object in the computation node.

5. The method according to any preceding claim further comprising, prior to duplicating the object, selecting whether to copy the object from the computation node or to update the object in the staging node.

6. The method according to claim 5 wherein the selecting is performed by calculating whether the staging node applying said one or more transformations to the object is quicker than reading the object from the memory of the computation node.

7. The method according to any preceding claim further comprising an initial step of the computation node deciding to checkpoint the object and sending object attributes to the staging node.

8. The method according to claim 7 further comprising setting conditions under which the computation node decides to checkpoint the object, including any one or more of:
computation time of the object,
priority of the object, and
usage frequency of the object.

9. The method according to claim 7 or 8 further comprising the staging node receiving the object attributes from the computation node and, based on the object attributes, selecting whether to copy the object from the computation node or to update the object in the staging node.

10. The method according to any preceding claim further comprising, prior to the duplicating, judging whether sufficient space exists in the memory of the staging node and, if not, creating space in the memory.

11. The method according to any preceding claim wherein copying the object from the memory of the staging node to the file store is performed asynchronously.

12. The method according to any preceding claim further comprising, upon occurrence of a fault in the computation node, restoring the object from the memory of the staging node to the memory of the computation node.

13. The method according to any preceding claim wherein the memory of the staging node is a volatile memory.

14. A computer system comprising:
a plurality of computation nodes for processing data objects;
a plurality of staging nodes, each staging node assigned to one or more of the computation nodes; and
a network for exchanging data including objects between the computation nodes and the staging nodes and accessing a file store; wherein
a said staging node is arranged to:
duplicate, in a memory of the staging node, an object from a memory of a computation node to which the staging node is assigned;
copy the object from the memory of the staging node to the file store; and
retain the object in the memory of the staging node after copying the object to the file store.

15. A computer program containing computer-readable instructions which, when executed by processors of a computation node and/or a staging node in a computer system, perform the method according to any of claims 1 to 13.
GB1609530.9A 2016-05-31 2016-05-31 Automatic and customisable checkpointing Active GB2558517B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1609530.9A GB2558517B (en) 2016-05-31 2016-05-31 Automatic and customisable checkpointing
US15/454,651 US10949378B2 (en) 2016-05-31 2017-03-09 Automatic and customisable checkpointing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1609530.9A GB2558517B (en) 2016-05-31 2016-05-31 Automatic and customisable checkpointing

Publications (3)

Publication Number Publication Date
GB201609530D0 GB201609530D0 (en) 2016-07-13
GB2558517A true GB2558517A (en) 2018-07-18
GB2558517B GB2558517B (en) 2022-02-16

Family

ID=56410787

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1609530.9A Active GB2558517B (en) 2016-05-31 2016-05-31 Automatic and customisable checkpointing

Country Status (1)

Country Link
GB (1) GB2558517B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080046644A1 (en) * 2006-08-15 2008-02-21 Kristof De Spiegeleer Method and System to Provide a Redundant Buffer Cache for Block Based Storage Servers
WO2012130348A1 (en) * 2011-03-31 2012-10-04 Thomson Licensing Method for data cache in a gateway
WO2013101142A1 (en) * 2011-12-30 2013-07-04 Intel Corporation Low latency cluster computing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"31st International Conference on Distributed Computing Systems (ICDCS)", 2011, IEEE, Prabhakar et al, "Provisioning a Multi-Tiered Data Staging Area for Extreme-Scale Machines", ISBN 978-1-61284-384-1 ; ISBN 1-61284-384-0 *
"International Conference for High Performance Computing Networking, Storage and Analysis (SC)", 2012, IEEE, Sato et al "Design and Modeling of a Non-blocking Checkpointing System", ISBN 978-1-4673-0805-2 ; ISBN 1-4673-0805-6 *

Also Published As

Publication number Publication date
GB2558517B (en) 2022-02-16
GB201609530D0 (en) 2016-07-13

Similar Documents

Publication Publication Date Title
US11461695B2 (en) Systems and methods for fault tolerance recover during training of a model of a classifier using a distributed system
US11442961B2 (en) Active transaction list synchronization method and apparatus
US9996427B2 (en) Parallel backup for distributed database system environments
US9075858B2 (en) Non-disruptive data movement and node rebalancing in extreme OLTP environments
US9069701B2 (en) Virtual machine failover
CN112269781B (en) Data life cycle management method, device, medium and electronic equipment
US20150186411A1 (en) Enhancing Reliability of a Storage System by Strategic Replica Placement and Migration
US11327966B1 (en) Disaggregated query processing on data lakes based on pipelined, massively parallel, distributed native query execution on compute clusters utilizing precise, parallel, asynchronous shared storage repository access
US11657066B2 (en) Method, apparatus and medium for data synchronization between cloud database nodes
JP5721056B2 (en) Transaction processing apparatus, transaction processing method, and transaction processing program
Wang et al. BeTL: MapReduce checkpoint tactics beneath the task level
US11188516B2 (en) Providing consistent database recovery after database failure for distributed databases with non-durable storage leveraging background synchronization point
US10789087B2 (en) Insight usage across computing nodes running containerized analytics
WO2021135210A1 (en) Methods and apparatuses for generating redo records for cloud-based database
JP5719083B2 (en) Database apparatus, program, and data processing method
US10949378B2 (en) Automatic and customisable checkpointing
Hursey et al. A composable runtime recovery policy framework supporting resilient HPC applications
US9460126B1 (en) Rotational maintenance of database partitions
US10929238B2 (en) Management of changed-block bitmaps
GB2558517A (en) Automatic and customisable checkpointing
CN112965939A (en) File merging method, device and equipment
Wang et al. Rect: Improving mapreduce performance under failures with resilient checkpointing tactics
US20220277006A1 (en) Disaggregated Query Processing Utilizing Precise, Parallel, Asynchronous Shared Storage Repository Access
US11755538B2 (en) Distributed management of file modification-time field
US11003684B1 (en) Cooperative log application