GB2537038A - Resilient programming frameworks for handling failures in parallel programs - Google Patents
Resilient programming frameworks for handling failures in parallel programs
- Publication number
- GB2537038A (application GB1604052.9A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- application
- resilient
- place
- executor
- computation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1479—Generic software techniques for error detection or fault masking
- G06F11/1482—Generic software techniques for error detection or fault masking by means of middleware or OS functionality
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Retry When Errors Occur (AREA)
Abstract
A method for supporting resilient execution of computer programs, and an information processing system 100 and a computer readable storage medium, each suitable for implementing the method. The method provides a resilient store 138 wherein information in the resilient store can be accessed in the event of a failure. The method then periodically checkpoints application state in the resilient store. A resilient executor 130 comprises software which executes applications by catching failures. The method uses the resilient executor to execute at least one application and, in response to the resilient executor detecting a failure, restoring application state information to the at least one application from a checkpoint stored in the resilient store, the resilient executor resuming execution of the at least one application. The resilient executor may also include an interface allowing applications to use it and invoke resilient run and recovery methods.
Description
RESILIENT PROGRAMMING FRAMEWORKS
FOR HANDLING FAILURES IN PARALLEL PROGRAMS
BACKGROUND
[0001] The present disclosure generally relates to fault-tolerant computing, and more particularly relates to a method and system for resilient computer programming frameworks for handling failures in executing parallel computer programs.
[0002] Failures in executing computer programs constitute a significant problem. The problem is compounded in multiprocessor environments where failure of a single processor can cause a computation to fail, requiring it to be run from scratch.
[0003] In recent years, frameworks such as MapReduce (Hadoop is a well-known implementation, http://hadoop.apache.org/), Spark (https://spark.apache.org/) and Pregel ("Pregel: A System for Large-Scale Graph Processing", Malewicz et al., Proceedings of SIGMOD 2010, http://kowshik.github.io/JPregel/pregel_paper.pdf) have been introduced which provide some degree of resilience to failures. A main drawback of these approaches is that they apply only to applications which follow certain regular patterns. There are many applications which do not fit within the paradigms of MapReduce or Pregel.
[0004] MPI (http://www.mcs.anl.gov/research/projects/mpi/) has often been used to program parallel computing systems. However, while MPI has provided message-passing support, it has not provided a full-fledged programming environment. Instead, it was designed to be used in conjunction with existing programming languages such as C, C++, Fortran, Java™, etc. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
[0005] There is thus a need for more general frameworks which help programmers write resilient programs.
BRIEF SUMMARY
[0006] According to one embodiment of the present disclosure, a method for supporting resilient execution of computer programs comprising the steps of: providing a resilient store wherein information in the resilient store can be accessed in the event of a failure; periodically checkpointing application state in the resilient store; providing a resilient executor which comprises software which executes applications by catching failures; using the resilient executor to execute at least one application; and in response to the resilient executor detecting a failure, restoring application state information from a checkpoint in the resilient store, the resilient executor resuming execution of the at least one application.
[0007] According to another embodiment of the present disclosure, an information processing system capable of supporting resilient execution of computer programs, the information processing system comprising: memory; persistent memory for storing data and computer instructions; a resilient store, communicatively coupled with the memory and the persistent memory, wherein information (e.g. application state information) stored in the resilient store can be accessed in the event of a failure (e.g. in response to detection of a failure) of an application executing in the information processing system; a resilient executor, communicatively coupled with the memory and the persistent memory, for executing computations of applications by catching failures in the execution of the computations; a processor, communicatively coupled with the resilient executor, resilient store, the memory, the persistent memory, and wherein the processor, responsive to executing computer instructions, performs operations comprising: periodically checkpointing application state in the resilient store; executing, with the resilient executor, computations of an application while catching failures in the execution of the computations; restoring, based on the resilient executor detecting a failure in the execution of a computation of the application, application state information for the application from a checkpoint in the resilient store; and resuming, with the resilient executor, execution of the computation of the application with the restored application state information.
[0008] According to yet another embodiment of the present disclosure, a computer readable storage medium comprises computer instructions which, responsive to being executed by a processor, cause the processor to perform operations for supporting resilient execution of computer programs, the operations comprising: providing a resilient store wherein information in the resilient store can be accessed in the event of a failure; periodically checkpointing application state in the resilient store; providing a resilient executor which comprises software which executes applications by catching failures; using the resilient executor to execute at least one application; and in response to the resilient executor detecting a failure, restoring application state information to the at least one application from a checkpoint stored in the resilient store, the resilient executor resuming execution of the at least one application with the restored application state information.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0009] The accompanying figures, in which like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present disclosure. Preferred embodiments of the present invention will now be described, by way of example only, and with reference to the following drawings:
FIG. 1 is a block diagram illustrating an example of an information processing system in which a computer programming framework is implemented, according to a preferred embodiment of the present disclosure;
FIG. 2 is a program listing illustrating an example of a ResilientComputation interface, according to various examples of the present disclosure;
FIG. 3 is a program listing illustrating an example of a ResilientIterativeComputation, according to various examples of the present disclosure;
FIGs. 4 and 5 constitute a program listing illustrating an example of a ResilientExecutor class that can be communicatively coupled with an application, according to various examples of the present disclosure.
DETAILED DESCRIPTION
[0010] According to various embodiments of the present disclosure, disclosed is a system and method providing a new computer programming framework for programmers to write resilient programs. Low level details such as catching and handling failures are handled by special software. This relieves significant programming burdens from software programmers and particularly from programmers of modern parallel computing applications.
[0011] Various embodiments of the disclosure are applicable to computer programming frameworks for software applications comprising a state machine with a state which can be periodically saved as a stored checkpoint. In the event of a failure, an application can be restarted from application state information restored from a previous stored checkpoint. If an application can be properly restored without referring to any saved state (e.g., the application already stores all of the state information needed for recovery in persistent storage, such as on disk), it is even easier to use an embodiment of the present disclosure for handling program resilience.
[0012] According to various embodiments, a programming framework provides software allowing an application to achieve resilience. It greatly simplifies the task of writing resilient programs.
[0013] Various embodiments of the present disclosure provide to application programs one or more of the following features: 1) an ability to execute an application program using exception handling which detects failed places. The term "place" refers to at least a part of an executing computation that may be for an application, such as a process (or in some cases, one or more threads). A place may comprise an entity executing a computation.
2) an ability to reliably checkpoint data structures in an application so that the data structures will be preserved in the event of a failure.
3) virtual places which hide the actual physical places on which a computation is executing. Programs reference virtual places instead of physical places. That way, if a physical place fails, the computation can continue to reference virtual places which do not fail. Virtual places are mapped to physical places. Mappings of virtual to physical places can be updated to mask physical place failures.
[0014] Various non-limiting example embodiments of the present disclosure are described herein using, for illustration purposes only, the X10 programming language ("X10 Language Specification Version 2.5", Saraswat et al., http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf). Additional information about X10 is available from: http://x10-lang.org/
[0015] An embodiment of the present disclosure could also be implemented for other programming languages and programming environments.
[0016] The ResilientComputation/ResilientExecutor framework, as will be discussed below, allows X10 programs to be written so that the programmer does not have to worry about low level failure handling. Low level details for handling failures such as catching and handling dead place exceptions (which is a type of exception that is generated by the X10 runtime system when a place fails) are handled by a ResilientExecutor class (and classes that it makes use of such as VirtualPlaceMap and ResilientMap).
[0017] The framework is applicable, according to various embodiments, to applications with state which can be periodically checkpointed. In the event of a failure, the application is restarted from the point of the last consistent checkpoint. If an application can be properly restored without any checkpointed state (e.g., the application already stores all of the state needed for recovery in persistent storage, such as on disk), it is even easier to use the framework for handling resilience.
[0018] The present example framework makes use of the following classes:
ResilientExecutor: The main class implementing the framework. See FIGs. 4 and 5 for an example.
ResilientComputation: An interface which specifies the application-specific methods that an application can use to implement the framework. See FIG. 2 for an example.
ResilientIterativeComputation: An interface which specifies the application-specific methods that an iterative application can use to implement the framework. See FIG. 3 for an example.
VirtualPlaceMap: A class implementing virtual places which hides the actual physical places used by the application so that the application does not have to deal directly with place failures. An example of the VirtualPlaceMap will be discussed below with reference to FIG. 3.
ResilientMap: A class which provides resilient storage which is accessible in the event of place failures.
PlaceGroupUnordered: This class implements place groups in which the order in the place group may differ from the physical order of places. This is used for managing virtual places.
[0019] According to the present example, to use this framework an application can implement the ResilientComputation interface (generally applicable) or the ResilientIterativeComputation interface (for iterative computations). There are a wide variety of other interfaces within the spirit and scope of the present disclosure. The ResilientComputation interface, including its several methods, is shown in FIG. 2 according to the present example. The ResilientIterativeComputation interface, including its several methods, is shown in FIG. 3 according to the present example. The ResilientExecutor class, and its several methods, is shown in FIGs. 4 and 5 according to the present example.
[0020] The framework of the present disclosure, according to the present example, can be used with an information processing system as will be described in more detail below.
[0021] An application program is implemented as an instance (comp) of a class which implements the ResilientComputation interface. If the ResilientIterativeComputation interface or other interface is used instead, the process would be similar.
comp creates a new ResilientExecutor object, resExec, for the application program. resExec.runResiliently() is called to invoke the run method in comp resiliently.
[0022] The run method in comp periodically invokes resExec.checkpoint to checkpoint the state of the computation.
[0023] If resExec.runResiliently encounters (e.g., detects and catches) exceptions (particularly, dead place exceptions), it resiliently restores the state of the computation to the previous checkpoint by invoking comp.restore. After the state of the computation is restored to the previous checkpoint, resExec.runResiliently continues the computation by invoking the run method in comp resiliently.
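The run, checkpoint and restore cycle described in paragraphs [0021] to [0023] can be illustrated with a minimal single-process sketch. It is written in Python rather than X10, and it is not the patent's implementation: the names DeadPlaceError, CountingComputation and max_recoveries are hypothetical, the "resilient store" is just an in-memory deep copy, and the failure is simulated.

```python
import copy


class DeadPlaceError(Exception):
    """Hypothetical stand-in for the dead place exception raised by X10."""


class ResilientExecutor:
    """Runs a computation and restores it from the last checkpoint on failure."""

    def __init__(self, comp, max_recoveries=3):
        self.comp = comp
        self.max_recoveries = max_recoveries  # assumed recovery limit
        self.saved_state = None  # in-memory stand-in for the resilient store

    def checkpoint(self):
        # Deep-copy so later changes to live state cannot alter the checkpoint.
        self.saved_state = copy.deepcopy(self.comp.state)

    def run_resiliently(self):
        recoveries = 0
        while True:
            try:
                self.comp.run(self)
                return
            except DeadPlaceError:
                recoveries += 1
                if recoveries > self.max_recoveries:
                    raise RuntimeError("recovery failed too many times")
                # Roll the application back, then loop to resume execution.
                self.comp.restore(copy.deepcopy(self.saved_state))


class CountingComputation:
    """Toy application: counts to `target` and fails once at `fail_at`."""

    def __init__(self, target, fail_at):
        self.state = {"i": 0}
        self.target = target
        self.fail_at = fail_at
        self.failed_once = False

    def run(self, executor):
        while self.state["i"] < self.target:
            self.state["i"] += 1
            if self.state["i"] % 2 == 0:
                executor.checkpoint()  # periodic checkpoint
            if self.state["i"] == self.fail_at and not self.failed_once:
                self.failed_once = True
                raise DeadPlaceError("simulated place failure")

    def restore(self, state):
        self.state = state


comp = CountingComputation(target=10, fail_at=5)
executor = ResilientExecutor(comp)
executor.checkpoint()  # initial checkpoint so a restore always has state
executor.run_resiliently()
```

In this sketch the failure at count 5 rolls the application back to the checkpoint taken at count 4, and the computation then runs to completion. A loop is used for recovery here instead of the mutual recursion between run and restore methods that the X10 listings use.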
[0024] A main enabler for the above described approach using the framework is the use of virtual places. Virtual place numbers remain constant throughout a computation so that application-specific code does not need to be modified as the result of place failures. The ResilientExecutor class maintains the virtual place map and replaces dead physical places with live physical places to keep virtual place numbers consistent throughout a computation. Applications are written to iterate over virtual places instead of physical places.
[0025] The ResilientExecutor class is responsible for running application programs under an environment where place failures are automatically detected and caught via exception handling, and dead place exceptions in particular are properly dealt with. This class maintains a virtual place map to hide the fact that place failures may have occurred which necessitate the replacement of one or more physical places with other physical places. This class also provides a resilient environment for recovering from failures. Below is a summary of four features of the resilient framework: 1) Providing code to detect and recover from failures. User application code does not have to worry about low level failure detection either during normal processing or recovery. Exception handling is structured to provide resiliency for normal execution, checkpointing, and recovery from failures.
2) Providing support to efficiently checkpoint applications. The ResilientMap class is a main feature providing this support.
3) Applications refer to virtual places instead of actual physical places.
4) An object-oriented framework which provides a well-defined interface. The ResilientExecutor can be customized to handle different types of failures and different failure-handling requirements.
[0026] Two features of the resilient framework are: 1) it handles failure/recovery details so that the programmer does not have to deal with these details, and 2) it supports efficient checkpointing of application computations. This resilient framework is very general and supports efficient handling of a much broader range of applications than frameworks such as Hadoop, Spark, and Pregel.
[0027] Virtual Places
Virtual places can be used to mask place failures from programs. Programs refer to virtual places instead of the physical places on which a computation executes. Virtual places can remain constant during the execution of a program. The underlying physical places may change. For example, if a virtual place v1 is mapped to physical place p1 and p1 fails, then v1 can be mapped to another physical place p2. The program can continue to refer to virtual place v1 both before the failure of p1 and after the failure of p1. That way, the application programmer does not have to write special code to deal with the fact that places in a program may change due to failures.
[0028] There are multiple ways that the system can obtain another place p2 to replace failed place p1. One option is to have a number of spare places running at the start of a computation. Whenever a place fails, the failed place is replaced by a spare place. This method incurs overhead for spare places. In addition, problems occur if the system runs out of spare places.
[0029] Another method is to start up a new place at the time of a place failure to replace the failed place. This avoids the drawbacks of spare places. There could be some overhead/delay in starting up a new place, though.
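The two replacement strategies above (a pool of spare places, and starting a new place on demand) can be combined in a small sketch. This is an illustrative Python model under stated assumptions, not the patent's VirtualPlaceMap class: places are modelled as plain integers, the method names are hypothetical, and "starting a new place" is reduced to handing out an unused integer id.

```python
import itertools


class VirtualPlaceMap:
    """Maps stable virtual place ids to replaceable physical place ids."""

    def __init__(self, num_places, spares=()):
        # Initially virtual place i runs on physical place i.
        self.virtual_to_physical = list(range(num_places))
        self.spares = list(spares)
        # Ids for places started on demand begin past every known id.
        start = max([num_places - 1, *self.spares]) + 1
        self._fresh_ids = itertools.count(start)

    def physical_to_virtual(self, physical):
        """Return the virtual id served by `physical`, or None if unmapped."""
        try:
            return self.virtual_to_physical.index(physical)
        except ValueError:
            return None

    def handle_dead_place(self, dead_physical):
        """Remap the virtual place of a dead physical place; return the new one."""
        virtual = self.physical_to_virtual(dead_physical)
        if virtual is None:
            return None  # the dead place was not in use
        if self.spares:
            replacement = self.spares.pop(0)      # strategy 1: use a spare
        else:
            replacement = next(self._fresh_ids)   # strategy 2: start a new place
        self.virtual_to_physical[virtual] = replacement
        return replacement


places = VirtualPlaceMap(3, spares=[3])
first = places.handle_dead_place(1)    # served from the spare pool
second = places.handle_dead_place(3)   # pool is empty, so a new id is used
```

Application code in this model keeps addressing virtual place 1 throughout, while the physical place behind it changes from 1 to 3 and then to 4.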
[0030] Virtual places are implemented by the class VirtualPlaceMap, which includes the following methods which an application program can invoke:

    /* Construct virtual place map identical to physical place map for first
     * numPlaces places. Throw an exception if numPlaces is out of range */
    public def this(numPlaces: Long)

    /* Return the virtual place id corresponding to "place". Return
     * NONEXISTENT_PLACE if no virtual place is found corresponding to "place" */
    public def physicalToVirtual(place: Place): Long

    /**
     * Return a key that is specific to the virtual place corresponding to the
     * current place. If the current place is not part of the virtual place
     * map, return NONEXISTENT_PLACE_STRING. Useful for generating keys for
     * storing place local data in resilient stores.
     */
    public def placeSpecificKey(keyRoot: String): String

    /* Print out the contents of a virtual place map */
    public def printVirtualPlaceMap(): void

    /* Replace virtual place with id "id" with physical place "place". Throw
     * an exception if "id" is out of range */
    public def replaceVirtualPlace(id: Long, place: Place): void

    /* Return total number of virtual places in the map */
    public def totalVirtualPlaces(): Long

    /* Return physical place with id "id". Throw exception if id is out of
     * range */
    public def virtualToPhysical(id: Long): Place

[0031] Resilient Executor Implementation
The ResilientExecutor class, according to various embodiments, runs programs resiliently using the following methods. The resilient executor comprises software which executes applications by catching failures. The resilient executor can also handle at least one exception by recursively catching and handling additional exceptions which occur. Below will be discussed how the ResilientExecutor class can be implemented using the X10 programming language. It is also possible to implement our invention using other programming languages.
Note in the methods below that "computation" and "iterativeComputation" are objects representing the application. According to various embodiments, a framework implementation could have more objects to represent additional types of applications within the spirit and scope of the present disclosure.
    // Run computation resiliently, handling failures
    public def runResiliently(): void {
        try {
            finish computation.run(); // application-specific method
        } catch (e:MultipleExceptions) {
            Console.OUT.println("ResilientExecutor runResiliently has caught exceptions");
            handleExceptionsResiliently(e, 0n);
            restoreResiliently(NUM_RECOVERY_ATTEMPTS);
        }
    }

    // Run iterative computation resiliently, handling failures
    public def iterateResiliently(): void {
        var keepIterating: Boolean = true;
        try {
            finish {
                Console.OUT.println("ResilientExecutor about to invoke iterative computation");
                while (keepIterating) {
                    iterativeComputation.step(); // application-specific method
                    iterativeComputation.checkpoint(); // application-specific method
                    keepIterating = iterativeComputation.notFinished(); // application-specific method
                }
            }
        } catch (e:MultipleExceptions) {
            Console.OUT.println("ResilientExecutor iterateResiliently has caught exceptions");
            handleExceptionsResiliently(e, 0n);
            restoreResiliently(NUM_RECOVERY_ATTEMPTS);
        }
    }

[0032] If additional types of computations are used, it is possible to have additional run methods within the spirit and scope of the invention.
[0033] If the run methods encounter failures, they attempt (e.g., invoking a recovery method) to resiliently restore the state of the computation via:

    // Restore computation resiliently, handling failures
    public def restoreResiliently(attemptsLeft:Int): void {
        if (attemptsLeft < 1n)
            throw new Exception("Error in ResilientExecutor.X10 restoreResiliently: Recovery from failure failed too many times");
        else {
            try {
                finish restore();
            } catch (e:MultipleExceptions) {
                handleExceptionsResiliently(e, 0n);
                restoreResiliently(attemptsLeft - 1n);
            }
            resumeExecution();
        }
    }

[0034] The restoreResiliently method invokes application-specific methods to restore the state of the computation of the application. It might restore application state information from a checkpoint stored in resilient storage. Below is the code for restore. Note that "computation" and "iterativeComputation" are objects representing the application. A particular embodiment could have more objects to represent additional types of applications within the spirit and scope of the present disclosure.
    private def restore() {
        switch (computationType) {
            case GENERAL:
                computation.restore(); // application-specific method
                break;
            case ITERATIVE:
                iterativeComputation.restore(); // application-specific method
                break;
            default:
                throw new Exception("Error in ResilientExecutor.X10: unknown value of computationType in restore(): " + computationType);
        }
    }

[0035] It is critically important to catch and handle exceptions properly. This is achieved by the following method which is targeted to identifying dead place exceptions. A dead place exception is an exception which occurs when a place fails. It would be possible to extend this method within the spirit and scope of the invention to handle other types of exceptions as well.
    // resilient method for handling exceptions
    private def handleExceptionsResiliently(e:MultipleExceptions, numExceptionsHandledSoFar:Int): void {
        val exceptions = e.exceptions; // e is a Rail of CheckedThrowable
        val numExceptions = exceptions.size;
        var numExceptionsHandled:Int = numExceptionsHandledSoFar;
        try {
            finish while (numExceptionsHandled < numExceptions) {
                if (exceptions(numExceptionsHandled) instanceof MultipleExceptions)
                    handleExceptionsResiliently(exceptions(numExceptionsHandled) as MultipleExceptions, 0n);
                else if (!(exceptions(numExceptionsHandled) instanceof DeadPlaceException)) {
                    Console.OUT.println("Error: Exception encountered. Here is the stack trace: ");
                    e.printStackTrace();
                    System.killHere();
                } else {
                    val deadPlace = (exceptions(numExceptionsHandled) as DeadPlaceException).place;
                    handleDeadPlace(deadPlace);
                }
                numExceptionsHandled++;
            } // while
        } catch (e2:MultipleExceptions) {
            handleExceptionsResiliently(e2, 0n);
            handleExceptionsResiliently(e, numExceptionsHandled);
        }
    }

[0036] When dead places are detected by catching and identifying a dead place exception, they are handled by the following method. A key point is that the application program is referring to places using virtual places which never die. Virtual places are mapped to physical places. After a physical place dies, a live physical place is mapped to the virtual place previously corresponding to the dead physical place.
    public def handleDeadPlace(deadPlace: Place): void {
        if (deadPlace.id() >= 0 && deadPlace.id() < isDeadPlace.size && !isDeadPlace(deadPlace.id())) {
            isDeadPlace(deadPlace.id()) = true;
            val virtualId = virtualPlaces.physicalToVirtual(deadPlace);
            if (virtualId != VirtualPlaceMap.NONEXISTENT_PLACE) {
                virtualPlaces.replaceVirtualPlace(virtualId, getNewPlace());
            }
        }
    }

[0037] The restoreResiliently method shown previously calls resumeExecution() to continue execution of the application program after failures have been properly handled.
Below is the code for resumeExecution:

    // Resume computation after a failure has been handled.
    def resumeExecution() {
        switch (computationType) {
            case GENERAL:
                runResiliently();
                break;
            case ITERATIVE:
                iterateResiliently();
                break;
            default:
                throw new Exception("Error in ResilientExecutor.X10 unknown value of computationType in resumeExecution(): " + computationType);
        }
    }

[0038] Note that it is straightforward, within the spirit and scope of the present disclosure, to extend resumeExecution to handle other types of computations besides GENERAL and ITERATIVE.
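The recursive treatment of nested exceptions described in paragraph [0035] can be sketched as a walk over a tree of errors. The Python below is a simplified illustration, not a port of handleExceptionsResiliently: DeadPlaceError and MultipleErrors are hypothetical stand-ins for the X10 exception types, and an unexpected error is re-raised here, whereas the X10 listing prints a stack trace and terminates the place.

```python
class DeadPlaceError(Exception):
    """Hypothetical stand-in for a dead place exception."""

    def __init__(self, place):
        super().__init__(f"place {place} died")
        self.place = place


class MultipleErrors(Exception):
    """Hypothetical stand-in for an exception that bundles other exceptions."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} nested errors")
        self.errors = list(errors)


def collect_dead_places(error):
    """Return the dead place ids found anywhere in a nested error bundle."""
    dead = []
    for inner in error.errors:
        if isinstance(inner, MultipleErrors):
            # Bundles can contain bundles, so recurse into them.
            dead.extend(collect_dead_places(inner))
        elif isinstance(inner, DeadPlaceError):
            dead.append(inner.place)
        else:
            # Anything other than a dead place is not recoverable here.
            raise inner
    return dead


bundle = MultipleErrors([
    DeadPlaceError(2),
    MultipleErrors([DeadPlaceError(5)]),
])
dead_places = collect_dead_places(bundle)
```

Each place id collected this way would then be passed to a handler that remaps the affected virtual place, as handleDeadPlace does in the listing above.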
[0039] A main aspect of the solution disclosed herein, in accordance with a preferred embodiment of the present disclosure, is checkpointing. The resilient executor provides the following methods for checkpointing which are invoked by application programs.
    /* Checkpoint the computation. This is called transitively from within
     * runResiliently, so exceptions are already being caught and handled. */
    public def checkpoint(): void {
        finish for (i in 0..(virtualPlaces.totalVirtualPlaces() - 1)) {
            val p:Place = virtualPlaces.virtualToPhysical(i);
            async at (p) checkpointAtPlace();
        } // this could fail with dead place exception
        checkpointAtPlace0(); // assume that this won't fail
        numCheckPoints++;
        if (numCheckPoints > 1) {
            deleteCheckPoint();
        }
    }

    private def checkpointAtPlace() {
        switch (computationType) {
            case GENERAL:
                computation.checkpointAtPlace(); // application-specific method
                break;
            case ITERATIVE:
                iterativeComputation.checkpointAtPlace(); // application-specific method
                break;
            default:
                throw new Exception("Error in ResilientExecutor.X10: unknown value of computationType in checkpointAtPlace(): " + computationType);
        }
    }

    private def checkpointAtPlace0() {
        switch (computationType) {
            case GENERAL:
                computation.checkpointAtPlace0(); // application-specific method
                break;
            case ITERATIVE:
                iterativeComputation.checkpointAtPlace0(); // application-specific method
                break;
            default:
                throw new Exception("Error in ResilientExecutor.X10: unknown value of computationType in checkpointAtPlace0(): " + computationType);
        }
    }

[0040] According to the present example which uses an X10 implementation, a special place, Place 0, is assumed to never fail. Therefore, an embodiment of the present disclosure can safely checkpoint at least some application state information at Place 0. This is one reason for having the checkpointAtPlace0() method. If a system cannot assume that a Place 0 exists which never fails, then an application would not use the checkpointAtPlace0() method.
[0041] When a new checkpoint c1 is taken, the present example maintains the previous checkpoint c0 stored in resilient storage. That way, if a failure occurs while c1 is being computed, the system will still have c0 to restore a state to the executing application. After c1 has been completely computed, it is safe to delete c0. The ResilientExecutor, according to the present example, has the following methods to delete the old checkpoint c0 right after c1 has been completely computed:

    /* Delete previous checkpoint. This is only called after a new
     * checkpoint has successfully completed. */
    def deleteCheckPoint(): void {
        finish for (i in 0..(virtualPlaces.totalVirtualPlaces() - 1)) {
            val p:Place = virtualPlaces.virtualToPhysical(i);
            async at (p) deleteAtPlace();
        }
    }

    private def deleteAtPlace() {
        switch (computationType) {
            case GENERAL:
                computation.deleteAtPlace(); // application-specific method
                break;
            case ITERATIVE:
                iterativeComputation.deleteAtPlace(); // application-specific method
                break;
            default:
                throw new Exception("Error in ResilientExecutor.X10: unknown value of computationType in deleteAtPlace(): " + computationType);
        }
    }

[0042] The application has the option of defining a deleteAtPlace() method which deletes the previous checkpoint right away. If the application chooses not to do so, the application will still continue to run resiliently and correctly. The only drawback may be that the old checkpoint c0 will continue to exist stored in resilient storage (instead of being immediately deleted) until the next checkpoint is taken and overwrites c0.
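The rule in paragraph [0041], that the previous checkpoint c0 is kept until the new checkpoint c1 is complete, can be sketched with versioned keys. This Python model is an assumption-laden illustration rather than the patent's ResilientMap: the CheckpointStore class, its version counter, and the use of a None state to simulate a place failing mid-checkpoint are all hypothetical.

```python
class CheckpointStore:
    """Keeps the last complete checkpoint until its successor is complete."""

    def __init__(self):
        self.data = {}         # (version, place) -> saved state
        self.committed = None  # version of the last complete checkpoint

    def checkpoint(self, states_by_place):
        new_version = 0 if self.committed is None else self.committed + 1
        try:
            for place, state in states_by_place.items():
                if state is None:  # simulated failure of this place
                    raise RuntimeError(f"place {place} failed during checkpoint")
                self.data[(new_version, place)] = state
        except RuntimeError:
            # Discard the partial new checkpoint; the old one is untouched.
            for key in [k for k in self.data if k[0] == new_version]:
                del self.data[key]
            raise
        old_version = self.committed
        self.committed = new_version
        if old_version is not None:
            # Only now is it safe to delete the previous checkpoint.
            for key in [k for k in self.data if k[0] == old_version]:
                del self.data[key]

    def restore(self):
        """Return the per-place state of the last complete checkpoint."""
        return {p: s for (v, p), s in self.data.items() if v == self.committed}


store = CheckpointStore()
store.checkpoint({0: "a0", 1: "b0"})   # c0
store.checkpoint({0: "a1", 1: "b1"})   # c1 completes, so c0 is deleted
try:
    store.checkpoint({0: "a2", 1: None})  # fails part-way through
    failed = False
except RuntimeError:
    failed = True
recovered = store.restore()
```

After the failed third checkpoint, the store still returns the second checkpoint in full, which is the property the paragraph relies on.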
[0043] An Example of Use of the ResilientExecutor Class by Applications
With reference to FIGs. 1 to 7, below will be discussed an example of an information processing system 100 that can resiliently execute an application using the ResilientExecutor class. For illustration purposes only, and not for any limitation of the present disclosure, an example application to be executed by the information processing system 100 is a molecular mechanics simulation implemented as a ResilientIterativeComputation.
[0044] The application creates an instance of a ResilientExecutor class:

    resExec = new ResilientExecutor(this);

The resilient executor is then invoked via:

    resExec.iterateResiliently();

The application implements a number of methods which the ResilientExecutor instance executes resiliently using the exception handling described earlier. These methods include:

    // called by ResilientExecutor
    public def step(): void {
        mdStep(timestep);
        step++;
    }

    public def notFinished(): Boolean {
        return (step < numSteps);
    }

The following code checkpoints the application and is invoked by the ResilientExecutor instance:

    // checkpoint computation
    public def checkpoint(): void {
        if ((step % ITERATIONS_PER_BACKUP) == 0) {
            // checkpoint atoms and forceField at place 0
            resExec.checkpoint(); // resExec is the ResilientExecutor instance
        }
    }

[0045] Note that, in order to reduce checkpointing overhead, a checkpoint is not necessarily taken after every iteration. Checkpointing after each iteration may incur high overhead, but recovery time will be short. Checkpointing less frequently (i.e. ITERATIONS_PER_BACKUP is an integer larger than 1) reduces checkpointing overhead, but recovery time will be longer. There is thus a trade-off between checkpointing overhead and recovery time.
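The trade-off in paragraph [0045] can be made concrete with a small calculation. The sketch below is in Python rather than X10 so that it is self-contained. The function names checkpoint_steps and steps_lost are hypothetical. It assumes that the checkpoint test is applied after each completed step, and that a failure is detected only after any checkpoint due at that step has completed.

```python
def checkpoint_steps(num_steps: int, iterations_per_backup: int) -> list[int]:
    """Step counts at which a checkpoint is taken (step % interval == 0)."""
    return [s for s in range(1, num_steps + 1) if s % iterations_per_backup == 0]


def steps_lost(failure_step: int, iterations_per_backup: int) -> int:
    """Completed steps that must be recomputed after a failure.

    failure_step is the number of steps completed when the failure occurs.
    Recovery rolls back to the most recent multiple of the interval.
    """
    last_checkpoint = (failure_step // iterations_per_backup) * iterations_per_backup
    return failure_step - last_checkpoint
```

For example, with 10 steps and an interval of 5, two checkpoints are taken instead of ten, but a failure after 9 completed steps would require 4 steps to be recomputed, compared with none when every step is checkpointed.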
[0046] The application implements the following application-specific checkpointing methods which are invoked by the ResilientExecutor instance:

    public def checkpointAtPlace0(): void {
        forceFieldBackup = Runtime.deepCopy(forceField);
    }

    public def checkpointAtPlace(): void {
        val key = resExec.key(ATOMS_ROOT, false);
        backup.put(key, atoms());
    }

[0047] The application implements the following application-specific method which is invoked by the ResilientExecutor instance to delete obsolete checkpoints. It should be noted that this method is optional. If it is not implemented, the program will continue to operate correctly and resiliently. The advantage of implementing the method is that it reduces the space overhead consumed by checkpoints.
    public def deleteAtPlace(): void {
        val key = resExec.key(ATOMS_ROOT, false);
        backup.remove(key);
    }

[0048] In the event of a failure, the following application-specific method is invoked by the ResilientExecutor instance to restore the state of the computation from a previous checkpoint:

    public def restore() {
        virtualPlaceMap = resExec.getVirtualPlaceMap();
        placeGroup = new PlaceGroupUnordered(virtualPlaceMap.getVirtualMap());
        forceField = Runtime.deepCopy(forceFieldBackup);
        atoms = PlaceLocalHandle.make[Rail[MMAtom]](placeGroup, ()=>(backup.get(resExec.key(ATOMS_ROOT, true))));
        step = (resExec.numberOfCheckpoints() - 1) * ITERATIONS_PER_BACKUP;
    }

[0049] Virtual places are also a key element of this application. The application refers to virtual places instead of physical places throughout the computation. These virtual places do not change even if one or more physical places die while the computation is progressing.
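The idea that virtual places stay fixed while physical places may die can be sketched as follows. The sketch is in Python rather than X10 so that it is self-contained. The method names mirror totalVirtualPlaces() and virtualToPhysical() used earlier in this description. The pool of spare live places and the handle_failure method are assumptions made for illustration only, since the source does not specify how a replacement live place is chosen.

```python
class VirtualPlaceMap:
    """Hypothetical mapping from virtual place ids to physical place ids."""

    def __init__(self, physical_places, spares):
        # Index in the list is the virtual place id; value is the physical place id.
        self._mapping = list(physical_places)
        # Assumed pool of live physical places available as replacements.
        self._spares = list(spares)

    def total_virtual_places(self) -> int:
        return len(self._mapping)

    def virtual_to_physical(self, virtual_id: int) -> int:
        return self._mapping[virtual_id]

    def handle_failure(self, dead_physical: int) -> None:
        """Remap every virtual place that pointed at the dead physical place."""
        for virtual_id, physical in enumerate(self._mapping):
            if physical == dead_physical:
                if not self._spares:
                    raise RuntimeError("no live spare place available")
                self._mapping[virtual_id] = self._spares.pop(0)
```

An application that only ever calls virtual_to_physical(i) keeps using the same virtual ids before and after a failure. Only the mapping changes.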
[0050] According to the present example, the information processing system 100 (see FIG. 1) comprises at least one processor 102 communicatively coupled with memory 104 and with persistent non-volatile memory 106. The persistent memory 106 can store computer instructions 107, data, configuration parameters, and other information that is used by the processor 102. Any of these components stored in persistent memory 106 can also be stored, individually or in any combination, in main memory 104 and in the cache memory of the processor 102. According to the present example, a bus communication architecture 108 in the information processing system 100 facilitates communicatively coupling the various elements of the information processing system 100. A network interface device 124 is communicatively coupled with the processor 102 and provides a communication interface to communicate with one or more external networks 126.
[0051] While FIG. 1 is one possible embodiment of the invention, many other embodiments are possible. The invention is of particular relevance to systems with multiple processors. Thus, the earlier descriptions of the invention are more general and are applicable to a much wider variety of systems than the one depicted in FIG. 1.
[0052] The instructions 107 may comprise one or more of the following which have been discussed in more detail above: a ResilientExecutor class 130, a ResilientComputation 132, a ResilientIterativeComputation 134, a VirtualPlaceMap 136, a ResilientMap 138, a PlaceGroupUnordered 140, a ResExec method 144, and other application methods 142.
[0053] In persistent memory 106, there is a ResilientMap storage area 118. A computer storage device 120 is communicatively coupled with the processor 102. The computer storage device 120 can be communicatively coupled with a computer readable storage medium 122. The computer readable storage medium 122 can store at least a portion of the instructions 107.
[0054] A user interface 110 is communicatively coupled with the processor 102. The user interface 110 comprises a user output interface 112 and a user input interface 114. The user output interface 112 includes, according to the present example, a display, an audio output interface such as one or more speakers, and various indicators such as visual indicators, audible indicators, and haptic indicators. The user input interface 114 includes, according to the present example, a keyboard, a mouse or other cursor navigation module such as a touch screen, touch pad, or pen input interface, and a microphone for input of audible signals such as user speech, data and commands that can be recognized by the processor 102.
[0055] FIG. 2 illustrates an example ResilientComputation interface 132 which includes several methods. The run method runs the computation and additionally creates a checkpoint of the state of the application computation periodically. If there is a failure in one or more places executing computations of the application, the restore method can be invoked to restore the state of the application computation to the last checkpoint stored in the ResilientMap storage 118. Checkpoint data structures may be saved at place 0. Also, specific checkpoint data structures at specific places may be checkpointed to the ResilientMap storage 118. After a new checkpoint of a state of an application computation is saved to the ResilientMap storage 118, optionally the previous checkpoint stored in ResilientMap storage 118 can be deleted from the resilient storage. This optimizes space usage by deleting stale and unnecessary application state information from the ResilientMap storage 118.
[0056] FIG. 3 illustrates an example ResilientIterativeComputation interface 134 which includes several methods. The step method is used to advance the state of a computation by one step. The notFinished method indicates whether the computation should continue executing. It is typically invoked by the resilient framework after each step of an iterative computation. A restore method restores the state of the application computation to the last stored checkpoint after a failure is detected. A method deleteAtPlace can be invoked by the resilient framework to delete a previously stored checkpoint from the ResilientMap storage 118. This method optimizes space usage by deleting stale and unnecessary application state information from the ResilientMap storage 118.
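How a resilient framework might drive these methods can be sketched as below. The sketch is in Python rather than X10 so that it is self-contained, and it is not the ResilientExecutor implementation. PlaceFailure is a hypothetical stand-in for whatever exception signals a dead place, and iterate_resiliently is an illustrative analogue of iterateResiliently(). A failure that occurs inside restore() itself is not handled here.

```python
from abc import ABC, abstractmethod


class PlaceFailure(Exception):
    """Hypothetical stand-in for the exception raised when a place dies."""


class ResilientIterativeComputation(ABC):
    """Application-supplied methods, loosely following the interface of FIG. 3."""

    @abstractmethod
    def step(self) -> None: ...

    @abstractmethod
    def not_finished(self) -> bool: ...

    @abstractmethod
    def checkpoint(self) -> None: ...

    @abstractmethod
    def restore(self) -> None: ...


def iterate_resiliently(comp: ResilientIterativeComputation) -> int:
    """Run comp to completion and return the number of failures recovered from."""
    failures = 0
    while True:
        try:
            while comp.not_finished():
                comp.step()
                comp.checkpoint()  # the application decides whether to save
            return failures
        except PlaceFailure:
            failures += 1
            comp.restore()  # roll back to the last checkpoint, then retry
```

The application only implements the four methods. Catching the failure, restoring, and resuming the loop all live in the driver, which is the division of labour the description attributes to the framework.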
[0057] Referring to FIGs. 4 and 5, the ResilientExecutor class 130 includes several methods which can be invoked by an application. The constructor that takes a ResilientComputation creates a new instance for a computation of the application. The constructor that takes a ResilientIterativeComputation creates a new instance for an iterative computation of the application.
[0058] The runResiliently method invokes an application specific run method to resiliently execute one or more computations of the application. The runResiliently method can invoke an application specific restore method that restores application state information from a previous checkpoint stored in the ResilientMap storage 118.
[0059] A checkpoint method stores application specific data in ResilientMap storage 118.
A numberOfCheckpoints method provides a total number of completed checkpoints so far. A key method can be invoked to compute a key for an object to be used for checkpoint operations. This allows application programs to use the ResilientMap Interface for checkpointing data without having to manually calculate keys. A getVirtualPlaceMap method returns the virtual place map corresponding to a computation of the application. This method allows the application to use virtual places.
[0060] Various Aspects of a Resilient Framework According to the Present Disclosure
1) Use of a resilient store for checkpointing, and efficient and easy-to-use checkpointing techniques.
2) Use of virtual places to mask dead places.
3) An effective way to catch relevant exceptions and handle failures during execution, a restore phase after a failure, and an exception-handling method in the resilient framework.
4) Object-oriented framework and API to make the approach easy to use.
[0061] The present disclosure has illustrated by example a novel information processing system and a novel method that provide a new computer programming framework for programmers to write resilient programs. Low level details such as catching and handling failures are handled by special software. This relieves significant programming burdens from software programmers and particularly from programmers of modern parallel computing applications.
[0062] Non-Limiting Examples
As will be appreciated by one of ordinary skill in the art, aspects of the present disclosure may be embodied as a system, method, or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module", or "system".
[0063] Various embodiments of the present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
[0064] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0065] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0066] Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
[0067] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0068] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0069] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0070] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0071] While the computer readable storage medium is shown in an example embodiment to be a single medium, the term "computer readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable storage medium" shall also be taken to include any non-transitory medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methods of the subject disclosure.
[0072] The term "computer-readable storage medium" shall accordingly be taken to include, but not be limited to: solid-state memories such as a memory card or other package that houses one or more read-only (non-volatile) memories, random access memories, or other re-writable (volatile) memories, a magneto-optical or optical medium such as a disk or tape, or other tangible media which can be used to store information. Accordingly, the disclosure is considered to include any one or more of a computer-readable storage medium, as listed herein and including art-recognized equivalents and successor media, in which the software implementations herein are stored.
[0073] Although the present specification may describe components and functions implemented in the embodiments with reference to particular standards and protocols, the disclosure is not limited to such standards and protocols. Each of the standards represents examples of the state of the art. Such standards are from time-to-time superseded by faster or more efficient equivalents having essentially the same functions.
[0074] The illustrations of examples described herein are intended to provide a general understanding of the structure of various embodiments, and they are not intended to serve as a complete description of all the elements and features of apparatus and systems that might make use of the structures described herein. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Figures are also merely representational and may not be drawn to scale. Certain proportions thereof may be exaggerated, while others may be minimized. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
[0075] Although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. The examples herein are intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, are contemplated herein.
[0076] The Abstract is provided with the understanding that it is not intended to be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, various features are grouped together in a single example embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.
[0077] Although only one processor 102 is illustrated for information processing system 100, information processing systems with multiple CPUs or processors can be used equally effectively. Various embodiments of the present disclosure can further incorporate interfaces that each includes separate, fully programmed microprocessors that are used to off-load processing from the processor 102. An operating system (not shown) included in main memory for the information processing system 100 may be a suitable multitasking and/or multiprocessing operating system, such as, but not limited to, any of the Linux, UNIX, Windows, and Windows Server based operating systems. Various embodiments of the present disclosure are able to use any other suitable operating system. Various embodiments of the present disclosure utilize architectures, such as an object oriented framework mechanism, that allows instructions of the components of operating system (not shown) to be executed on any processor located within the information processing system. Various embodiments of the present disclosure are able to be adapted to work with any data communications connections including present day analog and/or digital techniques or via a future networking mechanism.
[0078] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "another", as used herein, is defined as at least a second or more. The terms "including" and "having," as used herein, are defined as comprising (i.e., open language). The term "coupled," as used herein, is defined as "connected," although not necessarily directly, and not necessarily mechanically. "Communicatively coupled" refers to coupling of components such that these components are able to communicate with one another through, for example, wired, wireless or other communications media. The terms "communicatively coupled" or "communicatively coupling" include, but are not limited to, communicating electronic control signals by which one element may direct or control another. The term "configured to" describes hardware, software or a combination of hardware and software that is adapted to, set up, arranged, built, composed, constructed, designed or that has any combination of these characteristics to carry out a given function. The term "adapted to" describes hardware, software or a combination of hardware and software that is capable of, able to accommodate, to make, or that is suitable to carry out a given function.
[0079] The terms "controller", "computer", "processor", "server", "client", "computer system", "computing system", "personal computing system", "processing system", or "information processing system", describe examples of a suitably configured processing system adapted to implement one or more embodiments herein. Any suitably configured processing system is similarly able to be used by embodiments herein, for example and not for limitation, a personal computer, a laptop personal computer (laptop PC), a tablet computer, a smart phone, a mobile phone, a wireless communication device, a personal digital assistant, a workstation, and the like. A processing system may include one or more processing systems or processors. A processing system can be realized in a centralized fashion in one processing system or in a distributed fashion where different elements are spread across several interconnected processing systems.
[0080] The term "place" as used herein is intended to broadly describe at least a part of an executing computation that may be for an application, such as a process (or in some cases, at least one thread). The term "virtual place" as used herein is intended to broadly describe a place that is referenced by executing programs, where the actual physical place on which a computation is executing is hidden from the referencing programs that use the virtual place instead of the actual physical place. Virtual places are mapped to physical places. Mappings of virtual places to physical places can be updated to mask physical place failures.
[0081] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description herein has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the examples in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the examples presented or claimed. The disclosed embodiments were chosen and described in order to explain the principles of the embodiments and the practical application, and to enable others of ordinary skill in the art to understand the various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the appended claims below cover any and all such applications, modifications, and variations within the scope of the embodiments.
The ResilientComputation interface may include the following methods:

    // Run the computation. run() should also checkpoint the state of the
    // computation periodically by calling the checkpoint() method of ResilientExecutor.
    public def run(): void;

    // Restore the state of the computation to the last consistent checkpoint after a failure.
    public def restore(): void;

    // Checkpoint application-specific data structures at a specific place. Since the
    // ResilientExecutor class provides a fault-tolerant way of iterating over all places for
    // checkpointing and ResilientMap provides a convenient interface for resiliently storing data,
    // this method is intended to be easy to implement.
    public def checkpointAtPlace(): void;

    // Performs application-specific operations to checkpoint data structures
    // at place 0. This method takes advantage of the fact that place 0 will not fail. If there is
    // nothing to specifically checkpoint at place 0, this method can be null.
    public def checkpointAtPlace0(): void;

    // Delete data from a previous checkpoint. This method optimizes space usage. The programmer
    // can choose not to implement the method, in which case the obsolete checkpoint will eventually be
    // deleted but will be kept around for longer.
    public def deleteAtPlace(): void;

The ResilientIterativeComputation interface may include the following methods:

    // Advance the state of the computation by one step.
    public def step(): void;

    // Return true iff the computation executing under the ResilientExecutor framework
    // should continue executing. It is typically called after each step.
    public def notFinished(): Boolean;

    // Restore the state of the computation to the last consistent checkpoint after a failure.
    public def restore(): void;

    // Checkpoint application-specific data structures at a specific place. Since the
    // ResilientExecutor provides a fault-tolerant way of iterating over all places for
    // checkpointing and ResilientMap provides a convenient interface for resiliently storing data,
    // this method is intended to be easy to implement.
    public def checkpointAtPlace(): void;

    // Performs application-specific operations to checkpoint data structures
    // at place 0. This method takes advantage of the fact that place 0 will not fail. If there is
    // nothing to specifically checkpoint at place 0, this method can be null.
    public def checkpointAtPlace0(): void;

    // Delete data from a previous checkpoint. This method optimizes space usage. The programmer
    // can choose not to implement the method, in which case the obsolete checkpoint will eventually be
    // deleted but will be kept around for longer.
    public def deleteAtPlace(): void;

The ResilientExecutor class may include the following methods, which an application can invoke:

    // Create a new instance of the ResilientExecutor class for the application "comp".
    public def this(comp:ResilientComputation);

    // Create a new instance of the ResilientExecutor class for the iterative application "comp".
    public def this(comp:ResilientIterativeComputation);

    // Run computation resiliently by invoking the application-specific run method in
    // ResilientComputation. Handle failures by resiliently invoking the application-specific
    // restore method in ResilientComputation.
    public def runResiliently(): void;

    // Run iterative computation resiliently, handling failures.
    public def iterateResiliently(): void;

    // Checkpoint the computation resiliently over all places. Store application-specific data in
    // resilient storage by invoking the checkpointAtPlace method in ResilientComputation.
    public def checkpoint(): void;

The ResilientExecutor class may include the following methods which an application can invoke:

    // Return number of completed checkpoints so far.
    public def numberOfCheckpoints(): Int;

    // Compute a key for an object to be used for checkpoint operations. This allows application
    // programs to use the ResilientMap interface for checkpointing data without having to
    // manually calculate keys.
    public def key(keyRoot: String, restore: Boolean): String;

    // Return virtual place map corresponding to the computation. This method allows the
    // application to use virtual places.
* public def.
Claims (25)
- 1. A method for supporting resilient execution of computer programs comprising the steps of: providing a resilient store wherein information in the resilient store can be accessed in the event of a failure; periodically checkpointing application state in the resilient store; providing a resilient executor which comprises software which executes applications by catching failures; using the resilient executor to execute at least one application; and in response to the resilient executor detecting a failure, restoring application state information from a checkpoint in the resilient store, the resilient executor resuming execution of the at least one application.
- 2. The method of claim 1, in which the resilient executor further comprises: an interface allowing applications to use the resilient executor; a resilient run method which an application invokes via the interface which executes the application while detecting and catching place failures as exceptions, the place failures comprising failures at a place of computation in the application; and a recovery method which is invoked when the resilient run method catches an exception resulting from a failed place, wherein the recovery method recovers from the place failure, restores the application to application state information from a checkpoint stored in the resilient store, and resumes execution of the application with the application state information restored from the checkpoint stored in the resilient store.
- 3. The method of claim 1, further comprising: providing an interface allowing programs to explicitly reference a place to communicate with or execute at least one computation on the place, wherein each place is an entity executing a computation; providing a virtual place abstraction layer which defines a mapping between virtual places and physical places; providing an interface allowing an application to communicate with or execute at least one computation on a place p1 by referencing a virtual place p2 which is mapped to physical place p1; and in response to a physical place p3 failing, wherein virtual place p4 maps to physical place p3, updating the mapping so that virtual place p4 maps to physical place p5 wherein p5 is live.
- 4. The method of claim 3 in which a place is at least one of a process and at least one thread.
- 5. The method of claim 1, in which the resilient executor includes a run method which runs an application while catching exceptions.
- 6. The method of claim 1, in which in response to catching at least one exception, the resilient executor handles the at least one exception, restores computation of the application from a previous checkpoint, and resumes execution of the computation.
- 7. The method of claim 1, in which the resilient executor invokes application-specific code to restore computation of the application from a previous checkpoint.
- 8. The method of claim 1, in which the resilient executor includes a method for resiliently executing an iterative computation.
- 9. The method of claim 1, in which the resilient executor handles an iterative computation by invoking at least one of an application-specific method to execute an iteration of computation of the application and an application-specific method to determine if the computation has finished.
- 10. The method of claim 1, in which the resilient executor handles at least one exception by recursively catching and handling additional exceptions which occur.
- 11. The method of claim 1, in which the resilient executor calls an application-specific method to checkpoint data across multiple places.
- 12. The method of claim 1, in which the resilient executor is provided to an application as an object.
- 13. The method of claim 12, in which the application invokes a method on the object to resiliently run the application.
- 14. The method of claim 12, in which the application program invokes a method on the object to resiliently checkpoint the application.
- 15. An information processing system capable of supporting resilient execution of computer programs, the information processing system comprising: memory; persistent memory for storing data and computer instructions; a resilient store, communicatively coupled with the memory and the persistent memory, wherein application state information stored in the resilient store can be accessed in response to detection of a failure of an application executing in the information processing system; a resilient executor, communicatively coupled with the memory and the persistent memory, for executing computations of applications by catching failures in the execution of the computations; a processor, communicatively coupled with the resilient executor, resilient store, the memory, the persistent memory, and wherein the processor, responsive to executing computer instructions, performs operations comprising: periodically checkpointing application state in the resilient store; executing, with the resilient executor, computations of an application while catching failures in the execution of the computations; restoring, based on the resilient executor detecting a failure in the execution of a computation of the application, application state information for the application from a checkpoint in the resilient store; and resuming, with the resilient executor, execution of the computation of the application with the restored application state information.
- 16. The information processing system of claim 15, wherein the resilient executor comprises: an interface allowing applications to use it; a resilient run method which an application invokes via the interface which executes the application while detecting and catching place failures as exceptions; and a recovery method which is invoked when the resilient run method catches an exception resulting from a failed place wherein the recovery method recovers from the place failure, restores the application to a previous checkpoint, and resumes execution of the application from the restored checkpoint.
- 17. A computer readable storage medium, comprising computer instructions which, responsive to being executed by a processor, cause the processor to perform operations for supporting resilient execution of computer programs, the operations comprising: providing a resilient store wherein information in the resilient store can be accessed in the event of a failure; periodically checkpointing application state in the resilient store; providing a resilient executor which comprises software which executes applications by catching failures; using the resilient executor to execute at least one application; and in response to the resilient executor detecting a failure, restoring application state information to the at least one application from a checkpoint stored in the resilient store, the resilient executor resuming execution of the at least one application with the restored application state information.
- 18. The computer readable storage medium of claim 17, wherein the resilient executor comprises: an interface allowing applications to use it; a resilient run method which an application invokes via the interface which executes the application while detecting and catching place failures as exceptions; and a recovery method which is invoked when the resilient run method catches an exception resulting from a failed place wherein the recovery method recovers from the place failure, restores the application to a previous checkpoint, and resumes execution of the application from the restored checkpoint.
- 19. The computer readable storage medium of claim 17, wherein the processor performed operations further comprising: providing an interface allowing programs to explicitly reference a place to communicate with or execute at least one computation on the place, wherein each place comprises an entity executing a computation; providing a virtual place abstraction layer which defines a mapping between virtual places and physical places; providing an interface allowing an application to communicate with or execute at least one computation on a place p1 by referencing a virtual place p2 which is mapped to physical place p1; and in response to a physical place p3 failing, wherein virtual place p4 maps to physical place p3, updating the mapping so that virtual place p4 maps to physical place p5 wherein p5 is live.
- 20. The computer readable storage medium of claim 19, wherein a place is at least one of a process and at least one thread.
- 21. A computer program comprising program code means adapted to perform the method of any of claims 1 to 14, when said program is run on a computer.
- 22. A method for supporting resilient execution of computer programs substantially as herein described and illustrated.
- 23. An information processing system substantially as herein described and illustrated.
- 24. A computer readable medium substantially as herein described and illustrated.
- 25. A computer program substantially as herein described and illustrated.
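By way of non-limiting illustration only, the method of claims 1 to 3 and 9 might be sketched in Python as follows. The sketch is not taken from the specification: every class, method and parameter name (`ResilientStore`, `VirtualPlaceMap`, `ResilientExecutor`, `PlaceFailure`, `CountingApp`, `step`, `is_finished`, `checkpoint_interval`) is a hypothetical name chosen for the example, the resilient store is modelled as a simple in-memory object, and a place failure is simulated by raising an exception. Failures that occur during recovery itself (claim 10) are not handled here.

```python
import copy


class PlaceFailure(Exception):
    """Hypothetical exception signalling that a physical place has failed."""

    def __init__(self, physical_place):
        super().__init__(f"physical place {physical_place} failed")
        self.physical_place = physical_place


class ResilientStore:
    """In-memory stand-in for a store whose contents survive a place failure."""

    def __init__(self):
        self._checkpoint = None

    def save(self, checkpoint):
        self._checkpoint = copy.deepcopy(checkpoint)

    def load(self):
        return copy.deepcopy(self._checkpoint)


class VirtualPlaceMap:
    """Mapping from virtual place ids to live physical place ids."""

    def __init__(self, active, spares):
        self._map = dict(enumerate(active))  # virtual id -> physical id
        self._spares = list(spares)          # live places not yet in use

    def physical(self, virtual):
        return self._map[virtual]

    def remap(self, failed_physical):
        # Re-point each virtual place that used the failed place at a live spare.
        for virtual, physical in self._map.items():
            if physical == failed_physical:
                self._map[virtual] = self._spares.pop(0)


class ResilientExecutor:
    """Runs an iterative application, catching place failures as exceptions."""

    def __init__(self, store, places, checkpoint_interval=2):
        self.store = store
        self.places = places
        self.interval = checkpoint_interval

    def run(self, app):
        iteration = 0
        self.store.save((iteration, app.state))  # initial checkpoint
        while not app.is_finished():
            try:
                app.step(self.places)
                iteration += 1
                if iteration % self.interval == 0:  # periodic checkpoint
                    self.store.save((iteration, app.state))
            except PlaceFailure as failure:
                # Recovery: update the mapping, restore state, then resume.
                self.places.remap(failure.physical_place)
                iteration, app.state = self.store.load()
        return app.state


class CountingApp:
    """Hypothetical application: adds 1..limit, one number per iteration."""

    def __init__(self, limit, fail_at):
        self.state = {"i": 0, "total": 0}
        self.limit = limit
        self._fail_at = fail_at  # iteration at which virtual place 1 fails once

    def is_finished(self):
        return self.state["i"] >= self.limit

    def step(self, places):
        nxt = self.state["i"] + 1
        if nxt == self._fail_at:
            self._fail_at = None  # simulate a single failure only
            raise PlaceFailure(places.physical(1))
        self.state["total"] += nxt
        self.state["i"] = nxt


places = VirtualPlaceMap(active=[0, 1], spares=[2])
executor = ResilientExecutor(ResilientStore(), places, checkpoint_interval=2)
print(executor.run(CountingApp(limit=5, fail_at=4)))  # prints {'i': 5, 'total': 15}
print(places.physical(1))                             # prints 2
```

In this sketch the application refers only to virtual place 1. When the physical place behind it fails during the fourth iteration, the executor re-points virtual place 1 at the spare physical place 2, rolls the application back to the checkpoint taken after the second iteration, and re-executes the lost iterations.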
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/657,132 US9652336B2 (en) | 2015-03-13 | 2015-03-13 | Resilient programming frameworks for handling failures in parallel programs |
US14/749,835 US9652337B2 (en) | 2015-03-13 | 2015-06-25 | Resilient programming frameworks for handling failures in parallel programs |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201604052D0 GB201604052D0 (en) | 2016-04-20 |
GB2537038A GB2537038A (en) | 2016-10-05 |
GB2537038B GB2537038B (en) | 2017-08-30 |
Family
ID=55859230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1604052.9A Active GB2537038B (en) | 2015-03-13 | 2016-03-09 | Resilient programming frameworks for handling failures in parallel programs |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2537038B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116208705B (en) * | 2023-04-24 | 2023-09-05 | 荣耀终端有限公司 | Equipment abnormality recovery method and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10187616A (en) * | 1996-12-26 | 1998-07-21 | Toshiba Corp | State recording and reproducing method, computer system realising the same method, and memory device where the same method is programmed and stored |
US7165186B1 (en) * | 2003-10-07 | 2007-01-16 | Sun Microsystems, Inc. | Selective checkpointing mechanism for application components |
US20080276239A1 (en) * | 2007-05-03 | 2008-11-06 | International Business Machines Corporation | Recovery and restart of a batch application |
CN103853634A (en) * | 2014-02-26 | 2014-06-11 | 北京优炫软件股份有限公司 | Disaster recovery system and disaster recovery method |
Also Published As
Publication number | Publication date |
---|---|
GB2537038B (en) | 2017-08-30 |
GB201604052D0 (en) | 2016-04-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US10831616B2 (en) | Resilient programming frameworks for iterative computations | |
US10621030B2 (en) | Restoring an application from a system dump file | |
US7774636B2 (en) | Method and system for kernel panic recovery | |
US9740582B2 (en) | System and method of failover recovery | |
EP4095677A1 (en) | Extensible data transformation authoring and validation system | |
US20110246823A1 (en) | Task-oriented node-centric checkpointing (toncc) | |
Losada et al. | Resilient MPI applications using an application-level checkpointing framework and ULFM | |
US11030060B2 (en) | Data validation during data recovery in a log-structured array storage system | |
US20150082303A1 (en) | Determining optimal methods for creating virtual machines | |
US20100017581A1 (en) | Low overhead atomic memory operations | |
US10185630B2 (en) | Failure recovery in shared storage operations | |
JP6134390B2 (en) | Dynamic firmware update | |
CN114144764A (en) | Stack tracing using shadow stack | |
US10127270B1 (en) | Transaction processing using a key-value store | |
CN115136133A (en) | Single use execution environment for on-demand code execution | |
GB2537038A (en) | Resilient programming frameworks for handling failures in parallel programs | |
Rodriguez et al. | Reducing application-level checkpoint file sizes: Towards scalable fault tolerance solutions | |
US9836315B1 (en) | De-referenced package execution | |
Gankevich et al. | Subordination: providing resilience to simultaneous failure of multiple cluster nodes | |
Weeks et al. | Challenges in developing MPI fault-tolerant Fortran applications | |
Hao et al. | Check-pointing approach for fault tolerance in OpenSHMEM | |
US20240168786A1 (en) | Systems and methods for a remotebuild storage volume | |
Shohdy et al. | Fault tolerant frequent pattern mining | |
Shahzad | Efficient Application-level Fault Tolerance Methods for Large Scale HPC Applications | |
Popescu et al. | An application-assisted checkpoint-restart mechanism for Java applications | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 746 | Register noted 'licences of right' (sect. 46/1977) | Effective date: 20170919 |