WO1997000476A1 - Persistent state checkpoint and restoration systems
- Publication number
- WO1997000476A1 (PCT/US1995/007629)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- state
- checkpoint
- restoration
- persistent
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1405—Saving, restoring, recovering or retrying at machine instruction level
- G06F11/1407—Checkpointing the instruction stream
Definitions
- the present invention is related to the following International Patent Application "Checkpoint and Restoration Systems for Execution Control," International Application No , filed contemporaneously herewith (Attorney Docket No Chung 2-9-6-8-5), assigned to the assignee of the present invention and incorporated herein by reference.
- the present invention relates to a system for checkpointing and restoring the state of a process, and more particularly, to systems for checkpointing and restoring the process state, including lazy checkpoints of the persistent process state, or any specified portion thereof.
- the Purify™ software testing tool provides a system for detecting memory access errors and memory leaks.
- the Purify™ system monitors the allocation and initialization status for each byte of memory.
- the Purify™ system performs a test to ensure that the program is not writing to unallocated memory, and is not reading from uninitialized or unallocated memory.
- checkpointing and restoration techniques have been proposed to recover more efficiently from hardware and software failures.
- For a general discussion of checkpointing and rollback recovery techniques, see R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 23-31 (Jan. 1987).
- checkpoint and restoration techniques periodically save the process state during normal execution, and thereafter restore the saved state following a failure. In this manner, the amount of lost work is limited to the progress made by the user application process since the restored checkpoint.
- the state of a process includes the volatile state as well as the persistent state.
- the volatile state includes any process information that would normally be lost upon a failure.
- the persistent state includes all user files that are related to the current execution of the user application process. Although the persistent state is generally not lost upon a failure, it is necessary to restore the persistent state to the same point as the restored volatile state, in order to maintain data consistency.
- although checkpointing and restoration techniques perform effectively in many application environments, they suffer from a number of limitations which, if overcome, could expand the consistency and transparency of checkpointing systems and extend their utility to other applications which heretofore have not been considered.
- checkpointing and restoration techniques have exploited the advantages of checkpointing and recovery outside of a failure recovery context.
- a checkpoint and restoration system is provided to implement checkpoint and restoration techniques in a user application process to save the process state during normal execution, and thereafter restore the saved state, for example, during a recovery mode following a failure.
- the checkpoint and restoration system performs checkpoints of both the volatile and persistent states.
- the checkpointing of the persistent state consists of a lazy checkpoint technique which delays the taking of the persistent state checkpoint until an inconsistency between the checkpointed volatile state and one or more user files is about to occur.
- the checkpoint and restoration system allows a user or a user application process to specify selected portions of the persistent state to be excluded from a checkpoint.
- a desired intermediate state can be checkpointed and used as a starting point for executing new processing tasks. For example, if a user application process requires a long initialization process, and then utilizes the same initialized state to process different inputs, the input files can be excluded from the checkpoint and the initialized state can be checkpointed. Thereafter, the checkpointed initialized state can be restored to execute the processing task for each new set of inputs.
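The long-initialization scenario above can be sketched in a few lines. This is only an in-memory analogy, assuming Python and `copy.deepcopy` as a stand-in for the checkpoint and restore steps; `long_initialization`, `process` and `run_from_checkpoint` are hypothetical names and are not functions of the checkpoint and restoration library described here.

```python
import copy


def long_initialization():
    # Stand-in for an expensive setup phase (hypothetical).
    return {"lookup": {n: n * n for n in range(1000)}, "runs": 0}


def process(state, inputs):
    # The processing task mutates the state while it works on one input set.
    state["runs"] += 1
    return sum(state["lookup"][x] for x in inputs)


def run_from_checkpoint(checkpoint, input_sets):
    results = []
    for inputs in input_sets:
        # "Restore" the initialized state before each new set of inputs.
        state = copy.deepcopy(checkpoint)
        results.append(process(state, inputs))
    return results


# Initialize once, checkpoint the initialized state, then reuse it.
# The inputs are passed in separately, i.e. excluded from the checkpoint.
checkpoint = copy.deepcopy(long_initialization())
results = run_from_checkpoint(checkpoint, [[1, 2, 3], [10, 20]])
```

Because each run starts from a fresh copy of the checkpoint, changes made while processing one input set never leak into the next run.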
- a user application process can process future inputs from a desired or predictable state.
- the present checkpoint and restoration system can be utilized to exclude the entire persistent state, in other words, all user files, from the portion of the process state that is checkpointed.
- a further feature of the invention provides a method for checkpointing and restoring a user application process executing on a computer system, where the user application process has a process state which includes a volatile state and a persistent state that is comprised of one or more user files.
- the method comprises the steps of: checkpointing the volatile state at a checkpoint position; monitoring the persistent state to detect a file operation following the checkpoint position that will modify the persistent state; checkpointing at least the portions of the persistent state to be modified if the monitoring step detects that a modification of the persistent state is about to be performed; restoring the process state to the checkpoint position, whereby the modifications to the persistent state since the checkpoint position are undone; and resuming execution of the user application process from the checkpoint position.
- Yet another feature of the invention provides a method for restoring an initialized state associated with a user application process, where the user application process has a process state and performs a processing task based on the initialized state for at least two sets of input files.
- the method comprises the steps of: (a) initializing the user application process to form an initialized state; (b) specifying one or more input files to be excluded from a checkpoint of the process state; (c) checkpointing the portions of the process state which have not been excluded;
- FIG. 1 is a schematic block diagram illustrating a checkpointing and restoration system according to the present invention
- FIG. 2 illustrates an execution graph for a user application process and demonstrates volatile checkpoints, persistent checkpoints and process migration to an alternate machine;
- FIG. 3 illustrates an interception routine which monitors file system calls between a user application process and the operating system to detect changes to the persistent state which will create an inconsistency between the persistent and volatile states;
- FIG. 4 illustrates a persistent checkpoint table that maintains persistent state checkpoint information for each file modified since the latest volatile checkpoint;
- FIG. 5 is a flowchart describing an exemplary pre-execution checkpoint subroutine which is invoked before execution of a user application process;
- FIG. 6 is a flowchart describing an exemplary volatile state checkpoint subroutine which is invoked to checkpoint the volatile state;
- FIG. 7 is a flowchart describing an exemplary implementation of the file system call interception subroutine of FIG. 3, which is invoked to checkpoint a user file before a modification will create an inconsistency between the volatile and persistent states;
- FIGS. 8A and 8B, collectively, are a flowchart describing an exemplary restoration subroutine which is utilized to restore the process state to an identified checkpoint with a return value that may control post-restoration processing;
- FIG. 9 is a flowchart describing an exemplary clean-up subroutine which may be invoked following execution of the user application process;
- FIG. 10 illustrates a sample source code file which incorporates features of the present invention to bypass premature software exits caused by an out-of-resource condition;
- FIG. 11 is a flowchart describing an exemplary bypass long initialization routine which incorporates features of the present invention to checkpoint the initialized state and restore the process state to the initialized state for additional sets of input files and parameters;
- FIG. 12 is a flowchart describing an exemplary memory rejuvenation subroutine which incorporates features of the present invention to checkpoint a clean memory state and restore the process state to the clean memory state.
- checkpoint and restoration system 10 is shown in FIG. 1.
- the checkpoint and restoration system 10 allows checkpoint and restoration techniques to be implemented in user application processes in order to save the process state during normal execution, and thereafter restore the saved state, for example, during a recovery mode following a failure. In this manner, the amount of work that is lost by the application process is limited to what has been produced since the latest checkpoint.
- the checkpoint and restoration system 10 disclosed herein may be implemented on a processing node 20, such as a minicomputer, workstation or other general-purpose computing device, having at least one processing unit 25 and a memory storage device 30.
- the processing unit 25 and memory storage device 30 of the processing node 20 may be interconnected by a bus 60, or inter-process communication (IPC) facilities on the local processing node 20 for intra-node communication, in a known manner.
- each node 20 may be interconnected with other nodes or a remote centralized recovery coordinator (not shown) by means of a network interface 70, such as an ATM host adapter card commercially available from Fore Systems, Inc.
- the processing unit 25 may be embodied as a single processor, or a number of processors operating in parallel.
- the memory storage device 30, which is typically an area of unstable volatile memory, is operable to store one or more instructions which the processing unit 25 is operable to retrieve, interpret and execute.
- the volatile memory storage device 30 will store the software code associated with each user application process, such as the process 40, being executed by the processing unit 25, as well as one or more checkpoint library functions 50 which are invoked by the user process 40.
- the volatile memory storage device 30 will include a data segment section 55 for storing the data associated with the respective user application process 40 and the checkpoint and restoration library functions 50, in a known manner.
- the checkpoint library functions 50 which are invoked by the user application process 40 are selected from a checkpoint and restoration library 150, which may be stored locally or on a centralized file system, such as the file system 120.
- a file system such as the file system 120, provides a centralized repository for storing files which may be accessed by one or more users.
- a centralized file system 120 is an area of non- volatile or persistent memory, which can retain information even in the absence of power.
- the functions contained in the checkpoint and restoration library 150 are user-level library functions preferably written in a high level programming language, such as the C programming language.
- the functions in the checkpoint and restoration library 150 can be invoked by a user application process to save the process state during normal execution, or to restore the saved state, for example, during a recovery mode following a failure.
- the user process 40 which invokes a function from the checkpoint and restoration library 150 will be bound together with the code of the invoked function during compilation or by a dynamic linking process.
- the checkpoint and restoration library 150 includes a pre- execution checkpoint subroutine 152, discussed further below in conjunction with FIG. 5, which is invoked before execution of a user application process.
- the checkpoint and restoration library 150 includes a volatile state checkpoint subroutine 154, discussed further below in conjunction with FIG. 6, which, when invoked by a user application process 40, will store a copy of the volatile state from the volatile memory 30 in an area of nonvolatile memory, such as on a disk 100.
- the checkpoint disk 100 may reside locally on the processing node 20, or on a remote node of a communication network.
- the checkpoint and restoration library 150 preferably includes a file system call interception subroutine 156, discussed below in conjunction with FIGS. 3 and 7, which provides a lazy technique for checkpointing desired portions of the persistent state.
- the library 150 also includes a restoration subroutine 158 which is invoked to restore a user application process to a desired checkpoint, as discussed in conjunction with FIGS. 8A and 8B.
- the restoration subroutine 158 provides a mechanism for allowing a user to specify one or more user files to be excluded from the persistent state checkpoint, which allows a user application process to process future inputs from a desired or predictable state.
- the checkpoint and restoration library 150 includes a clean-up subroutine 160 which is invoked following execution of a user application process to delete the created checkpoint files, if necessary.
- the restoration subroutine 158 may be initiated automatically upon a detected fault or manually by a user, for example, by a command line entry, as would be apparent to a person of ordinary skill.
- each node, such as the node 20, may have a watchdog 80 which includes an error detection monitor 85 for monitoring processes that are executing on the respective node.
- the error detection monitor 85 will continuously monitor one or more application processes executing on the node 20, such as process 40, to determine whether the process is hung or has crashed.
- the monitoring performed by the error detection monitor 85 may be either active or passive.
- the watchdog 80 may poll each monitored application process to determine its condition by periodically sending a message to the process using the Inter Process Communication (IPC) facilities on the local node 20 and evaluating the return value to determine whether that process is still active.
- each application process includes a function from the library 150, which, when invoked by a user application process, such as the process 40, will send a heartbeat message at specified intervals to the watchdog 80, indicating that the associated process 40 is still active. If the watchdog 80 does not receive another signal from the application process 40 before the end of the specified interval, the watchdog 80 will presume that the application process is hung or has crashed.
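The passive (heartbeat) monitoring described above can be sketched as follows. This is a minimal illustration in Python, assuming timestamps are passed in explicitly so the behavior is deterministic; the `Watchdog` class and its method names are hypothetical and are not taken from the patent.

```python
class Watchdog:
    """Tracks the last heartbeat seen from each monitored process."""

    def __init__(self, interval):
        self.interval = interval  # maximum allowed gap between heartbeats
        self.last_seen = {}       # process id -> time of latest heartbeat

    def heartbeat(self, pid, now):
        # Called on behalf of a monitored process to report it is still active.
        self.last_seen[pid] = now

    def presumed_failed(self, now):
        # A process is presumed hung or crashed once the interval has elapsed
        # without a new heartbeat.
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.interval
        )


watchdog = Watchdog(interval=5.0)
watchdog.heartbeat(40, now=0.0)
watchdog.heartbeat(41, now=0.0)
watchdog.heartbeat(40, now=4.0)              # process 40 reports again in time
overdue = watchdog.presumed_failed(now=6.0)  # process 41 stayed silent
```

In a real deployment the list of overdue processes would be handed to whatever component restarts them from their latest checkpoint.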
- upon detection of a fault in a user application process 40 by the error detection monitor 85, a restart subsystem 90 will attempt to recover the faulty application process by initiating a restart of the faulty application process, at its latest checkpoint, in the manner described below.
- the restart subsystem 90 will invoke the restoration subroutine 158, upon a detected failure, to initiate the restarting of the faulty user application process.
- For a general discussion of checkpoint and recovery concepts and definitions, see, for example, Yi-Min Wang et al., "Progressive Retry Technique for Software Error Recovery in Distributed Systems", Proc. of 23d IEEE Conf. on Fault-Tolerant Computing Systems (FTCS), pp. 138-144 (June, 1993); or R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 23-31 (Jan. 1987), each incorporated herein by reference.
- checkpoint and restoration techniques save the process state from time to time during normal program execution, and thereafter restore the saved state, for example, following a failure, in order to minimize the amount of lost work.
- FIG. 2 represents the execution of a user application process, such as the process 40.
- checkpoints of the volatile state are invoked, such as the volatile checkpoints VC1, VC2 and VC3.
- the term volatile state includes any information that would normally be lost upon a failure, such as the program stack, open file descriptors, static and dynamic data segments, as well as data structures associated with the operating system kernel that are essential to current program execution, such as operating system registers, the program counter and the stack pointer.
- when a user application process 40 attempts to perform a file operation that will modify the persistent state, such as one or more attributes of a user file, the affected files will be checkpointed in a manner described below, before the desired file operation is executed, as indicated by the persistent checkpoints PC3' and PC3''.
- the term persistent state includes all user files that are related to the current execution of the user application process. Although the persistent state is generally not lost upon a failure, the persistent checkpoints ensure that when a process rolls back to its latest volatile checkpoint, for example, upon a detected failure, the persistent state will be consistent with the volatile state.
- the persistent state is preferably not recorded until an update to a given file will make the file inconsistent with the volatile state associated with the latest checkpoint.
- the persistent checkpoints, PC3' and PC3'', allow all of the modifications to the persistent state since the latest volatile checkpoint to be undone.
- the volatile state of the process can be rolled back to the latest volatile checkpoint, VC3, by restoring the checkpoint data associated with the checkpoint, VC3.
- the persistent checkpoints, PC3' and PC3'', allow each of the modifications to the persistent state since the latest volatile checkpoint, VC3, to be undone.
- the overall persistent state will be consistent with the volatile state as it existed at the time of the latest volatile checkpoint, VC3. It is again noted that if the process cannot be successfully restarted on machine A, process migration allows the process to be restarted on an alternate machine, such as machine B, as shown in FIG. 2.
- the persistent state includes all user files that are related to the current execution of the user application process.
- the only way in which a user application process can access, and thus potentially alter, a user file is by means of a file system call sent to the operating system kernel.
- if each file system call generated by a user application process is intercepted and evaluated by the checkpoint and restoration system 10, all of the potential changes to the persistent state may be identified.
- all file system calls generated by a user application process, such as the process 40, are preferably intercepted and monitored by one or more interception routines 156, discussed below in conjunction with FIG. 7, before the desired file operation is actually performed by the operating system 300. In this manner, if a particular file operation is about to alter one or more files associated with the persistent state, the status of the affected files can be recorded to ensure consistency.
- the persistent state checkpoints are recorded in a persistent checkpoint table 400, shown in FIG. 4.
- the persistent checkpoint table 400 is preferably stored in a persistent memory, such as a disk, and is stored to the disk each time the table 400 is modified.
- Each persistent checkpoint table 400 is preferably associated with a particular user application process, as well as a particular volatile checkpoint, identified by its checkpoint id, and includes a plurality of rows, such as the rows 405 and 410, each associated with a user file which has been modified in some manner since the associated volatile checkpoint.
- the persistent checkpoint table 400 preferably includes an entry for each file attribute that may be modified.
- the persistent checkpoint table 400 preferably contains a column 435 for recording the modification time of each file, a column 440 for recording the access mode of each file, and a column 445 for recording the current size of each file.
- each entry in the table 400 is preferably initialized with a default value, such as "-1", when a row is created for a given file. Thereafter, if one or more attributes of the associated file are modified, the then current attribute value can be recorded before being modified. In this manner, if a given attribute of a file being restored contains a value of "-1", the particular attribute has not been modified and need not be restored.
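The table layout and the "-1" default can be illustrated with a small data structure. This is a sketch in Python under stated assumptions: the class, field and method names are hypothetical, only the three attributes named above are modeled, and keeping the first recorded value (rather than overwriting it) is an assumption made so that the value as of the volatile checkpoint survives repeated modifications.

```python
from dataclasses import dataclass

UNMODIFIED = -1  # default entry value: attribute not changed since the checkpoint


@dataclass
class Row:
    file_name: str
    mtime: int = UNMODIFIED  # modification time (column 435)
    mode: int = UNMODIFIED   # access mode (column 440)
    size: int = UNMODIFIED   # file size (column 445)


class PersistentCheckpointTable:
    def __init__(self, checkpoint_id):
        self.checkpoint_id = checkpoint_id  # associated volatile checkpoint
        self.rows = {}                      # file name -> Row

    def record(self, file_name, attribute, current_value):
        # Create the row on first use, with every entry set to the default.
        row = self.rows.setdefault(file_name, Row(file_name))
        # Keep only the first recorded value, i.e. the pre-modification one.
        if getattr(row, attribute) == UNMODIFIED:
            setattr(row, attribute, current_value)

    def attributes_to_restore(self, file_name):
        # Entries still at the default were never modified and are skipped.
        row = self.rows[file_name]
        return {
            name: getattr(row, name)
            for name in ("mtime", "mode", "size")
            if getattr(row, name) != UNMODIFIED
        }


table = PersistentCheckpointTable(checkpoint_id=3)
table.record("a.dat", "size", 100)  # size before the first modification
table.record("a.dat", "size", 250)  # later change: original value is kept
pending = table.attributes_to_restore("a.dat")
```

Only the size appears in `pending`; the modification time and access mode still hold the default and need no restoration.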
- entries are preferably created in the persistent checkpoint table 400 by the file system call interception subroutine 156. In addition, as discussed below, the restoration subroutine 158 will access the persistent checkpoint table 400 and utilize the information contained therein to restore the persistent state.
- the checkpoint and restoration library 150 preferably includes a pre-execution checkpoint subroutine 152.
- the pre-execution checkpoint subroutine 152 is preferably executed before execution of the user application process 40. For example, programs written in the C programming language normally begin execution on the first line with a "main" routine. Accordingly, execution of the pre-execution checkpoint subroutine 152 should be invoked before execution of the "main" routine.
- the checkpoint and restoration system 10 preferably provides two modes of operation for performing checkpoints, namely, an inserted mode and a transparent mode.
- the inserted mode allows a user application process to implement a checkpoint mechanism by inserting checkpoint function calls at desired places in the source code.
- the transparent mode provides a mechanism for automatically performing checkpoints at specified time intervals.
- the transparent mode allows a user application process to incorporate checkpointing mechanisms without requiring any changes to the source code of the user application process or any recompilation.
- a clock daemon process is preferably created by the pre-execution checkpoint subroutine 152 in order to initiate checkpoints at the predefined intervals.
- the created clock daemon process upon each specified interval, will cause a system interrupt call to be transmitted by the operating system to the associated user application process in order to initiate the checkpoint.
- the pre-execution checkpoint subroutine 152 will be entered at step 500 and will thereafter initialize any data structures required by the checkpoint and restoration system 10 during step 505, such as the open file table and the persistent checkpoint table 400.
- a test is performed during step 520 to determine if the user application process is executing in an inserted mode or a transparent mode, for example, as specified by the user on the command line, or by the setting of one or more environmental variables. If it is determined during step 520 that the user application process is executing in a transparent mode, then the clock daemon process is created during step 525, for example, by a fork system call. As indicated above, the clock daemon process will serve as the checkpoint timer to initiate checkpoints of the user application process at the specified interval. In one embodiment, the checkpoint will be initiated at a default interval, such as every thirty (30) minutes, if another interval is not specified.
- in the inserted mode, checkpoints will only be initiated when they are invoked by execution of the user application process.
- a test is performed during step 540 to determine if a valid checkpoint file already exists for the associated user application process. In other words, the test determines whether the current execution is a normal execution mode or a recovery mode. It is noted that when a user application process terminates normally, a clean-up subroutine 160, discussed below in conjunction with FIG. 9, will delete the checkpoint files associated with the user application process, unless specified otherwise. Thus, if a checkpoint file exists for a user application process upon initiation, either the previous execution did not terminate normally, for example, due to a failure, or the user application process has requested that the checkpoint file should be stored for subsequent restoration.
- If it is determined during step 540 that a valid checkpoint file exists for the associated user application process, the pre-execution checkpoint subroutine 152 will preferably return and execution of the restoration subroutine 158, discussed below in conjunction with FIGS. 8A and 8B, will preferably commence during step 550, in order to restore the data associated with the existing checkpoint file and commence execution of the user application process at the point of the restored checkpoint.
- If, however, it is determined during step 540 that a valid checkpoint file does not exist for the associated user application process, then the pre-execution checkpoint subroutine 152 will preferably return and execution of the user application process is preferably initiated during step 560.
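The decisions made in steps 520 through 560 can be summarized in a short sketch. It is written in Python for illustration and rests on several assumptions: the environment variable names `CKPT_MODE` and `CKPT_INTERVAL_MINUTES` are invented here, the inserted mode is treated as the default, and "valid checkpoint file" is reduced to a plain existence check.

```python
import os
import tempfile

DEFAULT_INTERVAL_MINUTES = 30  # default interval mentioned in the text


def plan_startup(env, checkpoint_file):
    # Step 520: inserted or transparent mode (variable names are hypothetical).
    mode = env.get("CKPT_MODE", "inserted")
    interval = None
    if mode == "transparent":
        # Step 525 would start the clock daemon with this interval.
        interval = int(env.get("CKPT_INTERVAL_MINUTES", DEFAULT_INTERVAL_MINUTES))
    # Step 540: an existing checkpoint file means recovery (step 550),
    # otherwise normal execution (step 560).
    next_step = "restore" if os.path.isfile(checkpoint_file) else "run"
    return {"mode": mode, "interval_minutes": interval, "next_step": next_step}


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "app.ckpt")
    fresh = plan_startup({"CKPT_MODE": "transparent"}, path)
    open(path, "w").close()  # simulate a checkpoint left by an earlier run
    recovering = plan_startup({}, path)
```

The first call sees no checkpoint file and plans a normal run with the default timer; the second call finds the file and plans a restoration instead.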
- the checkpoint and restoration library 150 preferably also includes a volatile state checkpoint subroutine 154, which is invoked in the transparent mode by an interrupt signal from the clock daemon, indicating that a checkpoint should be initiated, or in the inserted mode when a checkpoint function call inserted in the source code of the user application process is being executed.
- the volatile state checkpoint subroutine 154 is indirectly invoked at step 620 from the restoration subroutine 158 after the value of the program counter is restored.
- the volatile state checkpoint subroutine 154 will save all of the information that would otherwise be lost upon a failure that is needed to restore the user application process.
- the volatile state checkpoint subroutine 154 is passed a checkpoint id argument that may be utilized to identify each checkpoint interval. If the volatile state checkpoint subroutine 154 is not passed a checkpoint id argument, the previous checkpoint data is preferably overwritten. Since the checkpoint id argument is preferably a global variable, it can be subsequently accessed by the file system call interception subroutine 156, which implements checkpoints of the persistent state, in order to associate the persistent state checkpoints with the appropriate (current) volatile checkpoint.
- some of the volatile information related to current execution of a user application process, such as hardware registers for temporary storage of values in the central processing unit, the stack pointer and the program counter, is maintained by the operating system kernel. Although these memory elements are not normally accessible by user application processes, the operating system will typically provide one or more routines which allow the operating system information required by a particular user application process to be checkpointed. Thus, the routine provided by the operating system for performing this task is preferably executed during step 610 in order to save the contents of the registers, the stack pointer and the program counter.
- the Unix operating system provides a setjmp call which accesses and saves these operating system data structures to one or more declared global data structures, which can then be checkpointed as part of the volatile state. For a detailed discussion of the operation of the setjmp system call, see W.R. Stevens, Advanced Programming in the Unix Environment, pp. 174-180 (Addison Wesley, 1992), incorporated by reference herein.
- program control will then proceed to step 620.
- upon restoration of a desired checkpoint by the restoration subroutine 158, the value of the program counter will be restored to the value associated with the restored checkpoint.
- the restoration subroutine 158 will jump to a position immediately after execution of step 610.
- the restoration subroutine 158 will provide a return value, preferably greater than 0, that may be utilized to control the flow of execution following the restoration. For example, certain code may be executed for one predefined return value, while a different sequence of code should be executed for another predefined return value.
- a test is performed during step 620 to determine if the return value from the operating system routine, such as the setjmp system call, indicates a value of 0.
- the restoration subroutine 158 allows a return value greater than 0 to be utilized in a recover mode.
- if the return value is not 0, the current execution of the volatile state checkpoint subroutine 154 has been invoked from the restoration subroutine 158 in a recover mode and program control should proceed directly to step 670 without performing any checkpointing.
- the file descriptors for all of the files open at the time of the volatile checkpoint are preferably stored in an open file table during step 630, together with the associated file name and the current position of the file.
- the open file table will include the file descriptor, file name and position of each open file.
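The contents of the open file table can be illustrated with a small sketch. It assumes Python on a POSIX-like system and uses `os.lseek` with a zero offset to read the current position of each descriptor; the function `snapshot_open_files` and the dictionary layout are hypothetical, and the caller is assumed to already know which descriptors are open and under which names.

```python
import os
import tempfile


def snapshot_open_files(open_files):
    """Build an open file table from a mapping of descriptor -> file name."""
    return [
        {
            "fd": fd,
            "name": name,
            # Seeking by 0 from the current position reports the offset
            # without moving it.
            "position": os.lseek(fd, 0, os.SEEK_CUR),
        }
        for fd, name in sorted(open_files.items())
    ]


with tempfile.TemporaryDirectory() as workdir:
    name = os.path.join(workdir, "data.txt")
    fd = os.open(name, os.O_CREAT | os.O_RDWR)
    os.write(fd, b"hello")  # advances the file position to 5
    open_file_table = snapshot_open_files({fd: name})
    os.close(fd)
```

On restoration, each file could be reopened by name and repositioned with `os.lseek(fd, position, os.SEEK_SET)`.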
- the data segment associated with the user application process, including all dynamically and statically allocated memory, such as global and static variables, including the open file table, is preferably saved during step 640.
- the current contents of the stack are preferably saved during step 650.
- Execution of the volatile state checkpoint subroutine 154 will terminate during step 670 and thereafter return with the indicated return value.
- if the volatile state checkpoint subroutine 154 returns with a value of zero, this serves as an indication that the checkpoint has been successfully taken.
- if the volatile state checkpoint subroutine 154 returns with a value greater than zero, this serves as an indication that execution is returning indirectly from the restoration subroutine 158 with a return value that may be utilized to control the flow of execution.
- the checkpoint and restoration library 150 includes a file system call interception subroutine 156 which implements persistent state checkpoints.
- the file system call interception subroutine 156 intercepts those file system calls that may modify certain attributes of one or more files, and if necessary, performs a lazy checkpoint of the portions of the persistent state that are about to be modified.
- the file system call interception subroutine 156 performs the persistent state checkpoint before actually executing the requested file operation.
- the file system call interception subroutine 156 preferably only performs checkpoints of the persistent state to the extent necessary.
- the file system call interception subroutine 156 will be entered at step 700 upon receipt of each intercepted file system call.
- a test is performed during step 710 to determine if the intercepted file operation modifies a file attribute that should initiate the taking of a checkpoint. If it is determined during step 710 that the intercepted file operation does not modify a file attribute that should initiate the taking of a checkpoint, then program control should proceed to step 750 to perform the desired file operation in the manner described below.
- If, however, it is determined during step 710 that the intercepted file operation does modify a file attribute that should initiate the taking of a checkpoint, then a test is performed during step 720 to determine if the user has specified that the current file should be excluded from the checkpoint, for example, by means of executing a function call, entering a command line argument, or by the setting of an environmental variable. In this manner, a user or user application process can selectively specify, on a per-file basis, whether or not given files should be included in the persistent state checkpoint.
- If it is determined during step 720 that the current file should be excluded from the checkpoint, then program control should proceed to step 750 to perform the desired file operation in the manner described below. If, however, it is determined during step 720 that the current file should not be excluded from the checkpoint, then a test is performed during step 730 to determine if the file has already been checkpointed since the latest volatile checkpoint, identified by the current value of the global variable, checkpoint id. If it is determined during step 730 that the file has already been checkpointed since the latest volatile checkpoint, then program control should proceed to step 750 to perform the desired file operation in the manner described below.
- otherwise, the file should be checkpointed during step 740 by creating a shadow copy of the file and adding the file name and previous values of modified attributes to the persistent checkpoint table 400 associated with the current value of the checkpoint id.
- the persistent state checkpoint can be further optimized by checkpointing each file on a per-attribute basis, and only checkpointing those attributes which are affected by the current file system call.
- If a file operation affects only a subset of all attributes, then only the affected subset of attributes needs to be checkpointed before the file operation is performed during step 750.
- For example, if a write system call only appends data at the end of an existing file, it suffices to checkpoint the file size without checkpointing any file content, since the content of the file as it existed at the volatile checkpoint is not altered. Upon restoration, the file can then be truncated to the appropriate size.
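The append-only optimization can be illustrated with a minimal C sketch. It assumes a POSIX system; the function names `checkpoint_size` and `restore_size` are hypothetical and are not taken from the patent. Only the file size is recorded before the append, and restoration truncates the file back to that size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: record the size attribute of a file before an
 * append-only write. Returns -1 if the file cannot be examined. */
off_t checkpoint_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;
}

/* Hypothetical helper: undo any appends made since the size was recorded
 * by truncating the file back to the checkpointed size. */
int restore_size(const char *path, off_t saved_size)
{
    assert(path != NULL && saved_size >= 0);
    return truncate(path, saved_size);
}
```

No file content is copied in this case, which is why the per-attribute checkpoint is cheaper than a full shadow copy when a file is only appended to.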
- the desired file operation may be performed during step 750. Since the persistent state checkpoint is recorded before the file operation is performed, the information stored in the persistent checkpoint table 400 can be used to undo any modifications made to each user file since the latest volatile checkpoint.
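The decision sequence of steps 710 through 750 can be sketched in C as follows. This is an in-memory illustration only: the structure names, the fixed table size, and the function `intercept_file_op` are assumptions made for the example, and the shadow copy itself is represented by a table entry rather than an actual file copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ENTRIES 16

/* One row of a hypothetical persistent checkpoint table. */
struct ckpt_entry {
    char name[64];
    int  checkpoint_id;   /* volatile checkpoint the shadow copy belongs to */
};

struct ckpt_table {
    struct ckpt_entry entries[MAX_ENTRIES];
    int count;
};

/* Step 730: has the file been checkpointed since the latest volatile checkpoint? */
static bool already_checkpointed(const struct ckpt_table *t, const char *name, int id)
{
    for (int i = 0; i < t->count; i++)
        if (t->entries[i].checkpoint_id == id &&
            strcmp(t->entries[i].name, name) == 0)
            return true;
    return false;
}

/* Returns true when a checkpoint entry was recorded (step 740) and false when
 * the checkpoint was skipped. In either case the caller would then perform the
 * requested file operation (step 750). */
bool intercept_file_op(struct ckpt_table *t, const char *name, int checkpoint_id,
                       bool modifies_attr, bool excluded)
{
    assert(t != NULL && name != NULL);

    if (!modifies_attr)
        return false;                                   /* step 710 */
    if (excluded)
        return false;                                   /* step 720 */
    if (already_checkpointed(t, name, checkpoint_id))
        return false;                                   /* step 730 */
    if (t->count == MAX_ENTRIES)
        return false;                                   /* table full: limit of this sketch */

    /* Step 740: a real system would create the shadow copy of the file here. */
    struct ckpt_entry *e = &t->entries[t->count++];
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->checkpoint_id = checkpoint_id;
    return true;
}
```

Because the entry is keyed on the checkpoint identifier, a second modification of the same file under the same volatile checkpoint is skipped, which is the "lazy" property described above.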
- the checkpoint and restoration library 150 includes a restoration subroutine 158, shown in FIGS. 8A and 8B, which is invoked when an application process is restarted with a valid checkpoint, for example, by the watchdog 80 following a detected failure, or when a rollback(i) function call has been inserted in the source code associated with the user application process.
- the term rollback indicates a restoration initiated by a user or user application process.
- the term recovery indicates a restoration following a failure with a valid checkpoint file.
- the restoration subroutine 158 is passed the following arguments: a mode value indicating whether current execution is a recovery mode or a rollback mode, a checkpoint_id value, a return value to be retained and returned following execution of the restoration subroutine 158, and a list of protected variables which should maintain their pre-restoration values, even after the process has been restored to a checkpoint. It is noted that if a checkpoint_id value is not specified, the process is preferably restored to the latest checkpoint. In addition, if a return value is not specified, a positive return value, such as one, is preferably utilized.
- the restoration subroutine 158 serves to restore the volatile and persistent states associated with the indicated checkpoint. As discussed below, the restoration subroutine 158 ensures consistency between the volatile and persistent states by restoring the volatile checkpoint and by undoing modifications made to the persistent state since the restored volatile checkpoint.
- a return value and protected variables array may be specified.
- the current value of the variables indicated by the protected variables array, as well as the current value of the return value variable, are protected when the restoration subroutine 158 rolls back to the indicated checkpoint.
- the protected variables mechanism can be utilized to specify the variables which should maintain their current values following the restoration. If a return value is not specified, a default value of 1 is preferably utilized.
- the restoration subroutine 158 will be entered at step 800. Thereafter, the persistent checkpoint table 400 (FIG. 4) associated with the indicated value of the checkpoint_id argument will be read during step 810. A test is performed during step 815 to determine if the user has indicated that the persistent checkpoint table 400 should be modified, for example, by command line entries or by the setting of environment variables indicating that one or more shadow files listed in the persistent checkpoint table 400 should not be restored. If it is determined during step 815 that the user has indicated that the persistent checkpoint table 400 should be modified, then the table 400 is modified during step 820 in accordance with the indicated modifications.
- the persistent state is restored during step 825 in accordance with the persistent checkpoint table 400 by searching the appropriate checkpoint data for the shadow file corresponding to each file listed in the table 400 and copying the shadow file onto the current file.
- the attributes associated with each file listed in the persistent checkpoint table 400 are modified in accordance with the values recorded in the respective entries in the table 400.
- a test is performed during step 830 to determine if the current execution mode of the restoration subroutine 158 is a recovery mode, following a failure, or a user-initiated rollback mode, and if the values in the protected variables array are valid. If it is determined during step 830 that the current execution mode of the restoration subroutine 158 is a rollback mode, and that the values in the protected variables array are valid, then the variables specified by the protected variables array are copied during step 835 from the data segment to a temporary file, in order to protect these variables while the checkpointed data segment is restored.
- the volatile checkpoint file identified by the checkpoint id argument is read during step 840.
- the data segment, including the open file table, is then restored during step 845.
- a test is again performed during step 850 to determine if the current execution mode of the restoration subroutine 158 is a rollback mode, and if the values in the protected variables array are valid. If it is determined during step 850 that the current execution mode of the restoration subroutine 158 is a rollback mode, and that the values in the protected variables array are valid, then the variables specified by the protected variables array are copied during step 855 from the protected position in the temporary file back to the data segment. In this manner, each of the variables identified in the protected variables array will maintain its pre-restoration value.
- a test is performed during step 865 to determine whether the user has indicated, for example, by a command line entry or by the setting of one or more environment variables, that the open file table should be modified. If it is determined during step 865 that the user has indicated that the open file table should be modified, then the indicated modification is implemented during step 870. For example, in one application incorporating features of the present invention, described below in a section entitled Bypassing Long Initialization, the restored open file table will list a first set of input files that have been previously processed. For each subsequent set of inputs to be processed, the open file table should be modified to replace the first set of input files with the set of input files appropriate for the current execution.
- the file descriptors indicated in the open file table are restored during step 875.
- For each entry, the file is opened, the filename is associated with the indicated file descriptor, and the current position of the file is adjusted to the position recorded in the open file table entry.
- the stack space is preferably allocated during step 880 and the stack is restored during step 885, in accordance with the information in the volatile checkpoint file read during step 840.
- some of the volatile information related to the execution of a user application process, such as the hardware registers for the temporary storage of values in the central processing unit, the stack pointer and the program counter, is maintained by the operating system kernel. Although these memory elements are not normally accessible by user application processes, the operating system will typically provide one or more routines which allow the operating system information required by a particular user application process to be restored. Thus, the routine provided by the operating system for performing this task is preferably executed during step 890 in order to restore the contents of the registers, the stack pointer and the program counter.
- the Unix operating system provides a longjmp call which restores these operating system data structures. For a detailed discussion of the operation of the longjmp system call, see W. R. Stevens, incorporated by reference above.
- the restoration subroutine 158 will effectively return from the volatile state checkpoint subroutine 154.
- the restoration subroutine 158 will return with the indicated return value and with the variables indicated in the protected variables array maintaining their pre-restoration values.
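The interplay of the retained return value, the protected variables, and the longjmp-based resumption can be sketched with standard C `setjmp`/`longjmp`. This is a simplified, single-process illustration and not the patented implementation: the "data segment" is a small hypothetical struct, the checkpoint is held in memory rather than in a checkpoint file, and all names are invented for the example.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-in for the process data segment. */
struct data_segment {
    int work_done;   /* ordinary state: rolled back on restoration */
    int attempts;    /* protected variable: keeps its pre-restoration value */
};

static struct data_segment live;      /* current state */
static struct data_segment saved;     /* in-memory "volatile checkpoint" */
static jmp_buf saved_context;         /* registers, stack pointer, program counter */
static int retained_value;            /* return value retained across restoration */
static int checkpoint_valid;

/* Restore the checkpointed state, keep the protected variable, and resume
 * execution at the checkpoint (compare steps 835, 845, 855 and 890). */
static void rollback(int return_value)
{
    assert(checkpoint_valid);
    int protected_attempts = live.attempts;                 /* set aside */
    retained_value = return_value > 0 ? return_value : 1;   /* default of 1 */
    live = saved;                                           /* restore data */
    live.attempts = protected_attempts;                     /* copy back */
    longjmp(saved_context, 1);                              /* restore context */
}

int run_with_rollback(void)
{
    live.work_done = 0;
    live.attempts = 0;
    retained_value = 0;
    saved = live;               /* take the checkpoint */
    checkpoint_valid = 1;

    if (setjmp(saved_context) != 0) {
        /* Execution is returning indirectly from rollback(). */
        return live.work_done * 100 + live.attempts * 10 + retained_value;
    }

    live.work_done = 5;         /* normal execution after the checkpoint */
    live.attempts = 3;
    rollback(7);
    return -1;                  /* not reached */
}
```

Calling `run_with_rollback()` returns 37: `work_done` is rolled back to 0, the protected `attempts` keeps the value 3, and the retained return value is 7.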
- the checkpoint and restoration library 150 preferably includes a clean-up subroutine 160 which is preferably executed following execution of the user application process. As shown in FIG. 9, the clean-up subroutine 160 is preferably entered at step
- a test during step 910 determines if the current execution mode of the user application process is the transparent mode.
- If it is determined during step 910 that the current execution mode is the transparent mode, then the clock daemon process created by the pre-execution checkpoint subroutine 152 is killed during step 930.
- a test is performed during step 950 to determine if one or more of the checkpoint files associated with the user application process should be maintained. If it is determined during step 950 that the checkpoint files should not be kept, then the checkpoint files associated with the user application process are preferably deleted during step 970. The clean-up subroutine 160 will then terminate execution during step 980.
- a user application process may terminate or exit prematurely because the process
- the process is still under control at the point just before the program exits.
- the term exception condition is defined to be any execution outside the normal execution flow, as defined by the user application process.
- the process will print an error message indicating the "unable to allocate resource" condition, and the program is exited prematurely.
- Such premature software exits are, of course, undesirable, because a lot of useful processing can be wasted, especially for long-running applications.
- the process must be restarted from the beginning or perhaps from the latest checkpoint which may have been taken at a specified interval in a transparent checkpoint mode.
- the checkpoint and restoration system 10 disclosed herein allows a checkpoint function call to be inserted into the source code, just prior to the point where the process will be exited. In this manner, the process state can later be restored to the point just prior to the position associated with the premature exit.
- the return value from the restoration subroutine 158 provides an indication that the current execution is in a recovery mode which may initiate special recovery processing, if desired.
- FIG. 10 illustrates a segment of source code incorporating features of the present invention which may be utilized to bypass a premature software exit, caused, for example, by the failure to allocate dynamic memory.
- the sequence of code indicated in lines 1015 through 1050 is executed for as long as the process is unable to allocate dynamic memory in line 1010.
- the malloc function call which is executed in line 1010 is a memory allocation function, commonly found in function libraries of the C programming language, which allocates a block of memory of the requested size and returns the starting address of the allocated memory to the declared pointer, ptr.
- When the process is unable to allocate the desired dynamic memory, for example, where another process may have exhausted the remaining swap space, the process will retry the allocation until the maximum number of retries, specified by the variable max_retry_count, is exceeded. It is noted that the defined maximum number of retries can be set to zero. Once max_retry_count has been exceeded, a checkpoint is performed during step 1025, before the process exits during step 1035.
- the restoration subroutine 158 (FIGS. 8A and 8B) will be invoked to restore the volatile and persistent states to the latest checkpoint, in other words, the checkpoint that was performed just prior to exiting. It is again noted that when the value of the program counter is restored during execution of the restoration subroutine 158, execution will jump from the restoration subroutine 158 to the volatile state checkpoint subroutine 154. Thus, the restoration subroutine 158 will return from the volatile state checkpoint subroutine 154 with a positive return value, indicating a recovery mode. Thus, in the illustrative example of FIG. 10, the positive return value will cause program control to proceed to line 1040 for execution of recovery code.
- the recovery code consists of resetting the retry count to zero and reattempting to allocate the desired dynamic memory.
- other recovery code could be executed, as would be apparent to a person of ordinary skill.
- the out-of-resource condition may be transient and thus bypassed when the process is restored due to environmental diversity, where the same process is executed under different conditions. If the out-of-resource condition is permanent, however, for example, where the current machine simply does not have enough of the given resource to satisfy the requirements of the user application process, process migration to an alternate processing node having a higher capacity may be necessary to bypass the premature exit.
- the techniques of the present invention allow a process to be started on a workstation and then migrated to another machine having a higher capacity of the desired resource only after the out-of-resource condition is encountered.
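The control flow described for FIG. 10 can be approximated by the following C sketch. It is not the code of FIG. 10: the transient shortage is simulated by a hypothetical allocator `try_alloc` that fails a fixed number of times, the checkpoint is stood in for by `setjmp`, and the exit followed by a restart from the checkpoint is stood in for by `longjmp`.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

static int failures_left;   /* simulated transient out-of-memory condition */
static int restarts;        /* how often the simulated exit and restart occurred */
static int retry_count;
static jmp_buf ckpt;

/* Hypothetical allocator that fails while the simulated shortage lasts. */
static void *try_alloc(size_t n)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(n);
}

int restart_count(void)
{
    return restarts;
}

void *alloc_with_recovery(size_t n, int max_retry_count, int simulated_failures)
{
    void *ptr;

    assert(n > 0 && max_retry_count >= 0 && simulated_failures >= 0);
    failures_left = simulated_failures;
    retry_count = 0;
    restarts = 0;

    while ((ptr = try_alloc(n)) == NULL) {
        if (++retry_count > max_retry_count) {
            if (setjmp(ckpt) == 0) {
                /* Normal mode: the checkpoint was just taken. A real process
                 * would exit here and be restarted later; the restart is
                 * simulated by jumping straight back to the checkpoint. */
                restarts++;
                longjmp(ckpt, 1);
            }
            /* Recovery mode: reset the retry count and try again. */
            retry_count = 0;
        }
    }
    return ptr;
}
```

With `alloc_with_recovery(16, 2, 5)`, the first three failures exhaust the retries, one simulated restart occurs, and the allocation succeeds after two further failures. A permanent shortage would loop forever in this sketch; the text above handles that case by migrating the process to a machine with more capacity.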
- the initialization state associated with a given software program can be checkpointed and restored for subsequent executions on different input data. Those input files which are to be replaced for each different execution, however, can be excluded from the checkpoint to allow new input files to be processed for each new execution.
- a bypassing long initialization routine 1100 is entered at step 1105.
- the bypassing long initialization routine 1100 will initially read a first set of input parameters, for example, from the command line or a data file, during step 1110, which includes a set of input file names. Thereafter, an initialization routine appropriate for the given user application process is performed during step 1115.
- the files which should be excluded from the checkpoint, in other words, those files that are to be replaced for each subsequent execution, are preferably specified during step 1120. Thereafter, the volatile state and the portion of the persistent state which was not specified in the previous step are checkpointed during step 1130.
- When control returns from the checkpoint function, a test is performed during step 1135 to determine if the return value from the checkpoint function is greater than zero, indicating a recovery mode. If it is determined during step 1135 that the return value is not greater than zero, then this is the first execution of the bypassing long initialization routine 1100 and the first set of data should be processed during step 1150 according to the initialized state and the first set of input files and parameters.
- A test is performed during step 1160 to determine if there are additional sets of input files and parameters to be processed. If it is determined during step 1160 that there are additional sets of input files and parameters to be processed, then program control will proceed to step 1170, where the restoration subroutine 158 will be executed with a positive return value. The restoration subroutine 158 will restore the process state to the checkpoint taken during step 1130. It is noted that the open file table which was checkpointed during step 1130 listed each of the input files associated with the first set of inputs. Upon subsequent executions, however, the same set of file descriptors listed in the open file table should be associated with the input files associated with the respective execution. Thus, as indicated above, the restoration subroutine 158 includes a mechanism for allowing the user to alter the open file table to reflect these changes.
- When the process state is restored to the latest checkpoint during step 1170, the program counter will also be restored to the value associated with the checkpoint, and thus, program control will jump to the checkpoint function performed during step 1130.
- When program control returns from the checkpoint function of step 1130 with a positive return value, in the manner described above, the test performed during step 1135 will result in program control proceeding to step 1140.
- the next set of input parameters which includes the list of input file names, will be read during step 1140 for processing during step 1150 in the manner described above, without requiring the initialization routine to be reexecuted.
- If, however, it is determined during step 1160 that there are no additional sets of input files and parameters to be processed, execution of the bypassing long initialization routine 1100 will terminate during step 1180.
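The flow of steps 1110 through 1180 can be sketched in C as shown here. The sketch restores only the execution context with `setjmp`/`longjmp`; it does not restore a data segment or an open file table, and the names `expensive_init` and `process_all` as well as the arithmetic used as the "processing task" are assumptions made for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf after_init;   /* stands in for the checkpoint of step 1130 */
static int init_runs;        /* counts executions of the initialization */
static int next_input;       /* index of the input set being processed */
static int scale;            /* result of the long initialization */
static long total;

/* Hypothetical long-running initialization (step 1115). */
static int expensive_init(void)
{
    init_runs++;
    return 10;
}

long process_all(const int *inputs, int n_inputs, int *init_runs_out)
{
    assert(inputs != NULL && n_inputs > 0 && init_runs_out != NULL);
    init_runs = 0;
    next_input = 0;
    total = 0;

    scale = expensive_init();                    /* step 1115 */

    if (setjmp(after_init) != 0) {               /* steps 1130 and 1135 */
        next_input++;                            /* step 1140: next input set */
    }

    total += (long)scale * inputs[next_input];   /* step 1150 */

    if (next_input + 1 < n_inputs)               /* step 1160 */
        longjmp(after_init, 1);                  /* step 1170 */

    *init_runs_out = init_runs;                  /* step 1180 */
    return total;
}
```

For the inputs `{1, 2, 3}` the function returns 60 while `expensive_init` runs exactly once, which is the saving the routine is meant to provide.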
- a memory leak can arise where a software program, including many successful commercial products, fails to incorporate proper memory deallocation on certain execution paths.
- a memory leak will result in memory spaces that have been allocated but are no longer accessible, because they are no longer pointed to by any pointer.
- a memory leak will typically arise when a pointer that points to a first block of allocated memory is reassigned to point to a second block of allocated memory, without deallocating the first block.
- Memory leaks result in the cumulative degradation of overall performance and in theory, will cause a process to run out of memory over time.
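The pointer-reassignment leak described above can be shown with a short C sketch. The counting wrappers `tracked_malloc` and `tracked_free` are hypothetical helpers introduced only so the leak can be observed; they are not part of any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int live_blocks;   /* blocks allocated but not yet freed */

static void *tracked_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        live_blocks++;
    return p;
}

static void tracked_free(void *p)
{
    if (p != NULL) {
        live_blocks--;
        free(p);
    }
}

/* Returns the number of blocks leaked by the faulty execution path. */
int leaky_path(void)
{
    live_blocks = 0;
    char *buf = tracked_malloc(16);
    char *first = buf;            /* kept only so this demo can clean up */
    assert(buf != NULL);

    buf = tracked_malloc(32);     /* reassigned without freeing the first block */
    tracked_free(buf);

    int leaked = live_blocks;     /* the first block is still outstanding */
    tracked_free(first);
    return leaked;
}

/* Returns the number of blocks leaked when the first block is freed first. */
int fixed_path(void)
{
    live_blocks = 0;
    char *buf = tracked_malloc(16);
    tracked_free(buf);            /* deallocate before reassigning */
    buf = tracked_malloc(32);
    tracked_free(buf);
    return live_blocks;
}
```

In a real leak there is no second pointer such as `first`, so the block can never be freed for the remaining lifetime of the process.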
- the memory caching and weak memory reuse mechanisms provided by some available memory managers can lead to an out-of-memory condition, even where the machine has enough physical capacity to satisfy the demand. For example, if a user application process repeatedly asks for small blocks of memory, such as blocks less than 32 bytes, the memory manager will maintain the small blocks, following deallocation, in a separate list, or memory cache, for anticipated future requests for small memory blocks. Thus, the small blocks are unavailable for larger memory requests. If there have been enough requests for small blocks, a larger memory request will be denied, even where there is enough physical capacity.
- the weak memory reuse mechanism refers to a situation where a machine having, for example, 30 megabytes of memory, first allocates and then deallocates, for example, 15 megabytes of memory.
- a memory rejuvenation subroutine 1200, shown in FIG. 12, checkpoints the memory of a process at a "clean" state as part of the volatile state, and rolls back the process to that clean state from time to time, in order to prevent software failures.
- the memory rejuvenation subroutine 1200 will initially set a loop index, i, to zero during step 1210. Thereafter, an appropriate initialization routine will be performed during step 1215. It is again noted that the initialized state is part of the checkpointed volatile state.
- During step 1220, it is specified that all user files should be excluded from the checkpoint. Thus, when the checkpoint is taken and later restored, only the clean memory state will be restored. Furthermore, by excluding the entire persistent state, in other words, all input files, from the checkpoint, the current contents of the user files will be maintained following the restoration.
- the volatile state is checkpointed during step 1230 by executing the volatile state checkpoint subroutine 154 (FIG. 6). Thereafter, the desired processing task is performed during step 1240 based on the initialized state and the current value of the loop index, i.
- the results of the processing task performed in the previous step are written to the output buffer during step 1245, in a known manner. The contents of the output buffer will not be sent to the intended destination, such as the disk, until the buffer is full or a flush system call is executed.
- a test is performed during step 1250 to determine if there are additional values of the loop index, i, to be processed. If it is determined during step 1250 that there are additional values of the loop index, i, to be processed, then the loop index will be incremented during step 1255. Thereafter, a test is performed during step 1270 to determine if the current value of the loop index, i, is a multiple of the specified rejuvenation period. In other words, if the clean memory state should be restored upon every 15 executions, then a test is performed to determine whether the current value of the loop index is a multiple of 15. If it is determined during step 1270 that the current value of the loop index, i, is not a multiple of the specified rejuvenation period, then program control will return to step 1240 and continue processing in the manner described above.
- If, however, it is determined during step 1270 that the current value of the loop index, i, is a multiple of the specified rejuvenation period, then the output buffer should be flushed during step 1275, before the memory is restored to the clean state. Thereafter, the volatile state is rolled back during step 1280 by executing the restoration subroutine 158 with a return value equal to the current value of the loop index, i. Because the checkpoint does not include any user files, only the clean memory state is restored. As previously indicated, the restoration subroutine 158 will return from the checkpoint function in step 1230 with the return value. Thus, by retaining the return value, which is equal to the loop index, correct progress of the user application process is ensured. When the restoration subroutine 158 returns from the checkpoint function, program control will proceed to step 1240 and continue in the manner described above.
- If it is determined during step 1250 that there are no additional values of the loop index, i, to be processed, program control will proceed to step 1290 where execution of the memory rejuvenation subroutine 1200 will terminate.
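The rejuvenation loop can be sketched in C as follows. This is an illustrative model only: memory degradation is represented by a hypothetical counter of leaked blocks rather than real heap state, the clean state is an in-memory copy, and the retained loop index is carried in a separate variable instead of being passed through the restoration subroutine.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical model of the process memory state. */
struct mem_state {
    int leaked_blocks;
};

static struct mem_state live_mem;    /* current memory state */
static struct mem_state clean_mem;   /* checkpointed clean state */
static jmp_buf clean_ctx;
static int retained_index;           /* loop index retained across the rollback */
static int loop_index;
static int peak_leak;                /* worst degradation observed */

/* Runs the given number of iterations, rolling back to the clean memory state
 * whenever the loop index is a multiple of period. Returns the peak number of
 * leaked blocks that accumulated between rejuvenations. */
int rejuvenating_loop(int iterations, int period)
{
    assert(iterations > 0 && period > 0);

    loop_index = 0;                       /* step 1210 */
    live_mem.leaked_blocks = 0;           /* step 1215: initialization */
    peak_leak = 0;
    clean_mem = live_mem;                 /* step 1230: checkpoint clean state */

    if (setjmp(clean_ctx) != 0)
        loop_index = retained_index;      /* resume with the retained index */

    for (;;) {
        live_mem.leaked_blocks++;         /* step 1240: each task leaks a block */
        if (live_mem.leaked_blocks > peak_leak)
            peak_leak = live_mem.leaked_blocks;

        if (loop_index + 1 >= iterations) /* step 1250 */
            break;
        loop_index++;                     /* step 1255 */

        if (loop_index % period == 0) {   /* step 1270 */
            /* step 1275: the output buffer would be flushed here */
            retained_index = loop_index;
            live_mem = clean_mem;         /* step 1280: restore clean memory */
            longjmp(clean_ctx, 1);
        }
    }
    return peak_leak;                     /* step 1290 */
}
```

With ten iterations and a rejuvenation period of three, the leak count never exceeds three; with a period longer than the run, all ten leaked blocks accumulate.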
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Retry When Errors Occur (AREA)
Abstract
A checkpoint and restoration system (10) uses checkpoint and restoration techniques for user application processes that save the volatile state and portions of the persistent state during execution, and thereafter restore the saved state. A lazy checkpoint technique delays the taking of the persistent state checkpoint until an inconsistency between the checkpointed volatile state and the persistent state is about to occur. The checkpoint and restoration system (10) allows a user or a user application process (40) to specify portions of the persistent state to be excluded from a checkpoint. A selected portion of the pre-restoration process state, such as a return value argument, may be protected before the user application process is restored to a checkpointed state, such that the pre-restoration values of the protected state are retained following the restoration. The retained return value allows segments of restoration code to be executed following a restoration and allows a normal execution mode to be distinguished from a restoration mode.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP9503017A JPH11508069A (ja) | 1995-06-16 | 1995-06-16 | 持続性状態のチェックポイントおよび復旧システム |
US08/981,298 US6105148A (en) | 1995-06-16 | 1995-06-16 | Persistent state checkpoint and restoration systems |
PCT/US1995/007629 WO1997000476A1 (fr) | 1995-06-16 | 1995-06-16 | Systemes de point de reprise et de restauration de l'etat persistant |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US1995/007629 WO1997000476A1 (fr) | 1995-06-16 | 1995-06-16 | Systemes de point de reprise et de restauration de l'etat persistant |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1997000476A1 (fr) | 1997-01-03 |
Family
ID=22249319
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1995/007629 WO1997000476A1 (fr) | 1995-06-16 | 1995-06-16 | Systemes de point de reprise et de restauration de l'etat persistant |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPH11508069A (fr) |
WO (1) | WO1997000476A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000003320A1 (fr) * | 1998-07-13 | 2000-01-20 | Stratfor Systems, Inc. | Procede et systeme pour l'identification de l'etat d'un appareil a support d'enregistrement par surveillance des macros de gestion de systemes de fichiers |
WO2000036510A1 (fr) * | 1998-12-14 | 2000-06-22 | Sun Microsystems, Inc. | Procede et systeme de reprise par logiciel oriente objet |
US7003770B1 (en) | 1998-12-16 | 2006-02-21 | Kent Ridge Digital Labs | Method of detaching and re-attaching components of a computing process |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515430A (zh) * | 2021-09-14 | 2021-10-19 | 国汽智控(北京)科技有限公司 | 进程的状态监控方法、装置和设备 |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4498145A (en) * | 1982-06-30 | 1985-02-05 | International Business Machines Corporation | Method for assuring atomicity of multi-row update operations in a database system |
US4814971A (en) * | 1985-09-11 | 1989-03-21 | Texas Instruments Incorporated | Virtual memory recovery system using persistent roots for selective garbage collection and sibling page timestamping for defining checkpoint state |
US4819156A (en) * | 1986-06-13 | 1989-04-04 | International Business Machines Corporation | Database index journaling for enhanced recovery |
US4907150A (en) * | 1986-01-17 | 1990-03-06 | International Business Machines Corporation | Apparatus and method for suspending and resuming software applications on a computer |
US5201044A (en) * | 1990-04-16 | 1993-04-06 | International Business Machines Corporation | Data processing method for file status recovery includes providing a log file of atomic transactions that may span both volatile and non volatile memory |
US5263154A (en) * | 1992-04-20 | 1993-11-16 | International Business Machines Corporation | Method and system for incremental time zero backup copying of data |
US5333303A (en) * | 1991-03-28 | 1994-07-26 | International Business Machines Corporation | Method for providing data availability in a transaction-oriented system during restart after a failure |
US5440726A (en) * | 1994-06-22 | 1995-08-08 | At&T Corp. | Progressive retry method and apparatus having reusable software modules for software failure recovery in multi-process message-passing applications |
-
1995
- 1995-06-16 JP JP9503017A patent/JPH11508069A/ja active Pending
- 1995-06-16 WO PCT/US1995/007629 patent/WO1997000476A1/fr active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000003320A1 (fr) * | 1998-07-13 | 2000-01-20 | Stratfor Systems, Inc. | Procede et systeme pour l'identification de l'etat d'un appareil a support d'enregistrement par surveillance des macros de gestion de systemes de fichiers |
WO2000036510A1 (fr) * | 1998-12-14 | 2000-06-22 | Sun Microsystems, Inc. | Procede et systeme de reprise par logiciel oriente objet |
US6295611B1 (en) | 1998-12-14 | 2001-09-25 | Sun Microsystems, Inc.. | Method and system for software recovery |
US6430703B1 (en) | 1998-12-14 | 2002-08-06 | Sun Microsystems, Inc. | Method and system for software recovery |
US7003770B1 (en) | 1998-12-16 | 2006-02-21 | Kent Ridge Digital Labs | Method of detaching and re-attaching components of a computing process |
Also Published As
Publication number | Publication date |
---|---|
JPH11508069A (ja) | 1999-07-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6105148A (en) | Persistent state checkpoint and restoration systems | |
US6044475A (en) | Checkpoint and restoration systems for execution control | |
US6401216B1 (en) | System of performing checkpoint/restart of a parallel program | |
US5590277A (en) | Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications | |
US5530802A (en) | Input sequence reordering method for software failure recovery | |
Bressoud | TFT: A software system for application-transparent fault tolerance | |
US6393583B1 (en) | Method of performing checkpoint/restart of a parallel program | |
CA2150059C (en) | Progressive retry method and apparatus having reusable software modules for software failure recovery in multi-process message-passing applications | |
US6332200B1 (en) | Capturing and identifying a complete and consistent set of checkpoint files | |
US5948112A (en) | Method and apparatus for recovering from software faults | |
JP3145236B2 (en) | Fault-tolerant computing apparatus | |
US5938775A (en) | Distributed recovery with κ-optimistic logging | |
US4674038A (en) | Recovery of guest virtual machines after failure of a host real machine | |
JP4321705B2 (en) | Apparatus and storage system for controlling snapshot acquisition | |
JP3675802B2 (en) | Method and system for reconstructing the state of a computation | |
US5448718A (en) | Method and system for time zero backup session security | |
US6338147B1 (en) | Program products for performing checkpoint/restart of a parallel program | |
EP0119806B1 (en) | Asynchronous checkpointing method for error recovery | |
Silva et al. | Fault-tolerant execution of mobile agents | |
EP0566966A2 (en) | Method and system for incremental backup copying of data | |
KR950014175B1 (en) | Method and means for time zero backup copying of data | |
JP3481737B2 (en) | Dump collection apparatus and dump collection method | |
EP1001343B1 (en) | Highly available asynchronous I/O for clustered computer systems | |
US7165160B2 (en) | Computing system with memory mirroring and snapshot reliability | |
US6256751B1 (en) | Restoring checkpointed processes without restoring attributes of external data referenced by the processes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AK | Designated states | Kind code of ref document: A1; Designated state(s): CA JP US |
 | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
 | ENP | Entry into the national phase | Ref country code: JP; Ref document number: 1997 503017; Kind code of ref document: A; Format of ref document f/p: F |
 | WWE | WIPO information: entry into national phase | Ref document number: 08981298; Country of ref document: US |
 | NENP | Non-entry into the national phase | Ref country code: CA |