US5712971A - Methods and systems for reconstructing the state of a computation - Google Patents

Info

Publication number
US5712971A
Authority
US
Grant status
Grant
Prior art keywords
program
state
phase
execution
application
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US08570724
Inventor
Craig Stanfill
Cliff Lasser
Robert Lordi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ab Initio Technology LLC
Original Assignee
Ab Initio Software LLC

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G06F11/1474 Saving, restoring, recovering or retrying in transactions

Abstract

Methods and systems for running and checkpointing parallel and distributed applications that do not require modification to the programs used in the system or changes to the underlying operating system. One embodiment of the invention includes the following general steps: (1) starting an application on a parallel processing system; (2) controlling processes for the application, including recording of commands and responses; (3) controlling a commit protocol; (4) detecting failures of the application; (5) continuing execution of the application from the most recently committed transaction after "replaying" the recorded commands and responses. A second embodiment comprises the following general steps: (1) starting an application on a parallel processing system; (2) controlling processes for the application, including recurrent recording of the memory image of a driver program that controls the application; (3) controlling a commit protocol; (4) detecting failures of the application; (5) continuing execution of the application from the most recently committed transaction after "restoring" the recorded memory image of the driver program.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to computer processing systems, and more particularly to methods and systems for reconstructing the state of an interrupted computation in a parallel processing computer environment.

2. Description of Related Art

Computational speeds of single processor computers have advanced tremendously over the past three decades. However, many fields require computational capacity that exceeds even the fastest single processor computer. An example is in transactional processing, where multiple users access computer resources concurrently, and where response times must be low for the system to be commercially acceptable. Another example is in database mining, where hundreds of gigabytes of information must be processed, and where processing data on a serial computer might take days or weeks. Accordingly, a variety of "parallel processing" systems have been developed to handle such problems. For purposes of this discussion, parallel processing systems include any configuration of computer systems using multiple central processing units (CPUs), either local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remotely distributed (e.g., multiple processors coupled via LAN or WAN networks), or any combination thereof.

Complex data processing applications running on parallel processing systems typically make changes to multiple external collections of data (files, databases, etc.). Such applications do this by running one or more programs either concurrently or sequentially. If a failure occurs, partial changes may have been made to the external collections of data, which render that data unusable by the current application or other applications. In parallel processing systems, the problem is intensified since the collection of data will often be spread over many different nodes and storage units (e.g., magnetic disks), making the work required to "roll back" the state of the data increase proportionately with the number of storage units. Similarly, the number of programs which must be terminated can be large.

To recover from such failures, it is necessary to shut down the current (i.e., failed) application, and then either:

(1) undo all changes made by the application since its start (a "full rollback"), or

(2) restore the state of the system to an intermediate "checkpoint" and restart execution from that point (a "partial rollback").

Partial rollbacks from a checkpoint (also known as "checkpointing") have advantages over full rollbacks, in that less work will be lost if a failure occurs, and partial rollbacks require less information to be retained. However, checkpointing is a complex technical problem, because it is difficult to (1) capture the state of running programs; (2) consistently roll back the state of all data files being modified; and (3) capture data in transit between programs (e.g., data being sent via a network). The problem is compounded by the fact that, in most cases, application programs must be specially written to provide checkpointing. In general, it is not possible to modify programs not designed for checkpointing to add explicit calls to a checkpointing software package without substantial changes to the source code for the program. Furthermore, most operating systems do not provide facilities to capture data in transit between programs.

Accordingly, there is a need for a method of providing checkpointing for applications which do not specifically provide for checkpointing. The present invention provides such a method that is particularly useful for applications running on parallel processing systems, and is also useful for applications running on distributed processing systems.

SUMMARY OF THE INVENTION

The present invention is a method and system for running and checkpointing parallel and distributed applications which does not require modification to the programs used in the system nor changes to the underlying operating system. The invention encompasses two distinct embodiments. The first preferred embodiment comprises the following general steps:

(1) starting an application on a parallel processing system;

(2) controlling processes for the application, including recording of commands and responses;

(3) controlling a commit protocol;

(4) detecting failures of the application;

(5) continuing execution of the application from the most recently committed transaction after "replaying" the recorded commands and responses.

The second preferred embodiment comprises the following general steps:

(1) starting an application on a parallel processing system;

(2) controlling processes for the application, including recurrent recording of the memory image of a driver program that controls the application;

(3) controlling a commit protocol;

(4) detecting failures of the application;

(5) continuing execution of the application from the most recently committed transaction after "restoring" the recorded memory image of the driver program.

The principal features of the inventive architecture are:

(1) Central Control. Applications are run from a central point of control. In the preferred embodiment, a single "driver" program with a single thread of control instantiates and monitors all programs and data collections which form the application.

(2) Control via Host and Agents. To allow for distribution of processing over multiple nodes, a program called an "agent" is used to actuate changes on remote nodes. In the preferred embodiment, a separate agent is instantiated on each node. Overall control of the system is maintained by a "host" program, which manages communications with the driver program and agents, and maintains the global system state.

(3) A Single Command Channel. A "command channel" is maintained between the driver program and the host program. In the preferred embodiment, the driver program effects changes on the system solely through a set of commands and replies using the command channel.

(4) Recording of Command Channel Traffic or Memory Image. In the first embodiment, all commands and replies passing over the command channel are recorded by the host program and saved in non-volatile storage. In the second embodiment, the memory image of the driver program that controls the application is recurrently recorded by the host program and saved in non-volatile storage.

(5) Transaction-based Control. In the preferred embodiment, all operations performed via the command channel use a commit protocol (preferably a two-phase commit protocol) to ensure global atomicity.

(6) Recovery With Recapitulation. Using the above mechanisms, the invention provides the ability to "recover" a failed application by simply rerunning it. Briefly stated, the state of all data is restored via the commit protocol, then either the recorded traffic on the command channel is used to "trick" the driver program into believing the driver program is executing the application de novo, or the memory image of the last known good state for the driver program is restored. Owing to the deterministic nature of single-threaded computer programs, the driver program will of necessity end up, as of the last known good state, in the same state as it did the first time the program was run.

The principal intended use of the invention is in traditional data processing applications (e.g., accounting systems, batch transaction systems, etc.), but the invention could be applied to almost any computer application which makes changes to files or databases.

The details of the preferred embodiment of the present invention are set forth in the accompanying drawings and the description below. Once the details of the invention are known, numerous additional innovations and changes will become obvious to one skilled in the art.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the software components and the flow of control of a checkpointing system in accordance with the present invention.

FIG. 2a is a block diagram showing normal execution of a checkpointed program in accordance with the present invention.

FIG. 2b is a block diagram showing a failure during execution of the checkpointed program of FIG. 2a.

FIG. 2c is a block diagram showing recovery from the failure of FIG. 2b.

FIG. 3 is a block diagram showing the software components and the flow of control of a checkpointing system during recovery from a failure in accordance with the present invention.

FIG. 4 is a flow chart showing in summary form the basic functional operations of the recapitulation embodiment of the present invention.

FIG. 5 is a flow chart showing in summary form the basic functional operations of the restoration embodiment of the present invention.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION OF THE INVENTION

Throughout this description, the preferred embodiment and examples shown should be considered as exemplars, rather than as limitations on the present invention.

Overview

FIG. 1 is a block diagram showing the software components and the flow of control of a checkpointing system in accordance with the present invention. A host system 10 includes a host program 12, a driver program 14, and a data storage system 16 for recording commands and replies. The host program 12 and driver program 14 are intercoupled by a command channel 18 (which may be, for example, a logical channel on a physical bus). In the preferred embodiment, the host program 12 is actually an object within the address space of the driver program 14. A separate process running the host program could be implemented instead. However, the division into a `host program` and a `driver program` is a convenient way to describe the architecture of the present invention.

The host system 10 is coupled to at least one remote system 20 by means of an agent communication channel 22 (which may be, for example, a logical channel on a conventional physical data link 22). Within each remote system 20 is an agent 24 which is coupled to remote data storage 26.

All components shown in FIG. 1 are active during normal execution. The driver program 14 issues commands to the host program 12 to effect operations on applications on various remote systems 20. The host program 12 responds to such commands by issuing commands to one or more agents 24 to perform the requested operations. The agents 24 reply back to the host program 12 when the operations are completed, and the host program 12 in turn replies back to the driver program 14. All commands and replies between the driver program 14 and host program 12 are recorded in the data storage system 16.

FIG. 2a shows normal execution of a checkpointed system in accordance with the present invention. In the example shown, the application executes in three phases, beginning in an Initial state 0, proceeding through Phase 0 to a Checkpoint 1 state, then proceeding through Phase 1 to a Checkpoint 2 state, and then through Phase 2 to a Final state.

FIG. 2b is a block diagram showing a failure during execution of the checkpointed program of FIG. 2a. A failure may occur, for example, if one of the nodes "crashes" and has to be restarted. In the example shown, sometime after the Checkpoint 1 state has been reached, a failure occurs. Execution is halted in the middle of Phase 1, leaving the external state of the parallel processing systems in an undesirable Failure state.

FIG. 2c is a block diagram showing recovery from the failure shown in FIG. 2b using the present invention. When the application is recovered by re-executing it, operations performed in failed Phase 1 are rolled back, returning the external state of the processing system to the state that existed at the Checkpoint 1 state. All completed phases (in this example, Phase 0) are then "recapitulated" or "restored". In recapitulation, the driver program 14 restarts from its initial state and functions normally. However, no external state changes occur until the driver program 14 reaches the same state that existed at the Checkpoint 1 state. In restoration, a saved image of the driver program 14 is restored and then functions normally from that point. Thereafter, the failed phase (Phase 1 in this example) and all subsequent phases are executed normally, taking the application through the Checkpoint 2 state and thence to the Final state.

FIG. 3 is a block diagram showing the architecture of a checkpointing system in accordance with the present invention during recapitulation. During the recapitulation mode, the driver program 14 starts over from the Initial state. Each command from the driver program 14 is reissued to the host program 12, which matches that command to recorded commands and replies previously stored in the data storage system 16. As long as the sequence of commands from the driver program 14 matches the recorded commands, the corresponding recorded replies can be fed back by the host program 12 to the driver program 14, in effect "tricking" the driver program 14 into thinking that the phases being recapitulated are in fact executing normally. However, no data is actually transformed, moved, etc. Thus, the recapitulation stage proceeds extremely fast, until the driver program 14 reaches the last known good checkpoint state. At that point, the host program 12 switches out of the recapitulation mode, and back into the normal operating mode, supplying commands from the driver program 14 to agents 24 in the remote systems 20, in normal fashion.
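The recapitulation mode described above can be illustrated with a minimal sketch. It is not the patented implementation: the class name `Host`, the `forward` callback (standing in for dispatch to the agents 24), and the in-memory `log` list (standing in for the recorded traffic in the data storage system 16) are all hypothetical, and the sketch assumes the log holds only traffic from committed phases.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class ReplayMismatch(Exception):
    """Raised when a reissued command differs from the recorded one."""


@dataclass
class Host:
    # forward(command) -> reply; hypothetical stand-in for the remote agents.
    forward: Callable[[str], str]
    # Recorded (command, reply) pairs, assumed to cover committed phases only.
    log: List[Tuple[str, str]] = field(default_factory=list)
    cursor: int = 0  # index of the next recorded entry to match

    @property
    def recapitulating(self) -> bool:
        # Replay continues while recorded entries remain unmatched.
        return self.cursor < len(self.log)

    def handle(self, command: str) -> str:
        if self.recapitulating:
            recorded_command, recorded_reply = self.log[self.cursor]
            if command != recorded_command:
                raise ReplayMismatch(
                    f"expected {recorded_command!r}, got {command!r}")
            self.cursor += 1
            return recorded_reply  # fed back without contacting any agent
        # Normal mode: perform the operation and record the exchange.
        reply = self.forward(command)
        self.log.append((command, reply))
        self.cursor = len(self.log)
        return reply
```

A recovered host would be constructed with the previously recorded log, for example `Host(forward=send_to_agents, log=saved_log)` where `send_to_agents` and `saved_log` are placeholders. The driver then reissues its commands from the start, and the host switches to normal forwarding once the log is exhausted.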

The invention may be implemented in hardware or software, or a combination of both. However, preferably, the invention is implemented in computer programs executing on programmable computers each comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.

Each program is preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language.

Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

The Driver Program

The invention uses a driver program 14 that provides supervisory functions for setting up and controlling execution of one or more existing non-checkpointing programs. Additional functions may be included if desired. In the preferred embodiment, the driver program 14 performs at least the functions defined below:

(a) Start Job. When the driver program 14 starts, it requests that the host system 10 instantiate the host program 12. The host system 10 responds by providing the driver program 14 with an identifier or pointer to a command channel 18 for communication with the host program 12 and the data storage system 16. The driver program 14 connects to the host program 12 over the command channel 18 and issues a "Start Job" command to the host program 12. This command includes the name of a "recovery file" to be established by the host system 10 in the data storage system 16. (In some implementations, `connecting` may require starting a separate process; in other implementations, `connecting` only requires initializing some internal data structures. In either event, the first action is always to start a job as described.)

(b) Commands. The command channel 18 accepts at least the following commands from the driver program 14:

(1) Remote Procedure Call. This call causes a command to be executed by a remote agent 24. The Remote Procedure Call (RPC) command specifies the node on which the command is to be executed. If no agent is currently running on that node, the host program 12 starts up a remote agent 24 on that node.

(2) Start Process. This command causes a process to be started by an agent 24. Again, the command specifies the node on which the process is to be run. If no agent is currently running on that node, the host program 12 starts up a remote agent 24 on that node.

(3) Wait. This command causes execution of the driver program 14 to be suspended until all processes have finished execution.

(4) Prepare; Commit; Rollback. These three commands have their usual prior art meaning with regard to a conventional two-phase commit transaction processing protocol.

(c) Receipt of Replies. Every command results in exactly one message in reply. Commands and replies may be arbitrarily intermixed (e.g., several commands may be issued before corresponding replies come back). The driver program 14 accepts at least the following replies from the host program 12:

(1) Remote Procedure Call Reply. The contents of this message are specific to the procedure which was invoked by an RPC command.

(2) Process ID. When a process is started on a system (host 10 or remote 20), the system replies with an identifier for that process.

(3) Wait Status. An indication as to whether the various processes terminated successfully.

(4) Prepare/Commit/Rollback Status. A success or failure status indicator.

(d) Abort. This command signifies that execution is to be halted. The host program 12 will attempt to perform a rollback (this may fail if, for example, one of the nodes involved in the computation has crashed). Whether or not the rollback succeeds, the command channel 18 is typically destroyed. An Abort may be manually issued by the driver program 14 if, for example, the driver program 14 detects a run-time failure. An Abort may also be implicitly issued if the driver program 14 fails.

(e) Finish Job. This command signifies that no more work is to be done. The command channel 18 is typically destroyed, and all changes made by the application become irrevocable.

(f) Additional Commands. Additional commands may be added as desired, but are not important for purposes of this disclosure.

Importantly, the driver program 14 divides execution of an application into a series of "phases," such that all processes are required to quiesce (e.g., exit or reach an idle state with no outstanding data transfers) between phases. A phase consists of the following steps:

(1) The driver program 14 issues a series of RPCs (e.g., to set up data files, etc.), which will be needed by one or more application programs.

(2) Optionally, the driver program 14 issues a series of Start Process commands.

(3) If any processes have been started, then after a desired number of processes have been started, the driver program 14 issues a Wait command, and suspends operation until the wait is complete, thus giving time for the processes to complete. In general, all processes that need to cross-communicate with other processes should be started concurrently.

(4) The sequence of steps (1)-(3) may be repeated several times, if desired.

(5) The driver program 14 issues Prepare and Commit commands, causing the current transaction to be committed and any changes made during the current transaction to become permanent.

(6) Further phases of execution follow.
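The steps of a phase listed above might be sketched from the driver's side as follows. This is a hedged illustration only: the `send` callable, the command strings, the `"ok"` reply convention, and the decision to issue a Rollback when Wait or Prepare reports failure are assumptions made for the example, not details taken from the description.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical command channel: (command, node, argument) -> reply string.
Send = Callable[[str, str, str], str]


def run_phase(send: Send,
              rpcs: Sequence[Tuple[str, str]],
              processes: Sequence[Tuple[str, str]]) -> None:
    """Run one phase: RPCs, Start Process, Wait, then Prepare and Commit."""
    for node, call in rpcs:                  # step (1): set up files, etc.
        send("RPC", node, call)
    for node, program in processes:          # step (2): optional processes
        send("START_PROCESS", node, program)
    if processes:                            # step (3): quiesce before commit
        if send("WAIT", "", "") != "ok":
            send("ROLLBACK", "", "")         # assumed failure policy
            raise RuntimeError("a process failed; phase rolled back")
    if send("PREPARE", "", "") != "ok":      # step (5): two-phase commit
        send("ROLLBACK", "", "")             # assumed failure policy
        raise RuntimeError("prepare failed; phase rolled back")
    send("COMMIT", "", "")
```

A driver would call `run_phase` once per phase, in order, so that each successful return corresponds to one committed checkpoint.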

A key step is issuing the Wait command on a recurring basis, since after this command terminates it is guaranteed that there are no active application programs in the system, and that there is no data in transit in communication channels. This characteristic allows the invention to work around the difficulties inherent in capturing the states of running programs and capturing data in transit in communication channels.

A consequence of this design is that checkpoints may not be created while programs are running. If one of the application programs runs for several hours, then there will necessarily be a period of several hours when it is not possible to create a checkpoint. It is the responsibility of the driver program 14, which will often be written by a user of this system, to ensure that the run-time of any phase is not excessive.

There are two techniques in the prior art which may be used to reduce the duration of phases. The first technique is to reduce the use of "pipelining" between successive stages of processing within a phase. Specifically, it is common practice to compose applications from component programs by linking them via communication channels, a technique called "pipelining." Both component programs would necessarily run in the same phase. If this would result in too long a phase, then the writer of the driver could substitute temporary files for the communication channels, and run each component program in a separate execution phase. The second technique is to divide the data more finely. For example, rather than processing a single file with 10 gigabytes, one might divide it into 10 sub-files of 1 gigabyte each, and process each sub-file in a single execution phase. Owing to the fact that the run-times of most applications are roughly proportional to the lengths of their input files, this method might achieve, e.g., a 10-fold decrease in the duration of phases, greatly improving the frequency with which checkpoints may be created. (In prior art, this additional subdivision has been done on an ad-hoc basis, and has generally required the modification of programs and perhaps the writing of additional software. Reference is made to the co-pending patent application entitled "Overpartitioning System and Method for Increasing Checkpoints in Component-based Parallel Applications", assigned to the assignee of the present invention, in which some general methods are explained by which this subdivision may be accomplished without modification to the original programs.)

State Databases (SDB)

The host system 10 creates a state database for itself (the "Host SDB"). Each agent 24 also creates its own state database (the "Agent SDBs"). The Host SDB is used to record command channel traffic when using the recapitulation embodiment of the present invention, to record which phase is being executed, and to record information needed for commit processing. The Agent SDBs are used to record information for recovery processing and commit processing. In the preferred embodiment, an SDB exists in memory only for the life of the program which accesses it. However, all changes to the SDB are recorded sequentially in an ordered journal file (a "log") in non-volatile storage, such as the data storage system 16. At any time, an SDB can be reconstructed in memory from the corresponding log. In the preferred embodiment, the log is the only persistent storage associated with an SDB. Reconstruction of an SDB is performed by starting with an empty database and reading from the log a series of changes to the database, and reflecting the changes in the in-memory database contents.

In the preferred embodiment, all entries to an SDB are in the form of a pair of text strings: a key and a value. When writing an entry (a "Put"), the calling program supplies a key/value entry which is stored in the database. If an entry having the identical key existed before the Put operation, it is replaced. When reading an entry (a "Get"), the calling program supplies a key, and if an entry exists having that key, its value string is returned. In addition, in the preferred embodiment, the SDB interface allows for the creation of lists, which are sequences of entries which can be accessed sequentially rather than by key. List entries are normal string values.

A user "opens" an SDB by supplying the name of a log file. If the log file exists, it is read, and the corresponding SDB is reconstructed from the log contents. If the log file does not exist, a log file is created and an SDB is created in memory as an empty data structure. The SDB is then available for Put and Get operations until it is closed. Closing a database is effected explicitly using a Close operation, or implicitly if the accessing program ceases to exist.

In the preferred embodiment, all operations on an SDB are grouped into "transactions". Any Get or Put will start a new transaction if one is not in progress. A transaction continues through subsequent Puts or Gets, until:

(1) An SDB Commit operation is performed.

(2) An SDB Rollback operation is performed, which cancels the effects of the entire transaction on the SDB.

(3) The SDB is closed, which implicitly rolls back any active transactions to the last known good state.

In the preferred embodiment, transactions on an SDB are not part of the global commit architecture, and have a much finer granularity.
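The SDB behavior described in this section (Put, Get, reconstruction from a log, and Commit or Rollback of a transaction) can be sketched as below. The class name `StateDB`, the JSON-lines log format, and the choice to append entries to the log only at commit time are assumptions made for illustration. The description does not specify the log format, list support is omitted, and the sketch does not guard against a write torn by a crash in the middle of a commit.

```python
import json
import os
from typing import Dict, List, Optional, Tuple


class StateDB:
    """In-memory key/value store rebuilt from an append-only log file."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.data: Dict[str, str] = {}
        self.pending: List[Tuple[str, str]] = []  # Puts of the open transaction
        if os.path.exists(log_path):
            # "Open" an existing SDB: replay the log into an empty database.
            with open(log_path, "r", encoding="utf-8") as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value  # later entries replace earlier ones
        else:
            # No log yet: create an empty log and an empty database.
            open(log_path, "w", encoding="utf-8").close()

    def put(self, key: str, value: str) -> None:
        self.pending.append((key, value))

    def get(self, key: str) -> Optional[str]:
        for k, v in reversed(self.pending):  # a transaction sees its own Puts
            if k == key:
                return v
        return self.data.get(key)

    def commit(self) -> None:
        # Append the transaction to the log and force it to stable storage.
        with open(self.log_path, "a", encoding="utf-8") as f:
            for key, value in self.pending:
                f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data.update(self.pending)
        self.pending.clear()

    def rollback(self) -> None:
        self.pending.clear()

    def close(self) -> None:
        self.rollback()  # closing discards any uncommitted transaction
```

Reopening the same log path after a crash yields only the committed entries, which is the property the host and agents rely on when they record phase numbers and commit information.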

The Host Program

When the driver program 14 starts a job, the host system 10 creates a command channel 18 to permit communication between the driver program 14 and a host program 12. The host program 12 handles data at the host system 10 end of the command channel 18. There is one host program 12 for each application. The host program 12 operates as follows:

(a) Starting a Job. In the preferred embodiment, the following functions are performed by the host program 12 at the start of an application:

(1) Job ID. The host program 12 creates a unique identifier called the "Job ID" using the Internet address of the system on which the host program 12 is running, a time stamp, and the process ID of the host program 12.

(2) Host SDB. The host program 12 creates the Host SDB, using the Job ID as an identifier. In the preferred embodiment, the Job ID is stored in the Host SDB.

(3) Recovery File. The host program 12 also writes a file called a "recovery file" to the data storage system 16 at the start of its execution. This file also contains the Job ID, which can be used to open the Host SDB.
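As a rough sketch of the job start-up steps above, a Job ID can be assembled from the three ingredients named in step (1) and saved in a recovery file as in step (3). The function names, the `-` separator, the use of a nanosecond time stamp, the loopback fallback, and the one-line file layout are all hypothetical choices for this example.

```python
import os
import socket
import time


def make_job_id() -> str:
    """Build a job identifier from host address, time stamp, and process ID."""
    try:
        address = socket.gethostbyname(socket.gethostname())
    except OSError:
        address = "127.0.0.1"  # assumed fallback when the name cannot resolve
    return f"{address}-{time.time_ns()}-{os.getpid()}"


def write_recovery_file(path: str, job_id: str) -> None:
    """Save the Job ID so a later run can locate the Host SDB."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(job_id + "\n")


def read_job_id(path: str) -> str:
    """Read the Job ID back from a recovery file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readline().strip()
```

On restart, a host could call `read_job_id` on the recovery file named in the Start Job command and use the result to open the corresponding Host SDB log.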

(b) Remote Nodes and Agents. The host program 12 starts processes and operates on files/databases via one or more agents 24. Each agent 24 establishes a bidirectional connection (an agent communication channel 22) with the host program 12, for transmitting commands to the agent 24 and receiving replies from the agent 24. An agent 24 is started on each remote system 20 where application programs will be run or on which files or other data collections are located. In the preferred embodiment, agents 24 are started on an as-needed basis, instead of all at once at the start of an application.

(c) Execution of Phases. The host program 12 uses agents 24 to implement the division of execution of a non-checkpointing application into phases, as directed by the driver program 14. The host program 12 is responsible for the transactional mechanisms which implement this division, and performs certain bookkeeping and coordination functions, as described below.

In the preferred embodiment, phases are numbered starting at zero. A phase is always in one of four "states": RUNNING, ERROR, PREPARED, or COMMITTING. When a new phase is started, it is in the RUNNING state. It continues in the RUNNING state as remote operations are performed during the phase.

The current phase number and its state are recorded in the Host SDB. This information is recorded at every state transition. At any time during the phase, the driver program 14 can invoke the Rollback function. This function causes the effects of all operations performed so far during a current phase to be nullified, returning all changed state back to what it was at the start of the phase, and putting the phase into the RUNNING state.

During each phase, while in the RUNNING state, the driver program 14 may issue Start Process and Remote Procedure Call (RPC) commands to the host program 12. The host program 12 forwards those commands to the appropriate agent 24, gathers replies from each agent 24, and sends the replies to the driver program 14. In the recapitulation embodiment of the present invention, all of these commands and corresponding replies are recorded in the Host SDB. This information is used for the "recapitulation mode", described below.

Once all processes started by the driver program 14 have exited, the driver program 14 can invoke the Prepare function, putting the current phase of an application into the PREPARED state. Following this, the driver program 14 can invoke the Commit function, which causes all effects of all operations performed during the phase to be completely, correctly, and irrevocably made, thus ending the phase. The Rollback function can also be called in the PREPARED state, with the same effect as if called before the Prepare function.

The preferred embodiment of the present invention uses a conventional two-phase commit protocol. In the preferred recapitulation embodiment, the two-phase commit protocol is as follows:

(1) Prepare. The Prepare command is performed by the host program 12 by:

1) Storing the recorded command channel data from the current phase in the Host SDB, using a key containing the current phase number. If such information already existed (e.g., due to a prior execution of the same phase which failed during commit processing), it will be overwritten.

2) Sending a Prepare command from the host program 12 to each agent 24 which executed commands during the phase. Each agent 24 will, in accordance with the conventional two-phase commit protocol, enter a PREPARED state such that, at any subsequent time, it may either execute a Rollback command which will restore the state of all resources under the control of the agent 24, or a Commit command which will make permanent all changes to all resources under the control of the agent 24. This PREPARED state must be durable, i.e., it must be possible to reconstruct the prepared state after a system failure, and then to either execute a Rollback or a Commit operation. When the PREPARED state has been attained, each agent 24 will signal this fact by responding to the Prepare command.

3) Waiting for all agents 24 to indicate successful completion of the Prepare command.

4) Setting the state of the host program 12 to PREPARED, and noting that change in the Host SDB.

(2) Commit. The Commit command is performed by:

1) Setting the state of the host program 12 to COMMITTING, and noting that change in the Host SDB.

2) Sending a Commit command to each agent 24 which executed commands during the phase. Each agent 24 will then cause all changes to all resources under its control to become permanent, possibly erasing information which might have been required in the event of a rollback. When such processing is complete, it will signal this fact by responding to the Commit command.

3) Waiting for all agents 24 to indicate successful completion of the Commit command.

4) Setting the state of the host program 12 to RUNNING, incrementing the phase number, and noting these changes in the Host SDB.
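
The Prepare and Commit sequences above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this specification: the class names `Host` and `Agent`, the dictionary standing in for the Host SDB, and the direct method calls standing in for command channel messages are all hypothetical. For brevity the sketch sends Prepare and Commit to every agent rather than only to those which executed commands during the phase.

```python
class Agent:
    """Hypothetical stand-in for an agent 24. A real agent would make
    its PREPARED state durable before replying."""

    def __init__(self, name):
        self.name = name
        self.state = "RUNNING"

    def prepare(self):
        self.state = "PREPARED"
        return True  # reply to the Prepare command

    def commit(self):
        self.state = "RUNNING"  # changes are permanent; next phase begins
        return True  # reply to the Commit command


class Host:
    """Hypothetical stand-in for the host program 12."""

    def __init__(self, agents):
        self.agents = agents
        self.sdb = {"phase": 0, "state": "RUNNING"}  # stands in for the Host SDB
        self.channel_log = []  # (command, reply) pairs recorded this phase

    def prepare(self):
        # Step 1: store recorded traffic under a key containing the phase
        # number, overwriting any record left by a prior failed attempt.
        self.sdb[("traffic", self.sdb["phase"])] = list(self.channel_log)
        # Steps 2 and 3: send Prepare to each agent and wait for all replies.
        replies = [agent.prepare() for agent in self.agents]
        if not all(replies):
            raise RuntimeError("an agent failed to reach PREPARED")
        # Step 4: note the new state in the Host SDB.
        self.sdb["state"] = "PREPARED"

    def commit(self):
        if self.sdb["state"] != "PREPARED":
            raise RuntimeError("Commit is only legal in the PREPARED state")
        # Step 1: record COMMITTING before any agent is told to commit.
        self.sdb["state"] = "COMMITTING"
        # Steps 2 and 3: send Commit to each agent and wait for all replies.
        replies = [agent.commit() for agent in self.agents]
        if not all(replies):
            raise RuntimeError("an agent failed to commit")
        # Step 4: advance the phase number and return to RUNNING.
        self.sdb["phase"] += 1
        self.sdb["state"] = "RUNNING"
        self.channel_log = []
```

Recording COMMITTING before contacting any agent is what lets a restarted host distinguish an interrupted Commit, which must be completed, from an interrupted Prepare, which must be rolled back.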

During the RUNNING or PREPARED state, if an error condition is detected by the operating system or the application, the phase will be placed in the ERROR state. In the preferred embodiment, while in this state, no further remote operations can be performed, nor can the state of the phase be changed, nor can a new phase be started. In the preferred embodiment, the only legal actions at this point are:

(1) Debugging. The driver program 14 may use some informational commands to debug the application and/or gather diagnostic information, in known fashion.

(2) Exiting. As the driver program 14 exits, the host program 12 may attempt to issue a Rollback command on behalf of the driver program 14.

(3) Rollback. The driver program 14 may issue a Rollback command, which will place the system in its state as of the start of the current phase, undoing any changes to files/databases, as described above.

In summary, the legal state changes for the host program 12 are as follows:

(1) Initial state: RUNNING.

(2) In any state other than COMMITTING, an error condition causes transition to the ERROR state. This state may be exited by a Rollback command, which will place the system in the RUNNING state, or by the driver program 14 exiting.

(3) In the RUNNING state, a Prepare operation causes transition to the PREPARED state. During the RUNNING or PREPARED state, a Rollback operation causes the phase to be undone, in which case the phase number stays the same, and the system transitions back to the RUNNING state.

(4) In the PREPARED state, a Commit operation causes transition to the COMMITTING state. This state endures for the duration of the Commit operation, then advances the phase number, ending the current phase and starting a new phase in the RUNNING state. Once the Host SDB records the transition to the COMMITTING state, the detection of an error will cause the system to abort. Upon restarting the system, the COMMIT operation will be completed. No rollback is possible while in the COMMITTING state.
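
The legal state changes summarized above can be written as a small transition table. This is a sketch only: the event names (`"prepare"`, `"commit"`, `"commit_done"`, `"rollback"`, `"error"`) and the function `step` are assumptions introduced for illustration and do not appear in the specification.

```python
from enum import Enum


class PhaseState(Enum):
    RUNNING = "RUNNING"
    ERROR = "ERROR"
    PREPARED = "PREPARED"
    COMMITTING = "COMMITTING"


S = PhaseState

# (current state, event) -> next state. Any pair not listed is illegal.
TRANSITIONS = {
    (S.RUNNING, "prepare"): S.PREPARED,
    (S.RUNNING, "rollback"): S.RUNNING,
    (S.RUNNING, "error"): S.ERROR,
    (S.PREPARED, "commit"): S.COMMITTING,
    (S.PREPARED, "rollback"): S.RUNNING,
    (S.PREPARED, "error"): S.ERROR,
    (S.ERROR, "rollback"): S.RUNNING,
    (S.ERROR, "error"): S.ERROR,
    # No rollback or error transition exists for COMMITTING: the Commit
    # operation must run to completion, possibly after a restart.
    (S.COMMITTING, "commit_done"): S.RUNNING,
}


def step(phase, state, event):
    """Return the (phase number, state) that follows the given event."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal event {event!r} in state {state.name}")
    next_state = TRANSITIONS[key]
    if event == "commit_done":
        phase += 1  # only a completed Commit advances the phase number
    return phase, next_state
```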

After completion (committing) of all phases of the application, the driver program 14 issues a Close command, which indicates that the application has successfully completed. This operation deletes the recovery file and the Host SDB.

In the restoration embodiment of the present invention, the procedure is similar, but with several exceptions. First, commands and replies are not stored in the Host SDB. Rather, after the Prepare command has been issued by the driver program 14 and executed by the host program 12 and agents 24, a memory image file of the driver program 14 is stored, preferably to non-volatile storage such as a disk drive, in known fashion. The memory image comprises either the entire address space (including swap files, etc.) for the driver program 14, or only those critical data structures of the driver program 14 (as determined by a programmer for a particular implementation of the driver program 14) necessary to recreate a saved state for the driver program 14. Once writing of a memory image file is confirmed (for example, by comparing the program image still in memory to the stored image file) and the system enters the PREPARED state, the Commit command is issued by the driver program 14 and executed by the host program 12 and agents 24, as above. In the preferred embodiment, a next memory image file is written and confirmed before a prior memory image file is deleted (i.e., "A" and "B" copies are maintained, in known fashion). A prior memory image file should be deleted only after a Commit operation completes.

(d) Recovery. Any application which terminated without having executed the Close operation is considered to have failed. When the application is restarted, recovery is triggered. In the preferred embodiment, whenever a host program 12 is started, it checks for the existence of a recovery file. In the preferred embodiment, if the file exists, the host program 12 assumes that a prior failure occurred, and that the intent is to restart the prior job.

The first step in recovery is to restore all files and databases to their most recently committed state. If the Host SDB indicates a state of RUNNING, ERROR, or PREPARED, then the host program 12 issues a Rollback command, causing all uncommitted operations to be undone, in known fashion. If the Host SDB indicates a state of COMMITTING, the host program 12 re-issues the Commit command, completing what was evidently an interrupted Commit operation.

In the recapitulation embodiment of the present invention, the host program 12 then enters the recapitulation mode. As noted above, the driver program 14 interacts with the rest of the system via a single command channel 18, the traffic on which is automatically stored in the Host SDB during Commit processing, using a key containing the appropriate phase number. When recapitulating the phase, the host program 12 will start by retrieving the saved command channel traffic from the Host SDB. For recapitulation, the driver program 14 is restarted from its initial state, and functions normally. For the duration of recapitulation, all commands sent by the driver program 14 are discarded by the host program 12 (optionally after comparing them, for safety's sake, with the recorded command message traffic). Whenever the driver program 14 expects to receive a reply message via the command channel 18, a reply is fetched by the host program 12 from the recorded incoming reply traffic on the data storage system 16 and immediately provided to the driver program 14. Owing to the deterministic nature of single-threaded computer programs, this process will result in the driver program 14 executing the same sequence of commands as it did during the failed run, and the controlled application program will end up in the same state as it did on the previous run.

When the recorded command channel traffic from all committed phases has been replayed, it is guaranteed that:

(1) The driver program 14 has been restored (by recapitulation) to its state as of the most recent Commit operation; and

(2) All files and databases have also been restored (by use of the two-phase commit protocol) to their state as of the most recent Commit operation.

Thus, the system's state is restored and execution may proceed normally.
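
The recapitulation mode described above can be illustrated with a channel object that answers from recorded traffic until the record is exhausted and only then forwards commands. This is a minimal sketch: `ReplayChannel`, `live_send`, and the toy `driver` function are hypothetical names, and the recorded traffic is modeled as a simple list of (command, reply) pairs.

```python
class ReplayChannel:
    """Hypothetical command channel: replays recorded replies for committed
    phases, then forwards commands to the live system."""

    def __init__(self, recorded, live_send):
        self.recorded = list(recorded)  # [(command, reply), ...] in original order
        self.position = 0
        self.live_send = live_send      # callable used once replay is finished

    def send(self, command):
        if self.position < len(self.recorded):
            expected, reply = self.recorded[self.position]
            # Optional safety check: a deterministic driver must repeat
            # exactly the commands it issued during the failed run.
            if command != expected:
                raise RuntimeError(f"replay diverged: {command!r} != {expected!r}")
            self.position += 1
            return reply  # the command itself is discarded, not re-executed
        return self.live_send(command)


def driver(channel):
    """Toy deterministic driver whose second command depends on the first reply."""
    count = channel.send("count-records")
    return channel.send(f"sort {count}")
```

In this sketch the driver is unaware that it is being replayed. Its control flow is rebuilt purely from the replies it receives, which is why the approach depends on the driver being deterministic.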

In the restoration embodiment of the present invention, the procedure is somewhat different:

(1) If the Host SDB indicates a state of PREPARED and there are two saved memory image files (A and B), then the host program 12 deletes the newer file (thus prohibiting a double Commit possibility), issues a Rollback command, and reloads into memory the older image file (i.e., the last known good saved image of the driver program 14). If the Host SDB indicates a state of PREPARED and there is one saved memory image file (A or B), then the host program 12 issues a Rollback command and reloads into memory that image file.

(2) If the Host SDB indicates a state of COMMITTING, then the host program 12 issues a Commit command and reloads into memory the newest image file.

(3) If the Host SDB indicates a state of RUNNING or ERROR and there are two saved memory image files, then the host program 12 issues a Rollback command and reloads into memory the newer image file; if there is one saved memory image file, then the host program 12 issues a Rollback command and reloads into memory that image file.

In any event, the host program 12 then re-establishes the command channel 18 and resumes execution of the driver program 14.
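
The three restoration cases above reduce to a decision on the recorded state and the number of saved image files. The function below is a sketch of that decision only. Its name, its return shape, and the convention that `images` is ordered oldest first are assumptions made for the example.

```python
def restoration_recovery(state, images):
    """Return (command, image_to_reload, images_to_delete).

    state  -- state recorded in the Host SDB
    images -- one or two saved memory image names, oldest first (assumed)
    """
    if len(images) not in (1, 2):
        raise ValueError("expected one or two saved memory image files")
    oldest, newest = images[0], images[-1]

    if state == "COMMITTING":
        # Finish the interrupted Commit and use the newest image.
        return ("Commit", newest, [])
    if state == "PREPARED":
        # The phase never committed, so the newer image (if present) must
        # not be used: delete it and fall back to the older one.
        doomed = [newest] if len(images) == 2 else []
        return ("Rollback", oldest, doomed)
    if state in ("RUNNING", "ERROR"):
        # The newest image corresponds to the most recent completed Commit.
        return ("Rollback", newest, [])
    raise ValueError(f"unknown state {state!r}")
```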

The restoration protocol guarantees that:

(1) If failure comes before entering the COMMITTING state, a Rollback is performed and the oldest (pre-PREPARE state) memory image is used. If failure comes while in the COMMITTING state, the Commit operation is finished and the newest (post-PREPARE state) memory image is used. If failure comes after the COMMITTING state has completed, a Rollback is performed and the newest (post-PREPARE state) memory image is used.

(2) All files and databases have also been restored (by use, for example, of the two-phase commit protocol) to their state as of the most recent Commit operation.

Thus, the system's state is restored and execution may proceed normally.

Agents

The description below applies to each agent 24 started by the host program 12. The term "local node" is used to indicate the system on which a particular agent 24 is running.

Each agent 24 performs the actual operations necessary to execute an application. An agent 24 is responsible for performing operations only on the remote system 20 on which it is running. These operations include execution of remote procedure calls, committing and rolling back such operations, and creating and monitoring processes. An agent 24 may be considered to reside "between" an application and the operating system in the sense that an agent 24 controls when and how an application can execute.

Commands are sent by the driver program 14 to an agent 24 via the host program 12 in the form of Remote Procedure Calls (RPCs). In the preferred embodiment, an RPC command consists of a command identifier followed by a series of arguments, all of which are text strings. The agent 24 contains a table mapping RPC command identifiers to "RPC handlers," where the handler is an object enabling the invocation of subroutines to perform the RPC, to commit the RPC, and to rollback the RPC. The agent 24 thus handles the RPC by locating the appropriate RPC handler, then providing that RPC handler with the RPC's arguments. The RPC handler routine parses the argument strings and performs the requested operation. Following this, the RPC handler routine produces a reply string which is sent back to the driver program 14 via the host program 12. Each reply string includes information on the success of the command and any requested return data. In the preferred embodiment, a special RPC is used to start processes, as explained below.
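
The handler table described above might look like the following sketch. The class names, the `"echo"` command, and the `"OK"`/`"ERROR"` reply prefixes are all hypothetical. The specification states only that commands and arguments are text strings and that each reply reports success and any return data.

```python
class RpcHandler:
    """Hypothetical base class: one object per RPC command, exposing the
    perform, commit, and rollback subroutines."""

    def perform(self, args):
        raise NotImplementedError

    def commit(self, entry):
        pass  # null operation unless the RPC changes files/databases

    def rollback(self, entry):
        pass  # null operation unless the RPC changes files/databases


class EchoHandler(RpcHandler):
    """Toy RPC that changes no state and returns its arguments."""

    def perform(self, args):
        return "OK " + " ".join(args)


# Table mapping RPC command identifiers to their handlers.
HANDLERS = {"echo": EchoHandler()}


def dispatch(command_id, args):
    """Locate the handler for command_id and return its reply string."""
    handler = HANDLERS.get(command_id)
    if handler is None:
        return "ERROR unknown command " + command_id
    return handler.perform(args)
```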

When an agent 24 is started, the first RPC command it receives is a Start Agent command. This command notifies the agent 24 of the Job ID for the application, and assigns to the local node a unique "Node ID". The agent 24 then opens a state database called the "Agent SDB". The Agent SDB name is derived from the Job ID and the Node ID, and so is unique throughout the application.

Each agent 24 tracks the phases of the application along with the host program 12. When the host program 12 performs a Prepare or Commit operation, it does so by sending Prepare Node and Commit Node RPC commands to each of its agents 24. In the preferred embodiment, the driver program 14 will only consider the application as a whole to be in the PREPARED state once all agents 24 have successfully responded to their individual Prepare Node commands. Similarly, the application will only consider a phase committed and advance to the next phase when all agents 24 have successfully responded to their Commit Node commands.

Each agent 24 records the current phase and its state in its Agent SDB. The four defined states and the allowed state changes are as in the host program 12, and in the normal case, track those of the host program 12. The current phase state (and the current phase number) can be retrieved by the driver program 14 using a "NodeState" RPC command.

When the driver program 14 invokes the Close function, it issues a Close command to each agent 24. Each agent 24 responds by verifying that the local phase state is RUNNING and that no processes are executing, and then deletes its associated Agent SDB.

In the preferred embodiment, each agent 24 performs RPC operations which are part of the phase and therefore are subject to the commit/rollback transaction architecture. To do this, each agent 24 makes use of its Agent SDB. Specifically, for each phase, an agent 24 creates in its Agent SDB a list called the "CR_LIST", into which an entry is placed for each operation. Each entry carries enough information to undo the operation, in known fashion. The list is preferably ordered so that the operations can be undone in the reverse of the order in which they were performed.

In the preferred embodiment, for uniformity, all RPCs must conform to the following restrictions:

(1) If an RPC changes the state of a file/database, it must save any information which may be required to roll back changes to that file/database, and create an entry in the CR_LIST. This entry must contain the identity of the RPC command being executed, so that the appropriate RPC handler may be located during commit/rollback processing.

(2) Each RPC handler must provide a means of implementing the Prepare, Commit, and Rollback operations (which may be Null operations if the RPC does not make changes to any databases or files).

(3) Each RPC handler may optionally use the SDB of each agent 24 to store any information needed to fulfill these requirements.

(4) No special action is required for RPCs that do not change the state of files/databases.
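
The commit/rollback list and the restrictions above can be sketched as follows. The example is illustrative only: an in-memory dictionary stands in for a file/database, a Python list stands in for the list kept in the Agent SDB, and the `"set"` RPC and its handler are hypothetical.

```python
class SetValueHandler:
    """Hypothetical RPC handler that overwrites one key of a dict 'database'."""

    def perform(self, db, cr_list, key, value):
        # Save what is needed to undo the change, together with the RPC
        # identity so the right handler can be located later.
        cr_list.append(
            {"rpc": "set", "key": key, "had_key": key in db, "old": db.get(key)}
        )
        db[key] = value

    def rollback(self, db, entry):
        if entry["had_key"]:
            db[entry["key"]] = entry["old"]
        else:
            db.pop(entry["key"], None)

    def commit(self, db, entry):
        pass  # nothing further to do: the change is already in place


HANDLERS = {"set": SetValueHandler()}


def rollback_phase(db, cr_list):
    """Undo every operation of the phase, most recent first."""
    for entry in reversed(cr_list):
        HANDLERS[entry["rpc"]].rollback(db, entry)
    cr_list.clear()


def commit_phase(db, cr_list):
    """Make every operation of the phase permanent, in forward order."""
    for entry in cr_list:
        HANDLERS[entry["rpc"]].commit(db, entry)
    cr_list.clear()
```

Reverse order matters whenever two operations touch the same item. If a key is set twice in one phase, undoing the entries in forward order would leave the intermediate value in place rather than the value held at the start of the phase.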

In the preferred embodiment, application programs must obey the following rules:

(1) If a process modifies files/databases, it must provide a means of rolling back changes and of implementing the Prepare/Commit/Rollback operations. Processes under control of an agent 24 also have access to the agent's SDB. For example, the application program may create entries on the CR_LIST. Such entries must contain the identifier for an RPC command which implements the appropriate commit/rollback operations. However, in this case, an RPC did not actually take place, so an identifier for a "dummy RPC" is entered.

(2) Alternatively, the driver program 14 may issue RPCs on behalf of a process which will have the same effect.

The Start Process command to the agent 24 causes the agent 24 to run a specified application program image file, thus starting a "process" on the local node. In the preferred embodiment, the arguments to this command supply:

(1) The executable image file for the program.

(2) The program's argument list.

(3) Any operating system environment information required by the program.

(4) Files or pathnames to be opened for the process for use as its standard input, output, and error channels.

(5) The exit status code with which the program should exit to indicate successful execution.

(6) A debug mode (debugging is described below).
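
As an illustration of how these arguments might be carried and used, the sketch below models a Start Process request as a small record and launches it without waiting for termination. The field names and helper functions are assumptions. Item (4), the redirection of standard input, output, and error, is omitted for brevity. The state names are written with underscores.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StartProcess:
    """Hypothetical record holding the Start Process arguments."""
    image: str                                 # (1) executable image file
    args: list = field(default_factory=list)   # (2) argument list
    env: Optional[dict] = None                 # (3) environment; None inherits
    success_code: int = 0                      # (5) exit status meaning success
    debug_mode: str = "DEBUG_NONE"             # (6) debug mode


def start(request):
    """Launch the process and return immediately, without waiting for it."""
    return subprocess.Popen([request.image, *request.args], env=request.env)


def process_state(request, proc):
    """Classify a launched process using the declared success exit code."""
    code = proc.poll()  # None while the process is still executing
    if code is None:
        return "PS_RUNNING"
    return "PS_EXITED" if code == request.success_code else "PS_ERROR"
```

A caller could, for example, build `StartProcess(image=sys.executable, args=["-c", "pass"])` and poll `process_state` until the result is no longer `"PS_RUNNING"`.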

Each agent 24 maintains a list of all processes under its control. As processes are started, identifiers for those processes are added to this list.

In the preferred embodiment, each agent 24 does not wait for the termination of a process before replying to the driver program 14. Each agent 24 allows the process to execute concurrently with the agent 24, while monitoring execution of the process. At all times, an agent 24 is aware of the "process state" of the process, which is one of PS_RUNNING, PS_ERROR, PS_DEBUG, or PS_EXITED in the preferred embodiment.

The PS_RUNNING state indicates that a program process is executing without known problems. The PS_ERROR state indicates that the process is known to have encountered an unresolvable problem, and has either (1) signaled an error condition (a signal or error trap), (2) exited with an error status indicating failure, or (3) exited in an abnormal manner (e.g., by aborting or by being manually terminated by an operator). The PS_EXITED state indicates that the process has successfully completed execution and has terminated itself normally. The PS_DEBUG state is described under "debugging" below.

In the preferred embodiment, the driver program 14 can interrogate the state of a process using a "ProcState" RPC command. Each agent 24 also maintains an aggregate process state, indicating the state of all processes it has been commanded to start as a whole. This aggregate state is called the "node process state," and is distinct from the node's commit/rollback state (RUNNING, PREPARED, COMMITTING, ERROR). The node process states are the same as the four process states and are defined as follows:

(1) if any process is in the PS_DEBUG state, then the aggregate state is PS_DEBUG, otherwise,

(2) if any process is in the PS_ERROR state, then the aggregate state is PS_ERROR, otherwise,

(3) if any process is in the PS_RUNNING state, then the aggregate state is PS_RUNNING, otherwise,

(4) the aggregate state is PS_EXITED (all processes have exited normally).
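
The priority rule above translates directly into a short function. The function name is hypothetical, the state names are written with underscores, and treating an empty process list as PS_EXITED is an assumption, since the specification does not address that case.

```python
def node_process_state(process_states):
    """Aggregate individual process states into the node process state."""
    # Check the states in priority order: debug first, then error, then running.
    for candidate in ("PS_DEBUG", "PS_ERROR", "PS_RUNNING"):
        if candidate in process_states:
            return candidate
    # No process is debugging, failed, or still running.
    return "PS_EXITED"
```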

Transitions in the node process state affect the node's commit/rollback state. Specifically, if the node process state transitions into the PS_ERROR state, then the node's commit/rollback state will automatically transition to the ERROR state. Additionally, it is only legal to transition from the RUNNING state to the PREPARED state, or from the PREPARED state to the COMMITTING state, if the process state is PS_EXITED.

The aggregate process state can be retrieved by the driver program 14 using the NodeState agent command.

In the preferred embodiment, processes may emit error messages through standard error I/O channels. For example under UNIX, this is the "stderr" I/O file. Such output may be optionally routed to the agent 24 from any process, and is available to the driver program 14 via an "Eread" RPC command.

The driver program 14 is often in a circumstance where it cannot continue execution until all processes started on a node or set of nodes complete execution. To accommodate this circumstance, the agent 24 supports a "Wait" command from the driver program 14. The Wait command causes an agent 24 to delay its reply until the node process state ceases to be the PS_RUNNING state (i.e., the state is PS_DEBUG, PS_ERROR, or PS_EXITED). The reply to the Wait command indicates the processes that caused the Wait condition to terminate. The driver program 14 can also cancel the Wait condition by sending a "Sync" RPC command to the agent 24 during the Wait condition. The Sync command works whether or not an intervening Wait reply is accidentally received (because the Wait reply and the Sync command crossed in the command channel 18).

Process Debugging

From time to time, it is useful for the driver program 14 to allow the user to debug a particular process in the system. Debugging entails running a process under the control of a standard debugger as available under a particular operating system. A user may wish to debug a process from the beginning of its execution. Alternatively, the user may wish to debug a process only if it encounters an error condition, i.e., when it transitions to the PS_ERROR process state.

In the preferred embodiment, when a process is started (using the Start Process command), it can be specified to be run in any of three "debug modes": DEBUG_NONE, DEBUG_START, and DEBUG_TRACE. No debugger will be run on a process in the DEBUG_NONE debug mode. Processes specified with the DEBUG_START mode will be run from the beginning with a debugger attached. Processes specified with the DEBUG_TRACE mode will be monitored by the agent 24, and if they should go into any of various detectable error conditions (which may include error traps, signals, or aborts), they will be stopped and a debugger will be run attached to the failing process.

In the preferred embodiment, an agent 24 does not autonomously start the debugger. Instead, when a process is in need of debugging (as indicated by the debug mode), the agent 24 transitions the process to the PS_DEBUG process state. This causes the aggregate process state to transition to PS_DEBUG. This state is communicated to the driver program 14 (for example, this state will terminate a wait condition). At that time, the driver program 14 can invoke a debugger for the process using a "debug" RPC command. This command specifies a program to be executed, presumably a shell script, to which will be passed sufficient information (via the argument list) to start a debugger of choice.

Recovery

Each agent maintains a local phase number and state which is stored in its Agent SDB. The phase number is kept in sync with that of the host program 12 via the prepare/commit protocol. The phase state is derived principally from the process state, and is used by the driver program 14 to compute the state of the current phase for the application as a whole.

When the driver program 14 is re-invoked after a failure, the driver program 14 tells the host program 12 to start a job. If a "recovery file" is found, then the host program 12 enters "recovery mode," and recovers the state of the agents 24 as follows:

(1) The Host SDB is used to determine the set of agent 24 processes running at the time of the failure.

(2) A new agent 24 is created on each such node.

(3) Each agent 24 is given a "Start Agent" command with the Job ID.

(4) Each agent 24 recognizes this Job ID as an existing application, because the Agent SDB still exists (its name is derived from the Job ID).

(5) Each agent 24 opens its SDB, which is reconstructed from its log, and extracts the current phase number, state, and commit-rollback list.

(6) If the host program 12 is in a state other than COMMITTING, it will then transmit a Rollback command to the agents 24, which causes the agents 24 to undo all operations performed in that phase, in reverse order. If, on the other hand, the host program 12 was in the COMMITTING state, it will re-issue a Commit command to the agents 24. Any agent 24 which finds itself in the COMMITTING state will complete what was evidently an interrupted commit operation by traversing the commit-rollback list in forward order, executing the commit methods of all entries. Any agent 24 which finds itself in the RUNNING state will treat the Commit command as a Null command (since the prior Commit operation evidently had completed on its node but not on some other nodes).

(7) At that point, the agents 24 consider themselves to be at the start of that phase in the RUNNING state, and can proceed to take commands from the driver program 14.
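
Step (6) can be sketched from the point of view of a single agent. The function below returns the ordered list of handler calls the agent would make, rather than performing them. Its name and return shape are hypothetical. Treating an agent found in the PREPARED state the same as one found in COMMITTING when the host re-issues Commit is an assumption, since the text above names only the COMMITTING and RUNNING cases.

```python
def recover_agent(host_state, agent_state, cr_list):
    """Return (actions, resulting_agent_state) for one recovering agent.

    actions is a list of (method_name, entry) pairs in execution order.
    """
    if host_state != "COMMITTING":
        # Host sends Rollback: undo the phase, most recent operation first.
        actions = [("rollback", entry) for entry in reversed(cr_list)]
    elif agent_state in ("PREPARED", "COMMITTING"):
        # Host re-issues Commit: finish the commit in forward order.
        # (The PREPARED case is an assumption; see the lead-in.)
        actions = [("commit", entry) for entry in cr_list]
    else:
        # Agent already RUNNING: its commit completed before the failure,
        # so the re-issued Commit is treated as a null command.
        actions = []
    return actions, "RUNNING"
```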

Summary

FIG. 4 is a flow chart showing in summary form the basic functional operations of the recapitulation embodiment of the present invention. The driver program 14 starts processes on remote systems 20 (Step 40). The host program 12 records all control commands from the driver program 14 (Step 41), as well as all replies to the driver program 14 (Step 42). Each agent 24 executes an application in phases on its respective remote system 20 (Step 43). The applications execute a prepare-commit protocol to store system and file states while maintaining system consistency (Step 44). If a failure occurs, the state of the system is restored and the driver program 14 restarts, issuing commands to the host program 12 (Step 45). The host program 12 reads matching replies for each command and sends the replies to the driver program 14, in a recapitulation mode, until done (Step 46). The driver program 14 then continues controlling the application processes from the last good checkpoint (Step 47).

FIG. 5 is a flow chart showing in summary form the basic functional operations of the restoration embodiment of the present invention. The driver program 14 starts processes on remote systems 20 (Step 50). Each agent 24 executes an application in phases on its respective remote system 20 (Step 51). The applications execute the prepare part of a commit protocol to store system and file states (Step 52). A memory image of the driver program 14 is stored after the prepare protocol is done (Step 53). The applications execute a commit protocol to complete saving of the system and file states while maintaining system consistency (Step 54). If a failure occurs, the state of the system is restored and the stored memory image of the driver program 14 is reloaded into memory (Step 55). The driver program 14 then continues controlling the application processes from the last good checkpoint (Step 56).

A number of embodiments of the present invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, the invention could be applied to single-CPU systems. Further, although a two-phase commit protocol is preferred, other commit protocols that safely save system state while maintaining system consistency may be used. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiment, but only by the scope of the appended claims.

Claims (12)

What is claimed is:
1. A method for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, comprising the steps of:
(a) executing an application in distinct execution phases on a parallel processing system;
(b) controlling processing of each execution phase of the application by issuing commands and replies to such commands;
(c) recording all such commands and replies to such commands;
(d) saving the end-state of each successfully completed execution phase;
(e) detecting failure of the application in any of such execution phases;
(f) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) recapitulating all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restarting the application at the beginning of the execution phase in which failure was detected.
2. A computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, the computer program being stored on a media readable by a computer system, for configuring the computer system upon being read and executed by the computer system to:
(a) execute an application in distinct execution phases on a parallel processing system;
(b) control processing of each execution phase of the application by issuing commands and replies to such commands;
(c) record all such commands and replies to such commands;
(d) save the end-state of each successfully completed execution phase;
(e) detect failure of the application in any of such execution phases;
(f) restore the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) recapitulate all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restart the application at the beginning of the execution phase in which failure was detected.
3. A computer-readable storage medium, configured with a computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions of:
(a) executing an application in distinct execution phases on a parallel processing system;
(b) controlling processing of each execution phase of the application by issuing commands and replies to such commands;
(c) recording all such commands and replies to such commands;
(d) saving the end-state of each successfully completed execution phase;
(e) detecting failure of the application in any of such execution phases;
(f) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) recapitulating all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restarting the application at the beginning of the execution phase in which failure was detected.
4. A method for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, comprising the steps of:
(a) dividing an application into distinct execution phases;
(b) starting execution of the application on a parallel processing system;
(c) controlling processing of each execution phase of the application by issuing commands and replies to such commands;
(d) recording all such commands and replies to such commands;
(e) saving the end-state of each successfully completed execution phase by a two-phase commit protocol;
(f) detecting failure of the application in any of such execution phases;
(g) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) recapitulating all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(i) restarting the application at the beginning of the execution phase in which failure was detected.
5. A computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, the computer program being stored on a media readable by a computer system, for configuring the computer system upon being read and executed by the computer system to:
(a) divide an application into distinct execution phases;
(b) start execution of the application on a parallel processing system;
(c) control processing of each execution phase of the application by issuing commands and replies to such commands;
(d) record all such commands and replies to such commands;
(e) save the end-state of each successfully completed execution phase by a two-phase commit protocol;
(f) detect failure of the application in any of such execution phases;
(g) restore the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) recapitulate all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(i) restart the application at the beginning of the execution phase in which failure was detected.
6. A computer-readable storage medium, configured with a computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions of:
(a) dividing an application into distinct execution phases;
(b) starting execution of the application on a parallel processing system;
(c) controlling processing of each execution phase of the application by issuing commands and replies to such commands;
(d) recording all such commands and replies to such commands;
(e) saving the end-state of each successfully completed execution phase by a two-phase commit protocol;
(f) detecting failure of the application in any of such execution phases;
(g) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) recapitulating all recorded commands and replies to such commands from the beginning of execution of the application up through the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(i) restarting the application at the beginning of the execution phase in which failure was detected.
7. A method for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, comprising the steps of:
(a) executing an application in distinct execution phases on a parallel processing system;
(b) controlling processing of each execution phase of the application by a driver program;
(c) saving the end-state of each successfully completed execution phase;
(d) saving, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(e) detecting failure of the application in any of such execution phases;
(f) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) restoring the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(h) restarting the application at the beginning of the execution phase in which failure was detected.
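Claim 7 replaces the command log with a direct save of the driver program's data structures. A minimal sketch of that idea follows, assuming a hypothetical `Driver` class whose only state is a phase counter and a results table, and using Python's standard `pickle` module as a stand-in for whatever storage a real system would use. The saving and restoring of the application's own end-state (steps (c) and (f)) is omitted here for brevity.

```python
import pickle


class Driver:
    """Hypothetical driver program with a small amount of state."""

    def __init__(self):
        self.next_phase = 0
        self.results = {}

    def snapshot(self):
        # Step (d): serialize the data structures needed to recreate the driver.
        return pickle.dumps({"next_phase": self.next_phase, "results": self.results})

    def restore(self, blob):
        # Step (g): reload the saved data structures into memory.
        saved = pickle.loads(blob)
        self.next_phase = saved["next_phase"]
        self.results = saved["results"]


def run(driver, phases, fail_in=None):
    saved = driver.snapshot()
    failed = False
    while driver.next_phase < len(phases):
        i = driver.next_phase
        driver.results[i] = "partial"      # in-progress state that must not survive a failure
        if i == fail_in and not failed:    # step (e): simulated failure, injected once
            failed = True
            driver.restore(saved)
            continue                       # step (h): restart the failed phase
        driver.results[i] = phases[i]()
        driver.next_phase = i + 1
        saved = driver.snapshot()          # taken at the end of each completed phase
    return driver.results
```

Reloading the snapshot discards the `"partial"` marker written during the failed phase, so the driver resumes from exactly the state it held at the end of the prior phase.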
8. A computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, the computer program being stored on a media readable by a computer system, for configuring the computer system upon being read and executed by the computer system to:
(a) execute an application in distinct execution phases on a parallel processing system;
(b) control processing of each execution phase of the application by a driver program;
(c) save the end-state of each successfully completed execution phase;
(d) save, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(e) detect failure of the application in any of such execution phases;
(f) restore the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) restore the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(h) restart the application at the beginning of the execution phase in which failure was detected.
9. A computer-readable storage medium, configured with a computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions of:
(a) executing an application in distinct execution phases on a parallel processing system;
(b) controlling processing of each execution phase of the application by a driver program;
(c) saving the end-state of each successfully completed execution phase;
(d) saving, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(e) detecting failure of the application in any of such execution phases;
(f) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(g) restoring the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(h) restarting the application at the beginning of the execution phase in which failure was detected.
10. A method for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, comprising the steps of:
(a) dividing an application into distinct execution phases;
(b) starting execution of the application on a parallel processing system;
(c) controlling processing of each execution phase of the application by a driver program;
(d) saving the end-state of each successfully completed execution phase by a two-phase commit protocol;
(e) saving, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(f) detecting failure of the application in any of such execution phases;
(g) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restoring the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(i) restarting the application at the beginning of the execution phase in which failure was detected.
11. A computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, the computer program being stored on a media readable by a computer system, for configuring the computer system upon being read and executed by the computer system to:
(a) divide an application into distinct execution phases;
(b) start execution of the application on a parallel processing system;
(c) control processing of each execution phase of the application by a driver program;
(d) save the end-state of each successfully completed execution phase by a two-phase commit protocol;
(e) save, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(f) detect failure of the application in any of such execution phases;
(g) restore the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restore the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(i) restart the application at the beginning of the execution phase in which failure was detected.
12. A computer-readable storage medium, configured with a computer program for executing a computer application on a parallel processing system, where such application does not have pre-programmed checkpointing capability, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions of:
(a) dividing an application into distinct execution phases;
(b) starting execution of the application on a parallel processing system;
(c) controlling processing of each execution phase of the application by a driver program;
(d) saving the end-state of each successfully completed execution phase by a two-phase commit protocol;
(e) saving, at the end of each successfully completed execution phase, at least those data structures of the driver program necessary to recreate a saved state for the driver program;
(f) detecting failure of the application in any of such execution phases;
(g) restoring the last saved end-state of the execution phase prior to the execution phase in which failure was detected;
(h) restoring the driver program by reloading into memory the saved data structures of the driver program up through the end of the execution phase prior to the execution phase in which failure was detected;
(i) restarting the application at the beginning of the execution phase in which failure was detected.
US08570724 1995-12-11 1995-12-11 Methods and systems for reconstructing the state of a computation Expired - Lifetime US5712971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US08570724 US5712971A (en) 1995-12-11 1995-12-11 Methods and systems for reconstructing the state of a computation

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US08570724 US5712971A (en) 1995-12-11 1995-12-11 Methods and systems for reconstructing the state of a computation
DK96943730T DK0954779T3 (en) 1995-12-11 1996-12-11 Process for the reconstruction of a calculation mode
JP52221797A JP3573463B2 (en) 1995-12-11 1996-12-11 How to reconstruct the state of the computing and system
ES96943730T ES2320601T3 (en) 1995-12-11 1996-12-11 Method to reconstruct the state of a calculation.
DE1996637836 DE69637836D1 (en) 1995-12-11 1996-12-11 Ustands
PCT/US1996/019836 WO1997022052A1 (en) 1995-12-11 1996-12-11 Methods and systems for reconstructing the state of a computation
EP19960943730 EP0954779B8 (en) 1995-12-11 1996-12-11 Method for reconstructing the state of a computation
CA 2240347 CA2240347C (en) 1995-12-11 1996-12-11 Methods and systems for reconstructing the state of a computation
JP2003349169A JP3675802B2 (en) 1995-12-11 2003-10-08 How to reconstruct the state of the computing and system

Publications (1)

Publication Number Publication Date
US5712971A true US5712971A (en) 1998-01-27

Family

ID=24280795

Family Applications (1)

Application Number Title Priority Date Filing Date
US08570724 Expired - Lifetime US5712971A (en) 1995-12-11 1995-12-11 Methods and systems for reconstructing the state of a computation

Country Status (8)

Country Link
US (1) US5712971A (en)
EP (1) EP0954779B8 (en)
JP (2) JP3573463B2 (en)
CA (1) CA2240347C (en)
DE (1) DE69637836D1 (en)
DK (1) DK0954779T3 (en)
ES (1) ES2320601T3 (en)
WO (1) WO1997022052A1 (en)

Cited By (122)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999006925A1 (en) * 1997-07-31 1999-02-11 Data Net Corporation Method and apparatus for implementing software connectivity for client/server applications
US5923833A (en) * 1996-03-19 1999-07-13 International Business Machines Corporation Restart and recovery of OMG-compliant transaction systems
US5931954A (en) * 1996-01-31 1999-08-03 Kabushiki Kaisha Toshiba I/O control apparatus having check recovery function
US6009258A (en) * 1997-09-26 1999-12-28 Symantec Corporation Methods and devices for unwinding stack of frozen program and for restarting the program from unwound state
US6029177A (en) * 1997-11-13 2000-02-22 Electronic Data Systems Corporation Method and system for maintaining the integrity of a database providing persistent storage for objects
US6151569A (en) * 1997-09-26 2000-11-21 Symantec Corporation Automated sequence of machine-performed attempts to unfreeze an apparently frozen application program
US6175932B1 (en) * 1998-04-20 2001-01-16 National Instruments Corporation System and method for providing state capture and restoration to an I/O system
US6226759B1 (en) 1998-09-28 2001-05-01 International Business Machines Corporation Method and apparatus for immediate data backup by duplicating pointers and freezing pointer/data counterparts
WO2001042920A1 (en) 1999-12-06 2001-06-14 Ab Initio Software Corporation Continuous flow checkpointing data processing
US6256751B1 (en) * 1998-10-29 2001-07-03 International Business Machines Corporation Restoring checkpointed processes without restoring attributes of external data referenced by the processes
US6324567B2 (en) * 1997-06-11 2001-11-27 Oracle Corporation Method and apparatus for providing multiple commands to a server
US20010046144A1 (en) * 2000-05-29 2001-11-29 Omron Corporation Power supply module and power supply unit using the same
US6338147B1 (en) 1998-10-29 2002-01-08 International Business Machines Corporation Program products for performing checkpoint/restart of a parallel program
US6393583B1 (en) 1998-10-29 2002-05-21 International Business Machines Corporation Method of performing checkpoint/restart of a parallel program
US6397351B1 (en) 1998-09-28 2002-05-28 International Business Machines Corporation Method and apparatus for rapid data restoration including on-demand output of sorted logged changes
US6401216B1 (en) 1998-10-29 2002-06-04 International Business Machines Corporation System of performing checkpoint/restart of a parallel program
US6415286B1 (en) 1996-03-25 2002-07-02 Torrent Systems, Inc. Computer system and computerized method for partitioning data for parallel processing
US20020147938A1 (en) * 2001-04-05 2002-10-10 International Business Machines Corporation System and method for collecting and restoring user environment data using removable storage
US6477663B1 (en) * 1998-04-09 2002-11-05 Compaq Computer Corporation Method and apparatus for providing process pair protection for complex applications
US20030018573A1 (en) * 2001-06-28 2003-01-23 Andrew Comas System and method for characterizing and selecting technology transition options
US20030078910A1 (en) * 1999-09-29 2003-04-24 Kabushiki Kaisha Toshiba Transaction processing system using efficient file update processing and recovery processing
US20030088812A1 (en) * 2001-11-08 2003-05-08 M-Systems Flash Disk Pioneers Ltd. Ruggedized block device driver
US20030120731A1 (en) * 2000-02-22 2003-06-26 Amir Weinberg Cooperative software application architecture
US20030126163A1 (en) * 2001-12-28 2003-07-03 Hong-Yeon Kim Method for file deletion and recovery against system failures in database management system
US20030145253A1 (en) * 2002-01-18 2003-07-31 De Bonet Jeremy S. Method and system for isolating and protecting software components
US20030177324A1 (en) * 2002-03-14 2003-09-18 International Business Machines Corporation Method, system, and program for maintaining backup copies of files in a backup storage device
US6630946B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods for automatically locating data-containing windows in frozen applications program and saving contents
US6631480B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods and systems for protecting data from potential corruption by a crashed computer program
US6662310B2 (en) 1999-11-10 2003-12-09 Symantec Corporation Methods for automatically locating url-containing or other data-containing windows in frozen browser or other application program, saving contents, and relaunching application program with link to saved data
US6678701B1 (en) 2000-01-05 2004-01-13 International Business Machines Corporation Technique for establishing a point of consistency in a parallel database loading system
WO2004008320A1 (en) * 2002-07-12 2004-01-22 Crossroads Systems, Inc. Mechanism for enabling enhanced fibre channel error recovery across redundant paths using scsi level commands
US20040078659A1 (en) * 2002-05-16 2004-04-22 International Business Machines Corporation Method, apparatus and computer program for reducing the amount of data checkpointed
US20040083158A1 (en) * 2002-10-09 2004-04-29 Mark Addison Systems and methods for distributing pricing data for complex derivative securities
US20040088278A1 (en) * 2002-10-30 2004-05-06 Jp Morgan Chase Method to measure stored procedure execution statistics
US20040107183A1 (en) * 2002-12-03 2004-06-03 Jp Morgan Chase Bank Method for simplifying databinding in application programs
US20040109200A1 (en) * 1998-03-27 2004-06-10 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US20040153535A1 (en) * 2003-02-03 2004-08-05 Chau Tony Ka Wai Method for software suspension in a networked computer system
US20040167878A1 (en) * 2003-02-24 2004-08-26 Andrew Doddington Systems, methods, and software for preventing redundant processing of transmissions sent to a remote host computer
US20040205377A1 (en) * 2003-03-28 2004-10-14 Nec Corporation Fault tolerant multi-node computing system for parallel-running a program under different environments
US20040215725A1 (en) * 2003-03-31 2004-10-28 Lorraine Love System and method for multi-platform queue queries
US20040230587A1 (en) * 2003-05-15 2004-11-18 Andrew Doddington System and method for specifying application services and distributing them across multiple processors using XML
US20040230602A1 (en) * 2003-05-14 2004-11-18 Andrew Doddington System and method for decoupling data presentation layer and data gathering and storage layer in a distributed data processing system
US20040254824A1 (en) * 2003-01-07 2004-12-16 Alex Loucaides System and method for process scheduling
US20050030555A1 (en) * 2003-05-16 2005-02-10 Phenix John Kevin Job processing framework
US20050131966A1 (en) * 2003-12-15 2005-06-16 Sbc Knowledge Ventures, L.P. Architecture of database application with robust online recoverability
US20050144174A1 (en) * 2003-12-31 2005-06-30 Leonid Pesenson Framework for providing remote processing of a graphical user interface
US20050172288A1 (en) * 2004-01-30 2005-08-04 Pratima Ahuja Method, system, and program for system recovery
US20050172054A1 (en) * 2004-01-30 2005-08-04 Ramani Mathrubutham Method, system, and program for buffering work requests
US20050171789A1 (en) * 2004-01-30 2005-08-04 Ramani Mathrubutham Method, system, and program for facilitating flow control
US20050204029A1 (en) * 2004-03-09 2005-09-15 John Connolly User connectivity process management system
US20050222990A1 (en) * 2004-04-06 2005-10-06 Milne Kenneth T Methods and systems for using script files to obtain, format and disseminate database information
US20060031586A1 (en) * 2004-04-26 2006-02-09 Jp Morgan Chase Bank System and method for routing messages
US7003770B1 (en) 1998-12-16 2006-02-21 Kent Ridge Digital Labs Method of detaching and re-attaching components of a computing process
US20060085492A1 (en) * 2004-10-14 2006-04-20 Singh Arun K System and method for modifying process navigation
US7047232B1 (en) * 1999-01-13 2006-05-16 Ab Initio Software Corporation Parallelizing applications of script-driven tools
US20060167950A1 (en) * 2005-01-21 2006-07-27 Vertes Marc P Method for the management, logging or replay of the execution of an application process
US7085759B2 (en) 2002-12-06 2006-08-01 Jpmorgan Chase Bank System and method for communicating data to a process
US20060236152A1 (en) * 2005-04-14 2006-10-19 International Business Machines Corporation Method and apparatus for template based parallel checkpointing
US20070018823A1 (en) * 2005-05-30 2007-01-25 Semiconductor Energy Laboratory Co., Ltd. Semiconductor device and driving method thereof
US20070162785A1 (en) * 2006-01-12 2007-07-12 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US20070162779A1 (en) * 2006-01-12 2007-07-12 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US20070180530A1 (en) * 2005-08-31 2007-08-02 Microsoft Corporation Unwanted file modification and transactions
US20070179923A1 (en) * 2002-10-10 2007-08-02 Stanfill Craig W Transactional Graph-Based Computation
US20070220298A1 (en) * 2006-03-20 2007-09-20 Gross Kenny C Method and apparatus for providing fault-tolerance in parallel-processing systems
US20070294056A1 (en) * 2006-06-16 2007-12-20 Jpmorgan Chase Bank, N.A. Method and system for monitoring non-occurring events
US20080071781A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Inexact pattern searching using bitmap contained in a bitcheck command
US20080071780A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Search Circuit having individually selectable search engines
US20080071765A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Regular expression searching of packet contents using dedicated search circuits
US20080071757A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Search engine having multiple co-processors for performing inexact pattern search operations
US20080071779A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Method and apparatus for managing multiple data flows in a content search system
KR100833681B1 (en) 2005-12-08 2008-05-29 한국전자통신연구원 A system and method for a memory management of parallel process using commit protocol
US7386752B1 (en) * 2004-06-30 2008-06-10 Symantec Operating Corporation Using asset dependencies to identify the recovery set and optionally automate and/or optimize the recovery
US7392471B1 (en) 2004-07-28 2008-06-24 Jp Morgan Chase Bank System and method for comparing extensible markup language (XML) documents
US20080212581A1 (en) * 2005-10-11 2008-09-04 Integrated Device Technology, Inc. Switching Circuit Implementing Variable String Matching
US7440304B1 (en) 2003-11-03 2008-10-21 Netlogic Microsystems, Inc. Multiple string searching using ternary content addressable memory
US20080294937A1 (en) * 2007-05-25 2008-11-27 Fujitsu Limited Distributed processing method
US7499933B1 (en) 2005-11-12 2009-03-03 Jpmorgan Chase Bank, N.A. System and method for managing enterprise application configuration
WO2009134264A1 (en) * 2008-05-01 2009-11-05 Hewlett-Packard Development Company, L.P. Storing checkpoint data in non-volatile memory
US20090282042A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for managing the development of data integration projects to facilitate project development and analysis thereof
US7636717B1 (en) 2007-01-18 2009-12-22 Netlogic Microsystems, Inc. Method and apparatus for optimizing string search operations
US7643353B1 (en) 2007-10-25 2010-01-05 Netlogic Microsystems, Inc. Content addressable memory having programmable interconnect structure
US7665127B1 (en) 2004-06-30 2010-02-16 Jp Morgan Chase Bank System and method for providing access to protected services
US20100115334A1 (en) * 2008-11-05 2010-05-06 Mark Allen Malleck Lightweight application-level runtime state save-and-restore utility
US20100211953A1 (en) * 2009-02-13 2010-08-19 Ab Initio Technology Llc Managing task execution
US7783654B1 (en) 2006-09-19 2010-08-24 Netlogic Microsystems, Inc. Multiple string searching using content addressable memory
US20100293407A1 (en) * 2007-01-26 2010-11-18 The Trustees Of Columbia University In The City Of Systems, Methods, and Media for Recovering an Application from a Fault or Attack
US7895565B1 (en) 2006-03-15 2011-02-22 Jp Morgan Chase Bank, N.A. Integrated system and method for validating the functionality and performance of software applications
US7913249B1 (en) 2006-03-07 2011-03-22 Jpmorgan Chase Bank, N.A. Software installation checker
US7916510B1 (en) 2009-08-10 2011-03-29 Netlogic Microsystems, Inc. Reformulating regular expressions into architecture-dependent bit groups
US20110078500A1 (en) * 2009-09-25 2011-03-31 Ab Initio Software Llc Processing transactions in graph-based applications
US7924589B1 (en) 2008-06-03 2011-04-12 Netlogic Microsystems, Inc. Row redundancy for content addressable memory having programmable interconnect structure
US7924590B1 (en) 2009-08-10 2011-04-12 Netlogic Microsystems, Inc. Compiling regular expressions for programmable content addressable memory devices
US20110093433A1 (en) * 2005-06-27 2011-04-21 Ab Initio Technology Llc Managing metadata for graph-based computations
US20110179014A1 (en) * 2010-01-15 2011-07-21 Ian Schechter Managing data queries
US8069183B2 (en) * 2007-02-24 2011-11-29 Trend Micro Incorporated Fast identification of complex strings in a data stream
CN102263652A (en) * 2010-05-31 2011-11-30 鸿富锦精密工业(深圳)有限公司 And a network device parameter setting method which changes
US20120054261A1 (en) * 2010-08-25 2012-03-01 Autodesk, Inc. Dual modeling environment
US8181016B1 (en) 2005-12-01 2012-05-15 Jpmorgan Chase Bank, N.A. Applications access re-certification system
WO2012112763A1 (en) 2011-02-18 2012-08-23 Ab Initio Technology Llc Restarting data processing systems
WO2012112748A1 (en) 2011-02-18 2012-08-23 Ab Initio Technology Llc Restarting processes
US8527488B1 (en) 2010-07-08 2013-09-03 Netlogic Microsystems, Inc. Negative regular expression search operations
US8572516B1 (en) 2005-08-24 2013-10-29 Jpmorgan Chase Bank, N.A. System and method for controlling a screen saver
US8572236B2 (en) 2006-08-10 2013-10-29 Ab Initio Technology Llc Distributing services in graph-based computations
US20130290772A1 (en) * 2012-04-30 2013-10-31 Curtis C. Ballard Sequence indicator for command communicated to a sequential access storage device
US8689185B1 (en) * 2004-01-27 2014-04-01 United Services Automobile Association (Usaa) System and method for processing electronic data
US8706667B2 (en) 2007-07-26 2014-04-22 Ab Initio Technology Llc Transactional graph-based computation with error handling
US8862603B1 (en) 2010-11-03 2014-10-14 Netlogic Microsystems, Inc. Minimizing state lists for non-deterministic finite state automatons
US8875145B2 (en) 2010-06-15 2014-10-28 Ab Initio Technology Llc Dynamically loading graph-based computations
US9088459B1 (en) 2013-02-22 2015-07-21 Jpmorgan Chase Bank, N.A. Breadth-first resource allocation system and methods
US20150213077A1 (en) * 2012-10-09 2015-07-30 Huawei Technologies Co., Ltd. Method and system for causing a web application to obtain a database change
US9116955B2 (en) 2011-05-02 2015-08-25 Ab Initio Technology Llc Managing data queries
US9274926B2 (en) 2013-01-03 2016-03-01 Ab Initio Technology Llc Configurable testing of computer programs
US9317467B2 (en) 2012-09-27 2016-04-19 Hewlett Packard Enterprise Development Lp Session key associated with communication path
US9354981B2 (en) 2013-10-21 2016-05-31 Ab Initio Technology Llc Checkpointing a collection of data units
US9507682B2 (en) 2012-11-16 2016-11-29 Ab Initio Technology Llc Dynamic graph performance monitoring
US9542259B1 (en) 2013-12-23 2017-01-10 Jpmorgan Chase Bank, N.A. Automated incident resolution system and method
US9619410B1 (en) 2013-10-03 2017-04-11 Jpmorgan Chase Bank, N.A. Systems and methods for packet switching
US9720655B1 (en) 2013-02-01 2017-08-01 Jpmorgan Chase Bank, N.A. User interface event orchestration
US9734222B1 (en) 2004-04-06 2017-08-15 Jpmorgan Chase Bank, N.A. Methods and systems for using script files to obtain, format and transport data
US9868054B1 (en) 2014-02-10 2018-01-16 Jpmorgan Chase Bank, N.A. Dynamic game deployment
US9886241B2 (en) 2013-12-05 2018-02-06 Ab Initio Technology Llc Managing interfaces for sub-graphs
US9891901B2 (en) 2013-12-06 2018-02-13 Ab Initio Technology Llc Source code translation

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
EP1690163A4 (en) * 2003-11-17 2011-07-13 Virginia Tech Intell Prop Transparent checkpointing and process migration in a distributed system
US20070276879A1 (en) * 2006-05-26 2007-11-29 Rothman Michael A Sparse checkpoint and rollback
EP1873643A1 (en) * 2006-06-30 2008-01-02 Alcatel Lucent Service objects with rollback-recovery
US8776018B2 (en) 2008-01-11 2014-07-08 International Business Machines Corporation System and method for restartable provisioning of software components
JP2010165251A (en) 2009-01-16 2010-07-29 Toshiba Corp Information processing device, processor, and information processing method

Citations (8)

Publication number Priority date Publication date Assignee Title
US4703481A (en) * 1985-08-16 1987-10-27 Hewlett-Packard Company Method and apparatus for fault recovery within a computing system
US4823256A (en) * 1984-06-22 1989-04-18 American Telephone And Telegraph Company, At&T Bell Laboratories Reconfigurable dual processor system
US5313584A (en) * 1991-11-25 1994-05-17 Unisys Corporation Multiple I/O processor system
US5435003A (en) * 1993-10-07 1995-07-18 British Telecommunications Public Limited Company Restoration in communications networks
US5440726A (en) * 1994-06-22 1995-08-08 At&T Corp. Progressive retry method and apparatus having reusable software modules for software failure recovery in multi-process message-passing applications
US5499342A (en) * 1987-11-20 1996-03-12 Hitachi, Ltd. System for dynamically switching logical sessions between terminal device and a processor which stops its operation to another working processor under control of communication control processor
US5530802A (en) * 1994-06-22 1996-06-25 At&T Corp. Input sequence reordering method for software failure recovery
US5590277A (en) * 1994-06-22 1996-12-31 Lucent Technologies Inc. Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US4665520A (en) * 1985-02-01 1987-05-12 International Business Machines Corporation Optimistic recovery in a distributed processing system
US5335343A (en) * 1992-07-06 1994-08-02 Digital Equipment Corporation Distributed transaction processing using two-phase commit protocol with presumed-commit without log force

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
Apers, Peter M.G., et al., "PRISMA/DB: A Parallel, Main Memory Relational DBMS", Dec. 1992, IEEE Transactions on Knowledge and Data Engineering, vol. 4, No. 6, pp. 541-554. *
Boral, Haran, et al., "Prototyping Bubba, A Highly Parallel Database System", Mar. 1990, IEEE Transactions on Knowledge and Data Engineering, vol. 2, No. 1, pp. 4-23. *
Casas, et al., Mist:PVM with Transparent Migration and Checkpointing, Oregon Graduate Institute of Science & Technology, pp. 1-13, May 1995. *
DeWitt, David J., et al., "The Gamma Database Machine Project", Mar. 1990, IEEE Transactions on Knowledge and Data Engineering, vol. 2, No. 1, pp. 44-62. *
Frieder, Ophir and Chaitanya K. Baru, "Site and Query Scheduling Policies in Multicomputer Database Systems", Aug. 1994, IEEE Transactions on Knowledge and Data Engineering, vol. 6, No. 4, pp. 609-619. *
Goetz Graefe, Query Evaluation Techniques for Large Database, Portland State University, Computer Science Department, pp. 88-94. *
Graefe, Goetz and Diane L. Davison, "Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution", Aug. 1993, IEEE Transactions on Software Engineering, vol. 19, No. 8, pp. 749-764. *
Graefe, Goetz, "Volcano--An Extensible and Parallel Query Evaluation System", Feb. 1994, IEEE Transactions on Knowledge and Data Engineering, vol. 6, No. 1, pp. 120-135. *
IBM, Database 2 AIX/6000 Programming Reference manual, 1993, pp. 282-283. *
Lutifiyya and Cowan, Department of Computer Science, University of Western Ontario, pp. 1-18, Feb. 20, 1995. *

Cited By (238)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5931954A (en) * 1996-01-31 1999-08-03 Kabushiki Kaisha Toshiba I/O control apparatus having check recovery function
US5923833A (en) * 1996-03-19 1999-07-13 International Business Machines Corporation Restart and recovery of OMG-compliant transaction systems
US6415286B1 (en) 1996-03-25 2002-07-02 Torrent Systems, Inc. Computer system and computerized method for partitioning data for parallel processing
US6324567B2 (en) * 1997-06-11 2001-11-27 Oracle Corporation Method and apparatus for providing multiple commands to a server
WO1999006925A1 (en) * 1997-07-31 1999-02-11 Data Net Corporation Method and apparatus for implementing software connectivity for client/server applications
US6009258A (en) * 1997-09-26 1999-12-28 Symantec Corporation Methods and devices for unwinding stack of frozen program and for restarting the program from unwound state
US6151569A (en) * 1997-09-26 2000-11-21 Symantec Corporation Automated sequence of machine-performed attempts to unfreeze an apparently frozen application program
US6029177A (en) * 1997-11-13 2000-02-22 Electronic Data Systems Corporation Method and system for maintaining the integrity of a database providing persistent storage for objects
US7973963B2 (en) 1998-03-27 2011-07-05 Canon Kabushiki Kaisha Image forming apparatus, method of controlling image forming apparatus, and memory medium for storing computer program for executing method, with function program providing API
US7633643B2 (en) 1998-03-27 2009-12-15 Canon Kabushiki Kaisha Image processing apparatus, control method thereof, and storage medium storing control program, with interpreter for program objects represented in byte codes, and with application programming interface function programs used commonly by program objects
US20040109200A1 (en) * 1998-03-27 2004-06-10 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US20070247671A1 (en) * 1998-03-27 2007-10-25 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US8300253B2 (en) 1998-03-27 2012-10-30 Canon Kabushiki Kaisha Image forming apparatus, method of controlling image forming apparatus, and memory medium for storing computer program for executing method, with interpreter for control programs that are provided for execution on OS-independent platform
US20090316207A1 (en) * 1998-03-27 2009-12-24 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US20100309521A1 (en) * 1998-03-27 2010-12-09 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US7259883B2 (en) * 1998-03-27 2007-08-21 Canon Kabushiki Kaisha Image processing apparatus, control method of image processing apparatus, and storage medium storing therein control program for image processing apparatus
US6622261B1 (en) 1998-04-09 2003-09-16 Compaq Information Technologies Group, L.P. Process pair protection for complex applications
US6477663B1 (en) * 1998-04-09 2002-11-05 Compaq Computer Corporation Method and apparatus for providing process pair protection for complex applications
US6175932B1 (en) * 1998-04-20 2001-01-16 National Instruments Corporation System and method for providing state capture and restoration to an I/O system
US6226759B1 (en) 1998-09-28 2001-05-01 International Business Machines Corporation Method and apparatus for immediate data backup by duplicating pointers and freezing pointer/data counterparts
US6397351B1 (en) 1998-09-28 2002-05-28 International Business Machines Corporation Method and apparatus for rapid data restoration including on-demand output of sorted logged changes
US6393583B1 (en) 1998-10-29 2002-05-21 International Business Machines Corporation Method of performing checkpoint/restart of a parallel program
US6256751B1 (en) * 1998-10-29 2001-07-03 International Business Machines Corporation Restoring checkpointed processes without restoring attributes of external data referenced by the processes
US6401216B1 (en) 1998-10-29 2002-06-04 International Business Machines Corporation System of performing checkpoint/restart of a parallel program
US6338147B1 (en) 1998-10-29 2002-01-08 International Business Machines Corporation Program products for performing checkpoint/restart of a parallel program
US7003770B1 (en) 1998-12-16 2006-02-21 Kent Ridge Digital Labs Method of detaching and re-attaching components of a computing process
US7047232B1 (en) * 1999-01-13 2006-05-16 Ab Initio Software Corporation Parallelizing applications of script-driven tools
US6834275B2 (en) * 1999-09-29 2004-12-21 Kabushiki Kaisha Toshiba Transaction processing system using efficient file update processing and recovery processing
US20030078910A1 (en) * 1999-09-29 2003-04-24 Kabushiki Kaisha Toshiba Transaction processing system using efficient file update processing and recovery processing
US6630946B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods for automatically locating data-containing windows in frozen applications program and saving contents
US6631480B2 (en) 1999-11-10 2003-10-07 Symantec Corporation Methods and systems for protecting data from potential corruption by a crashed computer program
US6662310B2 (en) 1999-11-10 2003-12-09 Symantec Corporation Methods for automatically locating url-containing or other data-containing windows in frozen browser or other application program, saving contents, and relaunching application program with link to saved data
WO2001042920A1 (en) 1999-12-06 2001-06-14 Ab Initio Software Corporation Continuous flow checkpointing data processing
US6584581B1 (en) * 1999-12-06 2003-06-24 Ab Initio Software Corporation Continuous flow checkpointing data processing
US6678701B1 (en) 2000-01-05 2004-01-13 International Business Machines Corporation Technique for establishing a point of consistency in a parallel database loading system
US20030120731A1 (en) * 2000-02-22 2003-06-26 Amir Weinberg Cooperative software application architecture
US6981247B2 (en) * 2000-02-22 2005-12-27 Orsus Solutions Limited Cooperative software application architecture
US20010046144A1 (en) * 2000-05-29 2001-11-29 Omron Corporation Power supply module and power supply unit using the same
US6944790B2 (en) 2001-04-05 2005-09-13 International Business Machines Corporation System and method for collecting and restoring user environment data using removable storage
US20020147938A1 (en) * 2001-04-05 2002-10-10 International Business Machines Corporation System and method for collecting and restoring user environment data using removable storage
US8234156B2 (en) 2001-06-28 2012-07-31 Jpmorgan Chase Bank, N.A. System and method for characterizing and selecting technology transition options
US20030018573A1 (en) * 2001-06-28 2003-01-23 Andrew Comas System and method for characterizing and selecting technology transition options
US20030088812A1 (en) * 2001-11-08 2003-05-08 M-Systems Flash Disk Pioneers Ltd. Ruggedized block device driver
US6883114B2 (en) * 2001-11-08 2005-04-19 M-Systems Flash Disk Pioneers Ltd. Block device driver enabling a ruggedized file system
US6944635B2 (en) * 2001-12-28 2005-09-13 Electronics And Telecommunications Research Institute Method for file deletion and recovery against system failures in database management system
US20030126163A1 (en) * 2001-12-28 2003-07-03 Hong-Yeon Kim Method for file deletion and recovery against system failures in database management system
US7168008B2 (en) * 2002-01-18 2007-01-23 Mobitv, Inc. Method and system for isolating and protecting software components
US20030145253A1 (en) * 2002-01-18 2003-07-31 De Bonet Jeremy S. Method and system for isolating and protecting software components
US20030177324A1 (en) * 2002-03-14 2003-09-18 International Business Machines Corporation Method, system, and program for maintaining backup copies of files in a backup storage device
US6880051B2 (en) 2002-03-14 2005-04-12 International Business Machines Corporation Method, system, and program for maintaining backup copies of files in a backup storage device
US20040078659A1 (en) * 2002-05-16 2004-04-22 International Business Machines Corporation Method, apparatus and computer program for reducing the amount of data checkpointed
US6978400B2 (en) * 2002-05-16 2005-12-20 International Business Machines Corporation Method, apparatus and computer program for reducing the amount of data checkpointed
US7024591B2 (en) 2002-07-12 2006-04-04 Crossroads Systems, Inc. Mechanism for enabling enhanced fibre channel error recovery across redundant paths using SCSI level commands
US7350114B2 (en) 2002-07-12 2008-03-25 Crossroads Systems, Inc. Mechanism for enabling enhanced fibre channel error recovery across redundant paths using SCSI level commands
US20060020846A1 (en) * 2002-07-12 2006-01-26 Moody William H Ii Mechanism for enabling enhanced fibre channel error recovery across redundant paths using SCSI level commands
US20040081082A1 (en) * 2002-07-12 2004-04-29 Crossroads Systems, Inc. Mechanism for enabling enhanced fibre channel error recovery across redundant paths using SCSI level commands
WO2004008320A1 (en) * 2002-07-12 2004-01-22 Crossroads Systems, Inc. Mechanism for enabling enhanced fibre channel error recovery across redundant paths using scsi level commands
US20040083158A1 (en) * 2002-10-09 2004-04-29 Mark Addison Systems and methods for distributing pricing data for complex derivative securities
US20070179923A1 (en) * 2002-10-10 2007-08-02 Stanfill Craig W Transactional Graph-Based Computation
US7636699B2 (en) 2002-10-10 2009-12-22 Ab Initio Technology Llc Processing transactions using graph-based computations including instances of computation graphs associated with the transactions
US20040088278A1 (en) * 2002-10-30 2004-05-06 Jp Morgan Chase Method to measure stored procedure execution statistics
US7340650B2 (en) 2002-10-30 2008-03-04 Jp Morgan Chase & Co. Method to measure stored procedure execution statistics
US8321467B2 (en) 2002-12-03 2012-11-27 Jp Morgan Chase Bank System and method for communicating between an application and a database
US20040107183A1 (en) * 2002-12-03 2004-06-03 Jp Morgan Chase Bank Method for simplifying databinding in application programs
US7149752B2 (en) 2002-12-03 2006-12-12 Jp Morgan Chase Bank Method for simplifying databinding in application programs
US20070143337A1 (en) * 2002-12-03 2007-06-21 Mangan John P Method For Simplifying Databinding In Application Programs
US7085759B2 (en) 2002-12-06 2006-08-01 Jpmorgan Chase Bank System and method for communicating data to a process
US8032439B2 (en) 2003-01-07 2011-10-04 Jpmorgan Chase Bank, N.A. System and method for process scheduling
US20040254824A1 (en) * 2003-01-07 2004-12-16 Alex Loucaides System and method for process scheduling
US7401156B2 (en) 2003-02-03 2008-07-15 Jp Morgan Chase Bank Method using control interface to suspend software network environment running on network devices for loading and executing another software network environment
US20040153535A1 (en) * 2003-02-03 2004-08-05 Chau Tony Ka Wai Method for software suspension in a networked computer system
US20040167878A1 (en) * 2003-02-24 2004-08-26 Andrew Doddington Systems, methods, and software for preventing redundant processing of transmissions sent to a remote host computer
US7484087B2 (en) 2003-02-24 2009-01-27 Jp Morgan Chase Bank Systems, methods, and software for preventing redundant processing of transmissions sent to a remote host computer
US20040205377A1 (en) * 2003-03-28 2004-10-14 Nec Corporation Fault tolerant multi-node computing system for parallel-running a program under different environments
US7237140B2 (en) 2003-03-28 2007-06-26 Nec Corporation Fault tolerant multi-node computing system for parallel-running a program under different environments
US20040215725A1 (en) * 2003-03-31 2004-10-28 Lorraine Love System and method for multi-platform queue queries
US7379998B2 (en) 2003-03-31 2008-05-27 Jp Morgan Chase Bank System and method for multi-platform queue queries
US20040230602A1 (en) * 2003-05-14 2004-11-18 Andrew Doddington System and method for decoupling data presentation layer and data gathering and storage layer in a distributed data processing system
US20040230587A1 (en) * 2003-05-15 2004-11-18 Andrew Doddington System and method for specifying application services and distributing them across multiple processors using XML
US7366722B2 (en) 2003-05-15 2008-04-29 Jp Morgan Chase Bank System and method for specifying application services and distributing them across multiple processors using XML
US8095659B2 (en) 2003-05-16 2012-01-10 Jp Morgan Chase Bank Service interface
US7509641B2 (en) 2003-05-16 2009-03-24 Jp Morgan Chase Bank Job processing framework
US20050030555A1 (en) * 2003-05-16 2005-02-10 Phenix John Kevin Job processing framework
US7440304B1 (en) 2003-11-03 2008-10-21 Netlogic Microsystems, Inc. Multiple string searching using ternary content addressable memory
US7634500B1 (en) * 2003-11-03 2009-12-15 Netlogic Microsystems, Inc. Multiple string searching using content addressable memory
US20090012958A1 (en) * 2003-11-03 2009-01-08 Sunder Rathnavelu Raj Multiple string searching using ternary content addressable memory
US7969758B2 (en) 2003-11-03 2011-06-28 Netlogic Microsystems, Inc. Multiple string searching using ternary content addressable memory
US20080046480A1 (en) * 2003-12-15 2008-02-21 At&T Knowledge Ventures, L.P. Architecture of database application with robust online recoverability
US20050131966A1 (en) * 2003-12-15 2005-06-16 Sbc Knowledge Ventures, L.P. Architecture of database application with robust online recoverability
US7281023B2 (en) 2003-12-15 2007-10-09 At&T Knowledge Ventures, L.P. Architecture of database application with robust online recoverability
US20050144174A1 (en) * 2003-12-31 2005-06-30 Leonid Pesenson Framework for providing remote processing of a graphical user interface
US8689185B1 (en) * 2004-01-27 2014-04-01 United Services Automobile Association (Usaa) System and method for processing electronic data
US20050172288A1 (en) * 2004-01-30 2005-08-04 Pratima Ahuja Method, system, and program for system recovery
US7366801B2 (en) 2004-01-30 2008-04-29 International Business Machines Corporation Method for buffering work requests
US20050172054A1 (en) * 2004-01-30 2005-08-04 Ramani Mathrubutham Method, system, and program for buffering work requests
US7650606B2 (en) * 2004-01-30 2010-01-19 International Business Machines Corporation System recovery
US8140348B2 (en) 2004-01-30 2012-03-20 International Business Machines Corporation Method, system, and program for facilitating flow control
US20080155140A1 (en) * 2004-01-30 2008-06-26 International Business Machines Corporation System and program for buffering work requests
US20050171789A1 (en) * 2004-01-30 2005-08-04 Ramani Mathrubutham Method, system, and program for facilitating flow control
US7702767B2 (en) 2004-03-09 2010-04-20 Jp Morgan Chase Bank User connectivity process management system
US20050204029A1 (en) * 2004-03-09 2005-09-15 John Connolly User connectivity process management system
US9734222B1 (en) 2004-04-06 2017-08-15 Jpmorgan Chase Bank, N.A. Methods and systems for using script files to obtain, format and transport data
US20050222990A1 (en) * 2004-04-06 2005-10-06 Milne Kenneth T Methods and systems for using script files to obtain, format and disseminate database information
US7376830B2 (en) 2004-04-26 2008-05-20 Jp Morgan Chase Bank System and method for routing messages
US20060031586A1 (en) * 2004-04-26 2006-02-09 Jp Morgan Chase Bank System and method for routing messages
US7665127B1 (en) 2004-06-30 2010-02-16 Jp Morgan Chase Bank System and method for providing access to protected services
US7386752B1 (en) * 2004-06-30 2008-06-10 Symantec Operating Corporation Using asset dependencies to identify the recovery set and optionally automate and/or optimize the recovery
US8015430B1 (en) 2004-06-30 2011-09-06 Symantec Operating Corporation Using asset dependencies to identify the recovery set and optionally automate and/or optimize the recovery
US7392471B1 (en) 2004-07-28 2008-06-24 Jp Morgan Chase Bank System and method for comparing extensible markup language (XML) documents
US20060085492A1 (en) * 2004-10-14 2006-04-20 Singh Arun K System and method for modifying process navigation
US20060167950A1 (en) * 2005-01-21 2006-07-27 Vertes Marc P Method for the management, logging or replay of the execution of an application process
US8539434B2 (en) * 2005-01-21 2013-09-17 International Business Machines Corporation Method for the management, logging or replay of the execution of an application process
US7487393B2 (en) * 2005-04-14 2009-02-03 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US20080215916A1 (en) * 2005-04-14 2008-09-04 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US20080195892A1 (en) * 2005-04-14 2008-08-14 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US20080092030A1 (en) * 2005-04-14 2008-04-17 International Business Machines Corporation Method and apparatus for template based parallel checkpointing
US7627783B2 (en) 2005-04-14 2009-12-01 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US7478278B2 (en) * 2005-04-14 2009-01-13 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US20060236152A1 (en) * 2005-04-14 2006-10-19 International Business Machines Corporation Method and apparatus for template based parallel checkpointing
US20070018823A1 (en) * 2005-05-30 2007-01-25 Semiconductor Energy Laboratory Co., Ltd. Semiconductor device and driving method thereof
US20110093433A1 (en) * 2005-06-27 2011-04-21 Ab Initio Technology Llc Managing metadata for graph-based computations
US8484159B2 (en) 2005-06-27 2013-07-09 Ab Initio Technology Llc Managing metadata for graph-based computations
US9158797B2 (en) 2005-06-27 2015-10-13 Ab Initio Technology Llc Managing metadata for graph-based computations
US8572516B1 (en) 2005-08-24 2013-10-29 Jpmorgan Chase Bank, N.A. System and method for controlling a screen saver
US8972906B1 (en) 2005-08-24 2015-03-03 Jpmorgan Chase Bank, N.A. System and method for controlling a screen saver
US7636946B2 (en) * 2005-08-31 2009-12-22 Microsoft Corporation Unwanted file modification and transactions
US20070180530A1 (en) * 2005-08-31 2007-08-02 Microsoft Corporation Unwanted file modification and transactions
US7889727B2 (en) 2005-10-11 2011-02-15 Netlogic Microsystems, Inc. Switching circuit implementing variable string matching
US20080212581A1 (en) * 2005-10-11 2008-09-04 Integrated Device Technology, Inc. Switching Circuit Implementing Variable String Matching
US7499933B1 (en) 2005-11-12 2009-03-03 Jpmorgan Chase Bank, N.A. System and method for managing enterprise application configuration
US8181016B1 (en) 2005-12-01 2012-05-15 Jpmorgan Chase Bank, N.A. Applications access re-certification system
KR100833681B1 (en) 2005-12-08 2008-05-29 한국전자통신연구원 A system and method for a memory management of parallel process using commit protocol
US7716461B2 (en) 2006-01-12 2010-05-11 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US7574591B2 (en) * 2006-01-12 2009-08-11 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US20070162785A1 (en) * 2006-01-12 2007-07-12 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
US20070162779A1 (en) * 2006-01-12 2007-07-12 Microsoft Corporation Capturing and restoring application state after unexpected application shutdown
CN101371250B (en) 2006-01-12 2011-05-25 微软公司 Capturing and restoring application state after unexpected application shutdown
US7913249B1 (en) 2006-03-07 2011-03-22 Jpmorgan Chase Bank, N.A. Software installation checker
US7895565B1 (en) 2006-03-15 2011-02-22 Jp Morgan Chase Bank, N.A. Integrated system and method for validating the functionality and performance of software applications
US9477581B2 (en) 2006-03-15 2016-10-25 Jpmorgan Chase Bank, N.A. Integrated system and method for validating the functionality and performance of software applications
US7571347B2 (en) * 2006-03-20 2009-08-04 Sun Microsystems, Inc. Method and apparatus for providing fault-tolerance in parallel-processing systems
US20070220298A1 (en) * 2006-03-20 2007-09-20 Gross Kenny C Method and apparatus for providing fault-tolerance in parallel-processing systems
US20070294056A1 (en) * 2006-06-16 2007-12-20 Jpmorgan Chase Bank, N.A. Method and system for monitoring non-occurring events
US7610172B2 (en) 2006-06-16 2009-10-27 Jpmorgan Chase Bank, N.A. Method and system for monitoring non-occurring events
US8572236B2 (en) 2006-08-10 2013-10-29 Ab Initio Technology Llc Distributing services in graph-based computations
US20080071765A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Regular expression searching of packet contents using dedicated search circuits
US20080071780A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Search Circuit having individually selectable search engines
US7783654B1 (en) 2006-09-19 2010-08-24 Netlogic Microsystems, Inc. Multiple string searching using content addressable memory
US7624105B2 (en) 2006-09-19 2009-11-24 Netlogic Microsystems, Inc. Search engine having multiple co-processors for performing inexact pattern search operations
US20080071757A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Search engine having multiple co-processors for performing inexact pattern search operations
US20080071779A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Method and apparatus for managing multiple data flows in a content search system
US7539031B2 (en) 2006-09-19 2009-05-26 Netlogic Microsystems, Inc. Inexact pattern searching using bitmap contained in a bitcheck command
US7644080B2 (en) 2006-09-19 2010-01-05 Netlogic Microsystems, Inc. Method and apparatus for managing multiple data flows in a content search system
US7539032B2 (en) 2006-09-19 2009-05-26 Netlogic Microsystems, Inc. Regular expression searching of packet contents using dedicated search circuits
US7529746B2 (en) 2006-09-19 2009-05-05 Netlogic Microsystems, Inc. Search circuit having individually selectable search engines
US20080071781A1 (en) * 2006-09-19 2008-03-20 Netlogic Microsystems, Inc. Inexact pattern searching using bitmap contained in a bitcheck command
US7917486B1 (en) 2007-01-18 2011-03-29 Netlogic Microsystems, Inc. Optimizing search trees by increasing failure size parameter
US7676444B1 (en) 2007-01-18 2010-03-09 Netlogic Microsystems, Inc. Iterative compare operations using next success size bitmap
US7860849B1 (en) 2007-01-18 2010-12-28 Netlogic Microsystems, Inc. Optimizing search trees by increasing success size parameter
US7636717B1 (en) 2007-01-18 2009-12-22 Netlogic Microsystems, Inc. Method and apparatus for optimizing string search operations
US20100293407A1 (en) * 2007-01-26 2010-11-18 The Trustees Of Columbia University In The City Of New York Systems, Methods, and Media for Recovering an Application from a Fault or Attack
US8924782B2 (en) * 2007-01-26 2014-12-30 The Trustees Of Columbia University In The City Of New York Systems, methods, and media for recovering an application from a fault or attack
US9218254B2 (en) 2007-01-26 2015-12-22 The Trustees Of Columbia University In The City Of New York Systems, methods, and media for recovering an application from a fault or attack
US8423572B2 (en) 2007-02-24 2013-04-16 Trend Micro Incorporated Fast identification of complex strings in a data stream
US8069183B2 (en) * 2007-02-24 2011-11-29 Trend Micro Incorporated Fast identification of complex strings in a data stream
US8812547B2 (en) 2007-02-24 2014-08-19 Trend Micro Incorporated Fast identification of complex strings in a data stream
US8117229B1 (en) 2007-02-24 2012-02-14 Trend Micro Incorporated Fast identification of complex strings in a data stream
US9600537B2 (en) 2007-02-24 2017-03-21 Trend Micro Incorporated Fast identification of complex strings in a data stream
US8214686B2 (en) * 2007-05-25 2012-07-03 Fujitsu Limited Distributed processing method
US20080294937A1 (en) * 2007-05-25 2008-11-27 Fujitsu Limited Distributed processing method
US8706667B2 (en) 2007-07-26 2014-04-22 Ab Initio Technology Llc Transactional graph-based computation with error handling
US7881125B2 (en) 2007-10-25 2011-02-01 Netlogic Microsystems, Inc. Power reduction in a content addressable memory having programmable interconnect structure
US20100054013A1 (en) * 2007-10-25 2010-03-04 Sachin Joshi Content addresable memory having selectively interconnected counter circuits
US20100321970A1 (en) * 2007-10-25 2010-12-23 Maheshwaran Srinivasan Content addressable memory having programmable interconnect structure
US20100054012A1 (en) * 2007-10-25 2010-03-04 Maheshwaran Srinivasan Content addresable memory having programmable interconnect structure
US20100321971A1 (en) * 2007-10-25 2010-12-23 Sachin Joshi Content addressable memory having selectively interconnected counter circuits
US7660140B1 (en) 2007-10-25 2010-02-09 Netlogic Microsystems, Inc. Content addresable memory having selectively interconnected counter circuits
US7656716B1 (en) 2007-10-25 2010-02-02 Netlogic Microsystems, Inc Regular expression search engine
US7787275B1 (en) 2007-10-25 2010-08-31 Netlogic Microsystems, Inc. Content addressable memory having programmable combinational logic circuits
US8631195B1 (en) 2007-10-25 2014-01-14 Netlogic Microsystems, Inc. Content addressable memory having selectively interconnected shift register circuits
US7643353B1 (en) 2007-10-25 2010-01-05 Netlogic Microsystems, Inc. Content addressable memory having programmable interconnect structure
US7826242B2 (en) 2007-10-25 2010-11-02 Netlogic Microsystems, Inc. Content addresable memory having selectively interconnected counter circuits
US7876590B2 (en) 2007-10-25 2011-01-25 Netlogic Microsystems, Inc. Content addressable memory having selectively interconnected rows of counter circuits
US7821844B2 (en) 2007-10-25 2010-10-26 Netlogic Microsystems, Inc Content addresable memory having programmable interconnect structure
WO2009134264A1 (en) * 2008-05-01 2009-11-05 Hewlett-Packard Development Company, L.P. Storing checkpoint data in non-volatile memory
US20110113208A1 (en) * 2008-05-01 2011-05-12 Norman Paul Jouppi Storing checkpoint data in non-volatile memory
US20090282392A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for debugging data integration applications with reusable synthetic data values
US20090282066A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables
US20090282383A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for executing a data integration application using executable units that operate independently of each other
US20090282058A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for developing data integration applications with reusable functional rules that are managed according to their output variables
US8112742B2 (en) 2008-05-12 2012-02-07 Expressor Software Method and system for debugging data integration applications with reusable synthetic data values
US8141029B2 (en) 2008-05-12 2012-03-20 Expressor Software Method and system for executing a data integration application using executable units that operate independently of each other
US20090282042A1 (en) * 2008-05-12 2009-11-12 Expressor Software Method and system for managing the development of data integration projects to facilitate project development and analysis thereof
US8312414B2 (en) 2008-05-12 2012-11-13 Expressor Software Method and system for executing a data integration application using executable units that operate independently of each other
US7924589B1 (en) 2008-06-03 2011-04-12 Netlogic Microsystems, Inc. Row redundancy for content addressable memory having programmable interconnect structure
US8291261B2 (en) * 2008-11-05 2012-10-16 Vulcan Technologies Llc Lightweight application-level runtime state save-and-restore utility
US20100115334A1 (en) * 2008-11-05 2010-05-06 Mark Allen Malleck Lightweight application-level runtime state save-and-restore utility
US20100211953A1 (en) * 2009-02-13 2010-08-19 Ab Initio Technology Llc Managing task execution
US9886319B2 (en) 2009-02-13 2018-02-06 Ab Initio Technology Llc Task managing application for performing tasks based on messages received from a data processing application initiated by the task managing application
US7924590B1 (en) 2009-08-10 2011-04-12 Netlogic Microsystems, Inc. Compiling regular expressions for programmable content addressable memory devices
US7916510B1 (en) 2009-08-10 2011-03-29 Netlogic Microsystems, Inc. Reformulating regular expressions into architecture-dependent bit groups
US8667329B2 (en) 2009-09-25 2014-03-04 Ab Initio Technology Llc Processing transactions in graph-based applications
US20110078500A1 (en) * 2009-09-25 2011-03-31 Ab Initio Software Llc Processing transactions in graph-based applications
US9665620B2 (en) 2010-01-15 2017-05-30 Ab Initio Technology Llc Managing data queries
US20110179014A1 (en) * 2010-01-15 2011-07-21 Ian Schechter Managing data queries
CN102263652A (en) * 2010-05-31 2011-11-30 鸿富锦精密工业(深圳)有限公司 Network device and method for changing its parameter settings
US20110296444A1 (en) * 2010-05-31 2011-12-01 Hon Hai Precision Industry Co., Ltd. Customer premises equipment and method for processing commands
US9753751B2 (en) 2010-06-15 2017-09-05 Ab Initio Technology Llc Dynamically loading graph-based computations
US8875145B2 (en) 2010-06-15 2014-10-28 Ab Initio Technology Llc Dynamically loading graph-based computations
US8527488B1 (en) 2010-07-08 2013-09-03 Netlogic Microsystems, Inc. Negative regular expression search operations
US9002946B2 (en) * 2010-08-25 2015-04-07 Autodesk, Inc. Dual modeling environment in which commands are executed concurrently and independently on both a light weight version of a proxy module on a client and a precise version of the proxy module on a server
US20120054261A1 (en) * 2010-08-25 2012-03-01 Autodesk, Inc. Dual modeling environment
US8862603B1 (en) 2010-11-03 2014-10-14 Netlogic Microsystems, Inc. Minimizing state lists for non-deterministic finite state automatons
WO2012112748A1 (en) 2011-02-18 2012-08-23 Ab Initio Technology Llc Restarting processes
WO2012112763A1 (en) 2011-02-18 2012-08-23 Ab Initio Technology Llc Restarting data processing systems
US9268645B2 (en) 2011-02-18 2016-02-23 Ab Initio Technology Llc Restarting processes
US9116759B2 (en) 2011-02-18 2015-08-25 Ab Initio Technology Llc Restarting data processing systems
US9021299B2 (en) 2011-02-18 2015-04-28 Ab Initio Technology Llc Restarting processes
US9576028B2 (en) 2011-05-02 2017-02-21 Ab Initio Technology Llc Managing data queries
US9116955B2 (en) 2011-05-02 2015-08-25 Ab Initio Technology Llc Managing data queries
US20130290772A1 (en) * 2012-04-30 2013-10-31 Curtis C. Ballard Sequence indicator for command communicated to a sequential access storage device
US9135124B2 (en) * 2012-04-30 2015-09-15 Hewlett-Packard Development Company, L.P. Sequence indicator for command communicated to a sequential access storage device
US9317467B2 (en) 2012-09-27 2016-04-19 Hewlett Packard Enterprise Development Lp Session key associated with communication path
US20150213077A1 (en) * 2012-10-09 2015-07-30 Huawei Technologies Co., Ltd. Method and system for causing a web application to obtain a database change
US9507682B2 (en) 2012-11-16 2016-11-29 Ab Initio Technology Llc Dynamic graph performance monitoring
US9274926B2 (en) 2013-01-03 2016-03-01 Ab Initio Technology Llc Configurable testing of computer programs
US9720655B1 (en) 2013-02-01 2017-08-01 Jpmorgan Chase Bank, N.A. User interface event orchestration
US9898262B2 (en) 2013-02-01 2018-02-20 Jpmorgan Chase Bank, N.A. User interface event orchestration
US9882973B2 (en) 2013-02-22 2018-01-30 Jpmorgan Chase Bank, N.A. Breadth-first resource allocation system and methods
US9537790B1 (en) 2013-02-22 2017-01-03 Jpmorgan Chase Bank, N.A. Breadth-first resource allocation system and methods
US9088459B1 (en) 2013-02-22 2015-07-21 Jpmorgan Chase Bank, N.A. Breadth-first resource allocation system and methods
US9619410B1 (en) 2013-10-03 2017-04-11 Jpmorgan Chase Bank, N.A. Systems and methods for packet switching
US9900267B2 (en) 2013-10-03 2018-02-20 Jpmorgan Chase Bank, N.A. Systems and methods for packet switching
US9354981B2 (en) 2013-10-21 2016-05-31 Ab Initio Technology Llc Checkpointing a collection of data units
US9886241B2 (en) 2013-12-05 2018-02-06 Ab Initio Technology Llc Managing interfaces for sub-graphs
US9891901B2 (en) 2013-12-06 2018-02-13 Ab Initio Technology Llc Source code translation
US9542259B1 (en) 2013-12-23 2017-01-10 Jpmorgan Chase Bank, N.A. Automated incident resolution system and method
US9868054B1 (en) 2014-02-10 2018-01-16 Jpmorgan Chase Bank, N.A. Dynamic game deployment

Also Published As

Publication number Publication date Type
EP0954779A1 (en) 1999-11-10 application
EP0954779B1 (en) 2009-02-18 grant
CA2240347A1 (en) 1997-06-19 application
CA2240347C (en) 2001-07-10 grant
EP0954779A4 (en) 2007-05-09 application
ES2320601T3 (en) 2009-05-25 grant
JP3675802B2 (en) 2005-07-27 grant
EP0954779B8 (en) 2009-04-22 grant
JP2004094963A (en) 2004-03-25 application
DE69637836D1 (en) 2009-04-02 grant
WO1997022052A1 (en) 1997-06-19 application
JP2002505768A (en) 2002-02-19 application
JP3573463B2 (en) 2004-10-06 grant
DK0954779T3 (en) 2009-04-06 grant

Similar Documents

Publication Publication Date Title
Denning Fault tolerant operating systems
Strom et al. Volatile logging in n-fault-tolerant distributed systems
Borr Transaction monitoring in Encompass
Strom et al. Optimistic recovery in distributed systems
US5155678A (en) Data availability in restartable data base system
Powell et al. Publishing: A reliable broadcast communication mechanism
US6266698B1 (en) Logging of transaction branch information for implementing presumed nothing and other protocols
Lin et al. Imprecise results: Utilizing partial computations in real-time systems
US5689633A (en) Computer program product and program storage device for including stored procedure user defined function or trigger processing within a unit of work
US7047380B2 (en) System and method for using file system snapshots for online data backup
US6401216B1 (en) System of performing checkpoint/restart of a parallel program
US5485608A (en) Methods and apparatus for updating information in a computer system using logs and state identifiers
US6934877B2 (en) Data backup/recovery system
US5440726A (en) Progressive retry method and apparatus having reusable software modules for software failure recovery in multi-process message-passing applications
US5815651A (en) Method and apparatus for CPU failure recovery in symmetric multi-processing systems
US6161198A (en) System for providing transaction indivisibility in a transaction processing system upon recovery from a host processor failure by monitoring source message sequencing
US4498145A (en) Method for assuring atomicity of multi-row update operations in a database system
US6493826B1 (en) Method and system for fault tolerant transaction-oriented data processing system
US4751702A (en) Improving availability of a restartable staged storage data base system that uses logging facilities
EP0465018B1 (en) Method and apparatus for optimizing undo log usage
US6338147B1 (en) Program products for performing checkpoint/restart of a parallel program
Lampson et al. Crash recovery in a distributed data storage system
US5727203A (en) Methods and apparatus for managing a database in a distributed object operating environment using persistent and transient cache
US5748882A (en) Apparatus and method for fault-tolerant computing
US6978279B1 (en) Database computer system using logical logging to extend recovery

Legal Events

Date Code Title Description
AS Assignment

Owner name: AB INITIO SOFTWARE CORPORATION, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:STANFILL, CRAIG;LASSER, CLIFF;LORDI, ROBERT;REEL/FRAME:007825/0846

Effective date: 19960208

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

AS Assignment

Owner name: AB INITIO SOFTWARE LLC, MASSACHUSETTS

Free format text: CHANGE OF NAME;ASSIGNOR:AB INITIO SOFTWARE CORPORATION;REEL/FRAME:022288/0828

Effective date: 20080716

AS Assignment

Owner name: ARCHITECTURE LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AB INITIO SOFTWARE LLC;REEL/FRAME:022460/0496

Effective date: 20080716

Owner name: AB INITIO TECHNOLOGY LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ARCHITECTURE LLC;REEL/FRAME:022460/0546

Effective date: 20080716

FPAY Fee payment

Year of fee payment: 12