US4852092A - Error recovery system of a multiprocessor system for recovering an error in a processor by making the processor into a checking condition after completion of microprogram restart from a checkpoint

Publication number
US4852092A
Authority
US
Grant status
Grant
Prior art keywords
processor
signal
error
instruction
retry
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US07086638
Inventor
Akihisa Makita
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2028Failover techniques eliminating a faulty processor or activating a spare
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1405Saving, restoring, recovering or retrying at machine instruction level
    • G06F11/1407Checkpointing the instruction stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2038Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2043Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share a common memory address space
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2025Failover techniques using centralised failover control functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing

Abstract

In an error recovery system for use in combination with a multiprocessor system processing instructions under microprogram control which is energized on occurrence of an intermittent error in one of the processors to restart the microprogram from a checkpoint in the faulty processor when the microprogram restart is allowable and which is energized upon occurrence of a physical error to make another processor take over execution of an instruction processed in the faulty processor, the faulty processor generates a physical error signal after completion of the microprogram restart so that another processor is forced to take over the next succeeding processing to be carried out in the faulty processor. When retry of execution of the instruction is allowable on occurrence of the intermittent error, another processor is also forced to take over execution of the instruction. A retry request can be manually inputted in advance into the one processor by an operator so that retry of execution of the instruction is carried out in the faulty processor.

Description

BACKGROUND OF THE INVENTION

(1) Field of the Invention

The present invention relates to an error recovery system for use in an electronic digital computer system comprising a plurality of processors and, in particular, to such an error recovery system for use in a tightly coupled multiprocessor system.

(2) Description of the Prior Art

As an electronic digital computer system, a tightly coupled multiprocessor system is known in the prior art which comprises a main memory for storing a plurality of programs and a plurality of processors for processing the programs. Each program comprises a succession of instructions. As a known example of the tightly coupled multiprocessor system, ACOS 1500 manufactured by NEC Corporation is disclosed by M. Baba et al in NIKKEI ELECTRONICS No. 373 issued by Nikkei McGraw-Hill Co. on July 15, 1985 under the title of "A large computer ACOS 1500 having an increased processing speed by use of two-level cache and an improvement of pipeline processing" (Reference 1).

On occurrence of an error or fault during execution of one instruction in one of the processors in ACOS 1500, the processor is made to retry execution of the instruction in order to recover the error in one of the processors, as disclosed in Reference 1. When the error is intermittent or transient, retry results in success. Then, the processor is continuously used in the computer system. When the error is a long lived, hardware, or physical error, retry is not well completed or ends in failure. Then, the processor is made into a checking condition and another of the processors is made to take over execution of the instruction by transferring status data in the faulty processor into another processor through the main memory.

An instruction fetched in one processor is executed by an executing means in the one processor under control of a microprogram comprising a succession of microsteps. In ACOS 1500, the microprogram has at least one predetermined checkpoint in the microsteps. When an error occurs in one of the processors, the microprogram is restarted from the last checkpoint before the error occurrence, as disclosed in Reference 1. When the microprogram restart ends in success, the one processor is continuously used as a normal processor in the system.

However, once a processor encounters an error, another error tends to again occur in the processor even after retry is well completed, which results in the system going down.

British Patent Specification No. 1,163,859 (Reference 2) by J. A. Arulpragasam discloses an error recovery system for, on occurrence of an error in one of the processors, making another processor take over execution of an instruction executed in the faulty processor by transferring status data in the faulty processor into another processor through the main memory.

U.S. Pat. No. 4,443,849 (Reference 3) by Ohwada assigned to Nippon Electric Co., Ltd. discloses an error recovery system for transferring the status data in the faulty processor to another processor through not the main memory but an additional storage.

However, the references 2 and 3 are silent as to the microprogram restart.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide an error recovery system for use in a multiprocessor system which, on occurrence of an error in one of the processors, is capable of making the one processor into a checking condition after completion of the microprogram restart from the checkpoint to thereby reduce the effect of a system going down.

An error recovery system to which this invention is applicable is for use in combination with an electronic computer system comprising a main memory for storing a plurality of programs and a plurality of processors for processing the programs. Each program comprises a succession of instructions. Each processor comprises an executing unit for fetching selected ones of the instructions and for executing under microprogram control each of the selected instructions, during a first period of time during which retry of execution of the selected instruction is allowable and a second period of time during which retry of execution of the selected instruction is not allowable, to produce masses of information.

The microprogram comprises a succession of microsteps and has a first interval during which restart of the microprogram is allowable from a checkpoint at a predetermined microstep. Each processor further comprises a monitoring unit for monitoring operation of the executing unit to produce an error signal when an error is detected during execution of a particular one of the selected instructions and to suspend execution of the particular instruction, an instruction retry enable signal producing unit operatively coupled to the monitoring unit for producing an instruction retry enable signal during the first time period, and a microprogram restart enable signal producing unit operatively coupled to the monitoring unit for producing a microprogram restart enable signal during the first interval. The error recovery system is responsive to the error signal from the monitoring unit in a first of the processors and accesses the microprogram restart enable signal producing unit in the first processor to produce a microprogram restart signal when the microprogram restart enable signal is detected from the microprogram restart enable signal producing unit. The first processor carries out, in response to the microprogram restart signal, restart of the microprogram from the checkpoint. The error recovery system is energized on occurrence of a physical error in the first processor to make a second of the processors take over execution of the particular instruction.

According to the present invention, the first processor further comprises a physical error signal generating unit being operatively coupled with the monitoring unit therein for detecting, after completion of restart of the microprogram, the instruction retry enable signal from the instruction retry enable signal producing unit in the first processor to produce a physical error signal. The error recovery system comprises a unit responsive to the physical error signal for producing a taking-over signal to thereby put the first processor in a checking condition.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an error recovery system in combination with a multiprocessor system according to an embodiment of the present invention;

FIG. 2 is a flow chart for illustrating execution of a typical instruction in a processor;

FIG. 3 is a flow chart for exemplarily illustrating a microprogram restart interval in connection with a microprogram for controlling execution of a data save instruction for stacking;

FIGS. 4A, 4B, and 4C are views of different sections of a flow chart illustrating operation of an error recovery unit shown in FIG. 1, the two connector symbols in FIG. 4A being connected to the corresponding connector symbols in FIG. 4B and FIG. 4C, respectively; and

FIG. 5 is a flow chart for illustrating operation of a physical error generator shown in FIG. 1.

DESCRIPTION OF PREFERRED EMBODIMENT

Referring to FIG. 1, an electronic digital computer system or a tightly coupled multiprocessor system in combination with an error recovery system according to an embodiment of the present invention comprises a main memory 10 storing a plurality of programs. Each program comprises a succession of instructions. The computer system has a plurality of processors (first and second processors 11 and 12 are exemplarily shown in the figure) for processing selected ones of the instructions. The first and second processors 11 and 12 are coupled with the main memory 10 through a system control unit 13. The main memory 10 has an operating system (OS) 14 therein for supporting activities of the computer system itself.

The main memory 10 has a memory area 15 which has first and second activity indicators 151 and 152 of memory cells for indicating active conditions of the first and second processors 11 and 12, respectively.

The system control unit 13 is provided with first and second connection indicators 131 and 132 such as flipflops for indicating connection of the first and second processors 11 and 12 to the computer system, respectively. When the processors 11 and 12 are not connected to the computer system, the flipflops 131 and 132 are reset, respectively. The flipflops 131 and 132 are set by the completion of connection of respective processors to the system.

The system control unit 13 further has first and second processor check indicators 136 and 137 such as flipflops for indicating checking conditions of first and second processors 11 and 12, respectively. When one of the first and second processors 11 and 12 encounters a physical error, the one of the processors is in a checking condition and the corresponding one of the first and second processor check indicators 136 and 137 is set.

The first processor 11 comprises an executing circuit 111 for fetching a selected one of the instructions from the main memory 10 and for executing, under microprogram control, the selected instruction to produce masses of information. Execution of the selected instruction by the executing circuit 111 is carried out over a time duration consisting of a first period of time during which retry of execution of the selected instruction is allowable and a second period of time during which retry of execution of the selected instruction is not allowable, as will later be described in detail with reference to FIG. 2. The executing circuit 111 has registers (not shown) such as a general register, instruction counter, and others, which are called software visible registers.

The first processor 11 further comprises a monitoring circuit 112 for monitoring activity of the executing circuit 111. The monitoring circuit 112 comprises an error detecting circuit (not shown) such as a parity check circuit, a coincidence deciding circuit of arithmetic results, a sequence legality checking circuit, and/or others. Each error detecting circuit produces an error signal (ER) when detecting an error during execution of the selected instruction by the executing circuit 111. In response to the error signal, the monitoring circuit 112 suspends execution of the selected instruction, as described in Reference 3.

Referring to FIG. 2, execution of the selected instruction by the executing circuit 111 comprises a plurality of sequential steps S1-S5, as is well known in the art. After start of the executing circuit 111, the selected one of the instructions in main memory 10 is fetched at the first step S1 and the fetched instruction is then interpreted at the second step S2. Then, an execution such as arithmetic is carried out at the third step S3 and an executed result is stored at the fourth step S4 into a predetermined area, for example, a register in the processor. The final step S5 is a step for updating the instruction address.

In a processor executing a usual instruction, when an error occurs within a range from step S1 to an intermediate point in step S4 just before executed result storing completion, retry can be made from the beginning of the usual instruction. Accordingly, the range is the first period of time as described above. However, on occurrence of an error in another range from the executed result storing completion point in step S4 to step S5, retry of the instruction is impossible. The above-described second period of time is the range from the executed result storing completion point in step S4 to step S5.

In connection with some of the instructions such as instructions for updating contents in the software visible registers and for updating stored data in the main memory 10, retry of the instruction is impossible after updating information in the software visible registers and/or the main memory 10 at step S3 because information required for execution of the instruction is changed by the updating. Therefore, for some instructions, the first period of time is a range from step S1 to an intermediate point in step S3 just before completion of updating, while the second period of time is a range from an updating completion point in step S3 to step S5.
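The two periods described above can be sketched as a simple predicate. This is only an illustrative model, not part of the disclosed hardware: the function name `retry_allowed` and the parameters `commit_step` and `commit_done` are hypothetical, with the "commit point" standing for the completion of result storing in step S4 (usual instructions) or the completion of updating in step S3 (instructions that update software visible registers or the main memory).

```python
def retry_allowed(current_step: int, commit_step: int, commit_done: bool) -> bool:
    """Return True while execution is still in the first period of time.

    current_step: step S1..S5 in which the error occurred (1..5).
    commit_step:  step containing the commit point (4 for a usual
                  instruction, 3 for a state-updating instruction).
    commit_done:  whether the storing/updating in commit_step had
                  already completed when the error occurred.
    All three names are assumptions made for this sketch.
    """
    if current_step < commit_step:
        return True               # error occurred before the commit step
    if current_step == commit_step:
        return not commit_done    # inside the commit step, before completion
    return False                  # after the commit point: second period


# Usual instruction: commit point is result storing completion in S4.
print(retry_allowed(3, 4, False))
# State-updating instruction: commit point is updating completion in S3.
print(retry_allowed(3, 3, True))
```

In this model the instruction retry enable flipflop described in the text would be set exactly while the predicate is true.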

Referring to FIG. 1 again, the first processor 11 has an instruction retry enable signal producing circuit 113 being operably coupled with the monitoring circuit 112. The instruction retry enable signal producing circuit 113 produces an instruction retry enable signal during the first period of time. A flipflop is also used as the instruction retry enable signal producing circuit 113 and is set during the first period of time but reset during the second period of time under hardware or microprogram control.

In connection with the above-described some instructions, there are some cases where, even when an error occurs after the updating completion point during execution of an instruction, execution of the instruction can be completed by restart of the microprogram. That is, if the updated information is in no relation with succeeding execution of the instruction, or if the changed information can be recovered by restart from a predetermined microstep or a checkpoint in the microprogram, execution of the instruction is continued by restart of the microprogram from the checkpoint, as described in Reference 1. Restart of the microprogram is not allowable when an error occurs after the executed result storing completion point in step S4. A range from the checkpoint to the executed result storing completion point in step S4 is called a microprogram restart enable interval.

Referring to FIG. 1 again, a microprogram restart enable signal producing circuit 114, such as a flipflop, is operatively coupled with the monitoring circuit 112 and is set to produce a microprogram restart enable signal during the microprogram restart enable interval under hardware or microprogram control.

Referring to FIG. 3, a description is exemplarily made as to microsteps for executing a data save instruction for stacking.

A microstep A0 is a preparing step for stacking where data for indicating a plurality of base registers and a plurality of general registers having contents to be saved into the main memory 10 are stored in the main memory according to an address given by content (T) in an address register T. At a next microstep A1, a sum (T+4) of "4" and the content (T) in the address register T is stored in a work register y.

The microstep A1 is predetermined as the above described checkpoint. Therefore, the microprogram restart enable signal (MRE) is produced from the microprogram restart enable signal producing circuit 114 (FIG. 1), that is, the flipflop is set at this microstep A1. At the same time, a microprogram address (A1) corresponding to the microstep A1 is held in a software invisible register Z.

Then, "4" is added to content (y) in the work register y and the result (y+4) is stored in the work register y so that (y)=(y+4) and content (BR0) (having 4 bytes) in the base register BR0 is then stored in the main memory according to an address given by the content (y) in the work register y at a microstep A2.

Since information in the main memory is updated at microstep A2, retry of the instruction is not allowable after microstep A2. Therefore, the instruction retry enable signal producing circuit 113 (FIG. 1) is reset.

Thereafter, similar microstep operations are carried out to save contents in the base registers and general registers. At a microstep Am, a final content (GRn) in the final general register GRn to be saved in the main memory is stored into the main memory. That is, a sum of "4" and a content (y) in the work register y is stored in the work register and then the content (GRn) in the general register is stored in the main memory according to an address indicated by the content in the work register y.

At a next microstep B, content (y) in the work register y is stored in the address register T, which is one of the software visible registers. Since the address register is updated, microprogram restart from the checkpoint is no longer allowable, and the microprogram restart enable signal producing circuit 114 (FIG. 1) is reset at the microstep B and remains reset during the succeeding microsteps.

Therefore, the microprogram restart from the checkpoint A1 is allowable during a range from microstep A1 to microstep Am. When the microprogram restarts, content (z) in the register Z is written in a microprogram address counter (not shown), and microsteps are again carried out from the microstep A1.

At a microstep C following microstep B, an instruction length (IL) is added to a content (IC) in an instruction counter IC and the result is stored in the instruction counter IC. Thus, the instruction address is updated. This microstep C is the step S5 in FIG. 2.
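The reason restart from checkpoint A1 is safe can be illustrated with a small simulation. The sketch below is a simplified model under stated assumptions: the function `save_registers`, its parameters, and the single injected transient fault are all hypothetical, and the main memory is modeled as a dictionary. Because the address register T is not modified until microstep B, re-running from A1 recomputes the same addresses and rewrites the same values.

```python
def save_registers(mem, t, regs, fault_after=None):
    """Model microsteps A1..Am and B of the data save instruction.

    mem:         dict mapping address -> stored value (models main memory).
    t:           content (T) of the address register T.
    regs:        register contents to save, in order (BR0 ... GRn).
    fault_after: index of the register store after which one transient
                 error is injected (hypothetical test hook), or None.
    Returns (new content of T, number of microprogram restarts).
    """
    restarts = 0
    pending_fault = fault_after
    while True:
        # Microstep A1 (checkpoint): y = T + 4. T itself is untouched.
        y = t + 4
        faulted = False
        for i, value in enumerate(regs):   # microsteps A2 .. Am
            y += 4                          # (y) = (y + 4)
            mem[y] = value                  # store register content at (y)
            if pending_fault == i:
                pending_fault = None        # transient: occurs only once
                faulted = True
                break
        if faulted:
            restarts += 1
            continue                        # restart from checkpoint A1
        return y, restarts                  # microstep B: T = (y)


clean_mem, faulty_mem = {}, {}
clean = save_registers(clean_mem, 100, [11, 22, 33])
faulty = save_registers(faulty_mem, 100, [11, 22, 33], fault_after=1)
print(clean_mem == faulty_mem, clean[0] == faulty[0])
```

The run with an injected error ends with the same memory contents and the same final value for T as the error-free run, which is the property that makes the interval from A1 to Am restartable. Once microstep B has overwritten T, this property no longer holds.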

Returning to FIG. 1, the first processor 11 has a retry request indicator 115 such as a flipflop for holding a retry request manually inputted by an operator, for example, when a program is being debugged.

The first processor 11 also has a physical error generator 116 for generating a physical error, which will later be described in detail with reference to FIG. 5.

The processor 12 has an arrangement similar to the above-described arrangement of the processor 11. Detail of the processor 12 is omitted in the drawing and in the description for the purpose of simplification thereof.

In order to recover an error in one of the first and second processors 11 and 12, an error recovery unit 20 is coupled with those processors 11 and 12 through the system control unit 13. The error recovery unit 20 is connected to a service processor 30.

Assuming that an error occurs in the first processor 11, connection of the error recovery unit 20 and the first processor 11 is shown in FIG. 1 but connection of the unit 20 and the second processor 12 is omitted in the figure for the purpose of simplification of the figure.

The error recovery unit 20 has a connection detector 21 for accessing the connection indicator 132 to detect a connection condition of the second processor 12, an activity detector 22 for accessing the activity indicator 152 to detect whether the second processor 12 is active or inactive, and a taking-over signal producing circuit 23 for producing a taking-over signal (TO) to write the checking condition of the first processor 11 into the first processor check indicator 136. The error recovery unit 20 also has a first accessing circuit 24 for accessing the instruction retry enable signal producing circuit 113 to obtain the instruction retry enable signal (IRE), a second accessing circuit 25 for accessing the microprogram restart enable signal producing circuit 114 to obtain the microprogram restart enable signal (MRE), and a retry request detector 26 for accessing the retry request indicator 115 to detect the retry request (RR) desired by the operator. Those circuits 21-26 are controlled by a control circuit 27 in the unit 20.

Now, operation of the error recovery unit 20 will be described with reference to the flow chart shown in FIGS. 4A to 4C in addition to FIG. 1.

Upon occurrence of an error in the first processor 11, the error recovery unit 20 starts operation for recovering the error in the first processor 11 in response to the error signal ER from the first processor 11. The control circuit 27 enables the connection detector 21 to read a content in the second connection indicator 132 at a stage Sa1. The control circuit 27 decides the read content at a stage Sa2. When the read content indicates that the second processor 12 is connected to the computer system, operation is shifted from stage Sa2 to a stage Sa3 and the control circuit 27 enables the activity detector 22 to detect the content in the second activity indicator 152. Then, the control circuit 27 decides at a stage Sa4 whether the second processor 12 is active or inactive. When the second processor is active, operation progresses to a stage Sa5 and the retry request detector 26 is enabled to read the retry request indicator 115. When the retry request is not detected at a next stage Sa6, operation is shifted from stage Sa6 to a stage Sa7 (FIG. 4B). At stage Sa7, the control circuit 27 enables the first accessing circuit 24 to access the instruction retry enable signal producing circuit 113. When the instruction retry enable signal (IRE) is read out from the instruction retry enable signal producing circuit 113, operation progresses to a stage Sa9 through a stage Sa8. At stage Sa9, the control circuit 27 drives the taking-over signal producing circuit 23 to produce the taking-over signal (TO) which sets the first processor check indicator 136 to write the checking condition of the first processor 11 thereinto.

When the instruction retry enable signal (IRE) is not detected at stage Sa8, operation is shifted from stage Sa8 to a stage Sa10 where the microprogram restart enable signal producing circuit 114 is accessed by the second accessing circuit 25. When the microprogram restart enable signal (MRE) is not detected at a stage Sa11, operation is shifted from the stage Sa11 to stage Sa9.

As is apparent from the above description, the control circuit 27 sets the first processor check indicator 136 when the instruction retry enable signal (IRE) is detected and also when neither the instruction retry enable signal (IRE) nor the microprogram restart enable signal (MRE) is detected.
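The decision made in stages Sa7 to Sa11, which applies when the second processor 12 is available and no retry request is held, can be summarized as follows. This is an illustrative sketch only: the enum `Action` and the function `decide_with_standby` are hypothetical names, and the two boolean inputs stand for the IRE and MRE flipflop states read by the accessing circuits 24 and 25. The remaining case, where MRE is detected, leads to the microprogram restart of stages Sa12 and Sa13.

```python
from enum import Enum


class Action(Enum):
    TAKE_OVER = "TO"               # stage Sa9: set processor check indicator 136
    MICROPROGRAM_RESTART = "MR1"   # stages Sa12, Sa13: reset, then restart


def decide_with_standby(ire: bool, mre: bool) -> Action:
    """Model of stages Sa7..Sa11 (names are assumptions for this sketch)."""
    if ire:
        # Sa8 -> Sa9: the whole instruction can be retried, so it is
        # handed over to the second processor rather than retried locally.
        return Action.TAKE_OVER
    if mre:
        # Sa11 -> Sa12: only the faulty processor can finish the
        # instruction, by restarting the microprogram from the checkpoint.
        return Action.MICROPROGRAM_RESTART
    # Sa11 -> Sa9: neither retry nor restart is allowable.
    return Action.TAKE_OVER


for ire, mre in [(True, False), (False, True), (False, False)]:
    print(ire, mre, decide_with_standby(ire, mre).value)
```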

In response to the taking-over signal (TO), the status signals in the first processor 11 are transferred to the second processor 12 in a known manner as disclosed in References 1-3 and the instruction processed in the first processor 11 on occurrence of the error is again executed from the beginning of the instruction in the second processor 12.

Namely, when the processor check indicator 136 is set by the taking-over signal (TO), the system control unit 13 informs the second processor 12 and the service processor 30 that the first processor 11 has been put into the checking condition. The second processor 12 activates an exception processing program stored in the operating system (OS) 14. Meanwhile, the service processor 30 transfers the status signals or contents in the software visible registers in the first processor 11 into a predetermined area in the main memory 10.

In this connection, the status signals can be transferred into a storage described in Reference 3.

Then, the exception processing program is executed in the second processor 12. That is, the status signals are read out from the main memory 10 or the storage, and it is decided whether or not retry of execution of the instruction processed in the first processor 11 on occurrence of the error is possible. When retry of execution is possible, execution of the instruction is retried in the second processor.

Returning to stage Sa11, when it is decided that the microprogram restart enable signal (MRE) is read out, the first processor 11 is reset at a stage Sa12. Then, the control circuit 27 generates a microprogram restart signal (MR1) which is applied to the first processor 11 at a stage Sa13. Then, the first processor 11 restarts the microprogram from the checkpoint.

Referring to FIG. 5 and FIG. 1, a description will be made as to the operation of the first processor 11 after restart of the microprogram.

In response to the microprogram restart signal (MR1), the physical error generator 116 cooperates with the monitoring circuit 112 and decides whether or not the microprogram restart ends in success (stage Sa14 in FIG. 5). When success is decided, the physical error generator 116 produces a success informing signal (SI) to thereby inform the error recovery unit 20 of the success of the microprogram restart (stage Sa15). The physical error generator 116 also accesses the instruction retry enable signal producing circuit 113. Thereafter, when the instruction retry enable signal producing circuit 113 is set by execution of an instruction freshly fetched in the first processor 11 from the main memory 10, the physical error generator 116 generates a physical error signal (PE) at a stage Sa16.

Now, going back to FIG. 4B, on reception of the physical error signal (PE) at a stage Sa17, the control circuit 27 of the error recovery unit 20 carries out the operation in stage Sa9 to set the first processor check indicator 136. Thus, execution of the instruction freshly fetched in the first processor 11 is taken over by the second processor 12 through the above-described status signal transferring manner.

In FIG. 5, when it is decided at stage Sa14 that the microprogram restart has not ended in success, the physical error generator 116 generates the physical error signal (PE) at a stage Sa18. The control circuit 27 in the error recovery unit 20 then likewise carries out the operations in stages Sa17 and Sa9, so that the first processor check indicator 136 is set.
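
The behaviour of the physical error generator 116 in stages Sa14 to Sa18 can be summarised as a small decision function. The following Python sketch is only an illustration of that flow: the function name `post_restart_signals` and its boolean parameters are hypothetical stand-ins for hardware conditions, since the patent describes circuits rather than software.

```python
def post_restart_signals(restart_succeeded: bool, retry_enable_set: bool) -> list:
    """Return, in order, the signals the physical error generator 116 would emit.

    Hypothetical model: both arguments stand in for hardware conditions
    observed after the microprogram restart signal (MR1) is applied.
    """
    if not restart_succeeded:
        # Stage Sa18: the restart did not succeed, so PE is generated.
        return ["PE"]
    # Stage Sa15: report success to the error recovery unit 20.
    signals = ["SI"]
    if retry_enable_set:
        # Stage Sa16: circuit 113 has been set by a freshly fetched
        # instruction, so PE is generated after the successful restart.
        signals.append("PE")
    return signals
```

In both cases where PE appears, the control circuit 27 goes on to set the first processor check indicator 136 (stages Sa17 and Sa9).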

Returning to stage Sa2 (FIG. 4A), when it is decided that the second processor 12 is not connected, the control circuit 27 proceeds to stage Sa19 (FIG. 4C), where the instruction retry enable signal producing circuit 113 is accessed by the first accessing circuit 24.

When the second processor 12 is decided to be inactive at stage Sa4, or when the retry request is detected at stage Sa6, operation of the control circuit 27 also proceeds to stage Sa19.
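
The three conditions that lead the control circuit 27 to stage Sa19 can be written as a single predicate. This is a minimal sketch under stated assumptions: the name `falls_back_to_first_processor` and its parameters are hypothetical, and only the decisions at stages Sa2, Sa4 and Sa6 are modelled.

```python
def falls_back_to_first_processor(connected: bool, active: bool,
                                  retry_requested: bool) -> bool:
    """True when recovery is attempted in the first processor 11 (stage Sa19)."""
    if not connected:
        # Stage Sa2: the second processor 12 is not connected.
        return True
    if not active:
        # Stage Sa4: the second processor 12 is inactive.
        return True
    # Stage Sa6: a retry request has been detected.
    return retry_requested
```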

When the instruction retry enable signal (IRS) is detected at a stage Sa20 after stage Sa19, the control circuit 27 resets the first processor 11 at a stage Sa21 and provides an instruction retry signal (IR) to the first processor 11 at a stage Sa22. Thus, the first processor 11 retries execution of the instruction.

When the instruction retry enable signal (IRS) is not detected at stage Sa20, the microprogram restart enable signal producing circuit 114 is accessed by the second accessing circuit 25 at a stage Sa23. When the microprogram restart enable signal (MRE) is detected at a stage Sa24, the control circuit 27 resets the first processor 11 at a stage Sa25 and then provides a microprogram restart signal (MR2) to the first processor 11. The microprogram is then restarted from the checkpoint in the first processor 11.

When the microprogram restart enable signal (MRE) is not detected at stage Sa24, the control circuit 27 resets the first processor 11 at a stage Sa27. Then, at a stage Sa28, the control circuit 27 produces an error informing command signal (EIC) for making the first processor 11 inform the operating system (OS) 14 of the fault.
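
Stages Sa19 to Sa28 form a three-level fallback: instruction retry, then microprogram restart, then reporting the fault to the operating system. The Python sketch below illustrates that ordering only. The function name `local_recovery_signals`, its parameters, and the `"RESET"` label are hypothetical; the signal names IR, MR2 and EIC are taken from the description.

```python
def local_recovery_signals(retry_enable: bool, restart_enable: bool) -> list:
    """Return the actions the control circuit 27 applies to the first processor 11.

    Hypothetical model: retry_enable stands for detection of IRS at stage Sa20,
    restart_enable for detection of MRE at stage Sa24.
    """
    if retry_enable:
        # Stages Sa21 and Sa22: reset, then instruction retry signal (IR).
        return ["RESET", "IR"]
    if restart_enable:
        # Stage Sa25 onward: reset, then microprogram restart signal (MR2).
        return ["RESET", "MR2"]
    # Stages Sa27 and Sa28: reset, then error informing command signal (EIC).
    return ["RESET", "EIC"]
```

Instruction retry takes priority: in this model, as in the description, the microprogram restart enable signal is consulted only when the instruction retry enable signal is not detected.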

Claims (6)

What is claimed is:
1. An error recovery system for use in combination with an electronic computer system comprising a main memory for storing a plurality of programs and a plurality of processors for processing said programs,
each program comprising a succession of instructions,
each processor comprising:
executing means for fetching selected ones of said instructions and for executing under microprogram control each of the selected instructions, during a first period of time during which retry of execution of the selected instruction is allowable and a second period of time during which retry of execution of said selected instruction is not allowable, to produce masses of information, said microprogram comprising a succession of microsteps and having a first interval during which restart of the microprogram is allowable from a checkpoint at a predetermined microstep,
monitoring means for monitoring operation of said executing means to produce an error signal when an error is detected during execution of a particular one of the selected instructions and to suspend execution of the particular instruction,
instruction retry enable signal producing means operatively coupled to said monitoring means for producing an instruction retry enable signal during said first time period, and
microprogram restart enable signal producing means operatively coupled to said monitoring means for producing a microprogram restart enable signal during said first interval,
said error recovery system being responsive to the error signal from said monitoring means in a first of said processors for accessing said microprogram restart enable signal producing means in said first processor to produce a microprogram restart signal when said microprogram restart enable signal is detected from said microprogram restart enable signal producing means, said first processor carrying out, in response to said microprogram restart signal, restart of the microprogram from said checkpoint,
said error recovery system being energized on occurrence of a physical error in said first processor to produce a taking-over signal for making a second of said processors take over execution of the particular instruction, wherein
said first processor further comprises physical error signal generating means operatively coupled with said monitoring means therein for detecting, after a retry of said microprogram is successfully completed by restarting from said checkpoint of said microprogram, said instruction retry enable signal from said instruction retry enable signal producing means in said first processor to produce a physical error signal; and
said error recovery system comprises means responsive to said physical error signal for producing said taking-over signal to thereby put said first processor in a checking condition.
2. An error recovery system as claimed in claim 1, further comprising first accessing means responsive to said error signal from said first processor for accessing said instruction retry enable signal producing means in said first processor to produce a first enable signal when the instruction retry enable signal is detected from said instruction retry enable signal producing means in said first processor, and said taking-over signal producing means being coupled with said first accessing means and producing said taking-over signal in response to said first enable signal.
3. An error recovery system as claimed in claim 2, said main memory further comprising activity indicating means for indicating an active condition of said second processor, said error recovery system further comprising activity detecting means responsive to said error signal for accessing said activity indicating means to produce an inactive signal when no active condition of said second processor is detected in said activity indicating means, said taking-over signal producing means being also coupled with said activity detecting means and being put in an inactive condition by said inactive signal.
4. An error recovery system as claimed in claim 3, said error recovery system further comprising instruction retry instructing means responsive to said first enable signal and said inactive signal for producing an instruction retry signal to said first processor, said first processor carrying out retry of execution of the particular instruction in response to said instruction retry signal.
5. An error recovery system as claimed in claim 2, said first processor further comprising retry request indicating means for holding a retry request manually inputted, and said error recovery system further comprising retry request detecting means responsive to said error signal for accessing said retry request indicating means to produce a second enable signal when said retry request is detected from said retry request indicating means, said taking-over signal generating means being coupled with said retry request detecting means and being put in an inactive condition by said second enable signal, and instruction retry instructing means responsive to said first enable signal and said second enable signal for producing an instruction retry signal to said first processor, said first processor retrying execution of the particular instruction in response to said instruction retry signal.
6. An error recovery system as claimed in claim 1, further comprising processor check indicating means responsive to said taking-over signal for indicating a condition that said first processor should be checked.

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP61-193549 1986-08-18
JP19355086 1986-08-18
JP61-193550 1986-08-18
JP19354986 1986-08-18

Publications (1)

Publication Number Publication Date
US4852092A (en) 1989-07-25

Family

ID=26507941

Family Applications (1)

Application Number Title Priority Date Filing Date
US07086638 Expired - Fee Related US4852092A (en) 1986-08-18 1987-08-18 Error recovery system of a multiprocessor system for recovering an error in a processor by making the processor into a checking condition after completion of microprogram restart from a checkpoint

Country Status (2)

Country Link
US (1) US4852092A (en)
FR (1) FR2602891B1 (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4912707A (en) * 1988-08-23 1990-03-27 International Business Machines Corporation Checkpoint retry mechanism
US5043866A (en) * 1988-04-08 1991-08-27 International Business Machines Corporation Soft checkpointing system using log sequence numbers derived from stored data pages and log records for database recovery
US5065311A (en) * 1987-04-20 1991-11-12 Hitachi, Ltd. Distributed data base system of composite subsystem type, and method fault recovery for the system
US5101408A (en) * 1988-11-10 1992-03-31 Mitsubishi Denki K.K. Error collection method for sorter system
US5153881A (en) * 1989-08-01 1992-10-06 Digital Equipment Corporation Method of handling errors in software
US5172378A (en) * 1989-05-09 1992-12-15 Hitachi, Ltd. Error detection method and apparatus for processor having main storage
US5247447A (en) * 1990-10-31 1993-09-21 The Boeing Company Exception processor system
US5321698A (en) * 1991-12-27 1994-06-14 Amdahl Corporation Method and apparatus for providing retry coverage in multi-process computer environment
US5495587A (en) * 1991-08-29 1996-02-27 International Business Machines Corporation Method for processing checkpoint instructions to allow concurrent execution of overlapping instructions
US5504859A (en) * 1993-11-09 1996-04-02 International Business Machines Corporation Data processor with enhanced error recovery
US5533191A (en) * 1992-05-07 1996-07-02 Nec Corporation Computer system comprising a plurality of terminal computers capable of backing up one another on occurrence of a fault
US5551043A (en) * 1994-09-07 1996-08-27 International Business Machines Corporation Standby checkpoint to prevent data loss
US5581691A (en) * 1992-02-04 1996-12-03 Digital Equipment Corporation Work flow management system and method
US5630047A (en) * 1995-09-12 1997-05-13 Lucent Technologies Inc. Method for software error recovery using consistent global checkpoints
US5678003A (en) * 1995-10-20 1997-10-14 International Business Machines Corporation Method and system for providing a restartable stop in a multiprocessor system
US5715386A (en) * 1992-09-30 1998-02-03 Lucent Technologies Inc. Apparatus and methods for software rejuvenation
US5748882A (en) * 1992-09-30 1998-05-05 Lucent Technologies Inc. Apparatus and method for fault-tolerant computing
US5884021A (en) * 1996-01-31 1999-03-16 Kabushiki Kaisha Toshiba Computer system having a checkpoint and restart function
US5911040A (en) * 1994-03-30 1999-06-08 Kabushiki Kaisha Toshiba AC checkpoint restart type fault tolerant computer system
EP0701209A3 (en) * 1994-09-08 1999-09-22 AT&T Corp. Apparatus and methods for software rejuvenation
US6031991A (en) * 1994-05-19 2000-02-29 Kabushiki Kaisha Toshiba Debug system and method for reproducing an error occurring in parallel-executed programs
US6115829A (en) * 1998-04-30 2000-09-05 International Business Machines Corporation Computer system with transparent processor sparing
US6148416A (en) * 1996-09-30 2000-11-14 Kabushiki Kaisha Toshiba Memory update history storing apparatus and method for restoring contents of memory
US6189112B1 (en) * 1998-04-30 2001-02-13 International Business Machines Corporation Transparent processor sparing
CN1092358C (en) * 1996-12-16 2002-10-09 富士通株式会社 Computer system with detecting point function
US20090113240A1 (en) * 2006-03-31 2009-04-30 Xavier Vera Detecting Soft Errors Via Selective Re-Execution
US20090217090A1 (en) * 2004-08-04 2009-08-27 Reinhard Weiberle Method, operating system and computing hardware for running a computer program
US20140223062A1 (en) * 2013-02-01 2014-08-07 International Business Machines Corporation Non-authorized transaction processing in a multiprocessing environment
US9858151B1 (en) * 2016-10-03 2018-01-02 International Business Machines Corporation Replaying processing of a restarted application

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
US4926320A (en) * 1987-04-07 1990-05-15 Nec Corporation Information processing system having microprogram-controlled type arithmetic processing unit
US5214652A (en) * 1991-03-26 1993-05-25 International Business Machines Corporation Alternate processor continuation of task of failed processor

Citations (6)

Publication number Priority date Publication date Assignee Title
US3736566A (en) * 1971-08-18 1973-05-29 Ibm Central processing unit with hardware controlled checkpoint and retry facilities
GB2047446A (en) * 1979-04-17 1980-11-26 Hitachi Ltd Multiprocessor information processing system having fault detection function
EP0105710A2 (en) * 1982-09-28 1984-04-18 Fujitsu Limited Method for recovering from error in a microprogram-controlled unit
US4586180A (en) * 1982-02-26 1986-04-29 Siemens Aktiengesellschaft Microprocessor fault-monitoring circuit
US4627054A (en) * 1984-08-27 1986-12-02 International Business Machines Corporation Multiprocessor array error detection and recovery apparatus
US4641305A (en) * 1984-10-19 1987-02-03 Honeywell Information Systems Inc. Control store memory read error resiliency method and apparatus

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US4128203A (en) * 1977-09-01 1978-12-05 Eaton Corporation Four-port thermally responsive valve
FR2481831A1 (en) * 1980-05-05 1981-11-06 Westinghouse Electric Corp Industrial complex multiprocessor control system - uses microprocessor to detect and correct system fault by modifying configuration

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US3736566A (en) * 1971-08-18 1973-05-29 Ibm Central processing unit with hardware controlled checkpoint and retry facilities
GB2047446A (en) * 1979-04-17 1980-11-26 Hitachi Ltd Multiprocessor information processing system having fault detection function
US4586180A (en) * 1982-02-26 1986-04-29 Siemens Aktiengesellschaft Microprocessor fault-monitoring circuit
EP0105710A2 (en) * 1982-09-28 1984-04-18 Fujitsu Limited Method for recovering from error in a microprogram-controlled unit
US4566103A (en) * 1982-09-28 1986-01-21 Fujitsu Limited Method for recovering from error in a microprogram-controlled unit
US4627054A (en) * 1984-08-27 1986-12-02 International Business Machines Corporation Multiprocessor array error detection and recovery apparatus
US4641305A (en) * 1984-10-19 1987-02-03 Honeywell Information Systems Inc. Control store memory read error resiliency method and apparatus

Cited By (36)

Publication number Priority date Publication date Assignee Title
US5065311A (en) * 1987-04-20 1991-11-12 Hitachi, Ltd. Distributed data base system of composite subsystem type, and method fault recovery for the system
US5333314A (en) * 1987-04-20 1994-07-26 Hitachi, Ltd. Distributed data base system of composite subsystem type, and method of fault recovery for the system
US5043866A (en) * 1988-04-08 1991-08-27 International Business Machines Corporation Soft checkpointing system using log sequence numbers derived from stored data pages and log records for database recovery
US4912707A (en) * 1988-08-23 1990-03-27 International Business Machines Corporation Checkpoint retry mechanism
US5101408A (en) * 1988-11-10 1992-03-31 Mitsubishi Denki K.K. Error collection method for sorter system
US5172378A (en) * 1989-05-09 1992-12-15 Hitachi, Ltd. Error detection method and apparatus for processor having main storage
US5153881A (en) * 1989-08-01 1992-10-06 Digital Equipment Corporation Method of handling errors in software
US5247447A (en) * 1990-10-31 1993-09-21 The Boeing Company Exception processor system
US5495590A (en) * 1991-08-29 1996-02-27 International Business Machines Corporation Checkpoint synchronization with instruction overlap enabled
US5495587A (en) * 1991-08-29 1996-02-27 International Business Machines Corporation Method for processing checkpoint instructions to allow concurrent execution of overlapping instructions
US5321698A (en) * 1991-12-27 1994-06-14 Amdahl Corporation Method and apparatus for providing retry coverage in multi-process computer environment
US5581691A (en) * 1992-02-04 1996-12-03 Digital Equipment Corporation Work flow management system and method
US5533191A (en) * 1992-05-07 1996-07-02 Nec Corporation Computer system comprising a plurality of terminal computers capable of backing up one another on occurrence of a fault
US5748882A (en) * 1992-09-30 1998-05-05 Lucent Technologies Inc. Apparatus and method for fault-tolerant computing
US5715386A (en) * 1992-09-30 1998-02-03 Lucent Technologies Inc. Apparatus and methods for software rejuvenation
US5504859A (en) * 1993-11-09 1996-04-02 International Business Machines Corporation Data processor with enhanced error recovery
US5911040A (en) * 1994-03-30 1999-06-08 Kabushiki Kaisha Toshiba AC checkpoint restart type fault tolerant computer system
US6031991A (en) * 1994-05-19 2000-02-29 Kabushiki Kaisha Toshiba Debug system and method for reproducing an error occurring in parallel-executed programs
US5551043A (en) * 1994-09-07 1996-08-27 International Business Machines Corporation Standby checkpoint to prevent data loss
EP0701209A3 (en) * 1994-09-08 1999-09-22 AT&T Corp. Apparatus and methods for software rejuvenation
US5630047A (en) * 1995-09-12 1997-05-13 Lucent Technologies Inc. Method for software error recovery using consistent global checkpoints
US5664088A (en) * 1995-09-12 1997-09-02 Lucent Technologies Inc. Method for deadlock recovery using consistent global checkpoints
US5678003A (en) * 1995-10-20 1997-10-14 International Business Machines Corporation Method and system for providing a restartable stop in a multiprocessor system
US5884021A (en) * 1996-01-31 1999-03-16 Kabushiki Kaisha Toshiba Computer system having a checkpoint and restart function
CN1101573C (en) * 1996-01-31 2003-02-12 株式会社东芝 computer system
US6148416A (en) * 1996-09-30 2000-11-14 Kabushiki Kaisha Toshiba Memory update history storing apparatus and method for restoring contents of memory
CN1092358C (en) * 1996-12-16 2002-10-09 富士通株式会社 Computer system with detecting point function
US6189112B1 (en) * 1998-04-30 2001-02-13 International Business Machines Corporation Transparent processor sparing
US6115829A (en) * 1998-04-30 2000-09-05 International Business Machines Corporation Computer system with transparent processor sparing
US20090217090A1 (en) * 2004-08-04 2009-08-27 Reinhard Weiberle Method, operating system and computing hardware for running a computer program
US7890800B2 (en) * 2004-08-04 2011-02-15 Robert Bosch Gmbh Method, operating system and computing hardware for running a computer program
US20090113240A1 (en) * 2006-03-31 2009-04-30 Xavier Vera Detecting Soft Errors Via Selective Re-Execution
US8090996B2 (en) * 2006-03-31 2012-01-03 Intel Corporation Detecting soft errors via selective re-execution
US8402310B2 (en) 2006-03-31 2013-03-19 Intel Corporation Detecting soft errors via selective re-execution
US20140223062A1 (en) * 2013-02-01 2014-08-07 International Business Machines Corporation Non-authorized transaction processing in a multiprocessing environment
US9858151B1 (en) * 2016-10-03 2018-01-02 International Business Machines Corporation Replaying processing of a restarted application

Also Published As

Publication number Publication date Type
FR2602891B1 (en) 1990-12-07 grant
FR2602891A1 (en) 1988-02-19 application

Similar Documents

Publication Publication Date Title
US3564506A (en) Instruction retry byte counter
US3533065A (en) Data processing system execution retry control
US4410942A (en) Synchronizing buffered peripheral subsystems to host operations
US5440729A (en) Method for handling error information between channel unit and central computer
US5423026A (en) Method and apparatus for performing control unit level recovery operations
US4999837A (en) Programmable channel error injection
US4912707A (en) Checkpoint retry mechanism
US4894828A (en) Multiple sup swap mechanism
US4356550A (en) Multiprocessor system
US5392397A (en) Command execution system for using first and second commands to reserve and store second command related status information in memory portion respectively
US4371754A (en) Automatic fault recovery system for a multiple processor telecommunications switching control
US5043871A (en) Method and apparatus for database update/recovery
US5768496A (en) Method and apparatus for obtaining a durable fault log for a microprocessor
US4163280A (en) Address management system
US4409654A (en) Data processor adapted for interruption to an instruction stream
US6763456B1 (en) Self correcting server with automatic error handling
US20040078697A1 (en) Latent fault detector
US6119246A (en) Error collection coordination for software-readable and non-software readable fault isolation registers in a computer system
US5437033A (en) System for recovery from a virtual machine monitor failure with a continuous guest dispatched to a nonguest mode
US4053752A (en) Error recovery and control in a mass storage system
US6550019B1 (en) Method and apparatus for problem identification during initial program load in a multiprocessor system
US5619644A (en) Software directed microcode state save for distributed storage controller
US4703481A (en) Method and apparatus for fault recovery within a computing system
US5528755A (en) Invalid data detection, recording and nullification
US5630139A (en) Program download type information processor

Legal Events

Date Code Title Description
AS Assignment

Owner name: NEC CORPORATION, 33-1, SHIBA 5-CHOME, MINATO-KU, T

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST.;ASSIGNOR:MAKITA, AKIHISA;REEL/FRAME:004786/0711

Effective date: 19870813

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
FP Expired due to failure to pay maintenance fee

Effective date: 20010725