GB2411270A - Recovery from loss of lock step - Google Patents

Recovery from loss of lock step

Info

Publication number
GB2411270A
Authority
GB
United Kingdom
Prior art keywords
processor
lock step
loss
architected state
processor unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0509528A
Other versions
GB2411270B (en)
GB0509528D0 (en)
Inventor
Kevin David Safford
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/187,833 (US7085959B2)
Application filed by Hewlett Packard Development Co LP
Publication of GB0509528D0
Publication of GB2411270A
Application granted
Publication of GB2411270B
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1658 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2038 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1641 Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/165 Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/845 Systems in which the redundancy can be transformed in increased performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

An apparatus, operating on an advanced multi-core processor architecture, and a corresponding method, are used to enhance recovery from loss of lock step in a multi-processor computer system (100). The apparatus for recovery from loss of lock step includes multiple processor units (111, 113; 121, 123; 125, 127) operating in the computer system, each of the processor units having at least two processor units operating in lock step, and at least one idle processor unit operating in lock step; and a controller (130) coupled to the two processor units operating in lock step and the idle processor unit. The controller includes mechanisms for copying an architected state of each of the two lock step processor units to the idle processor unit.

Description

METHOD AND APPARATUS FOR RECOVERY FROM LOSS OF LOCK STEP
Technical Field
The technical field is computer systems employing lock stepped microprocessors.
Background
Advanced computer architectures may employ multiple microprocessors. Some advanced computer architectures may employ multiple microprocessors on one silicon chip. In a typical application, two microprocessors may be implemented on a single silicon chip, and the implementation may be referred to as a dual core processor. Two or more of the multiple microprocessors may operate in a lock step mode, meaning that each of the lock stepped microprocessors processes the same code sequences and should, therefore, produce identical outputs. Figure 1A illustrates a typical implementation of a dual core processor. A dual core processor 10 includes a silicon chip 11 having microprocessor core 12 (core 0) and microprocessor core 14 (core 1). The microprocessor cores 12 and 14 are coupled to an interface logic 16 that monitors external communications from the microprocessor cores 12 and 14. In the dual core processor 10, the microprocessor cores 12 and 14 operate as independent entities. While the dual core processor 10 has advantages in terms of size and processing speed, the reliability of the dual core processor 10 is not significantly better than that of two single core processors.
To enhance reliability, the dual core processor, or other multiple microprocessor architected computer systems, may employ lock step features. Figure 1B is a diagram of a prior art dual core processor that uses lock step techniques to improve overall reliability. In Figure 1B, a computer system 18 includes a dual core processor 20 having a single silicon chip 21, on which are implemented microprocessor core 22 and microprocessor core 24. To employ lock step, each of the microprocessor cores 22 and 24 processes the same code streams. To ensure reliable operation of the dual core processor 20, each of the microprocessors 22 and 24 may operate in "lock step." An event that causes a loss of lock step can occur on either or both of the microprocessor cores 22 and 24. An example of such an event is a data cache error. A loss of lock step, if not promptly corrected, may cause the computer system 18 to "crash." That is, a failure of one microprocessor core may halt processing of the dual core processor 20, and the computer system 18, even if the other microprocessor core does not encounter an error.
To detect a loss of lock step, a lock step logic 26, which may be external to the chip 21, compares outputs from the microprocessor cores 22 and 24. A difference in processing detected by the lock step logic 26 is by definition a loss of lock step. A drawback to the dual core processor architecture shown in Figure 1B is that the logic to determine loss of lock step is external to the chip. This configuration imposes delays in determining loss of lock step, and requires additional architectural features.
The dual core processor 20 also makes recovery from a loss of lock step difficult and time-consuming. Figure 1C illustrates a current methodology for recovering from a loss of lock step. In Figure 1C, the dual core processor 20 is shown coupled to memory 25. Should the dual core processor 20 suffer a loss of lock step, recovery may be initiated by the memory 25 saving the architected state of one of the microprocessors 22 and 24 (i.e., the microprocessor that is considered "good"). Then, both microprocessors 22 and 24 are reset and reinitialized. Finally, the architected state of each of the microprocessors 22 and 24 is copied from the memory 25 into the microprocessors 22 and 24, respectively. This prior art methodology for recovery from a loss of lock step makes the microprocessors 22 and 24 unavailable for an amount of time. If the amount of time required for recovery is too long, the computer system 18 employing the dual core processor 20 may "crash."
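By way of illustration only, the prior art save, reset, and restore sequence of Figure 1C may be modeled with the following Python sketch. The names Core and prior_art_recovery are hypothetical and do not appear in the figures, and the architected state is assumed to be representable as a simple dictionary. The sketch shows why both microprocessors are unavailable for the whole recovery.

```python
class Core:
    """Toy microprocessor core; `state` stands in for the architected state."""

    def __init__(self, state):
        self.state = dict(state)
        self.available = True


def prior_art_recovery(good, bad):
    """Save the good core's state, reset both cores, then restore both."""
    # Step 1: the memory saves the architected state of the "good" core.
    memory = dict(good.state)
    # Step 2: both cores are reset and reinitialized; neither can do work.
    for core in (good, bad):
        core.available = False
        core.state = {}
    # Step 3: the saved state is copied from memory into each core.
    for core in (good, bad):
        core.state = dict(memory)
        core.available = True
    return memory
```

In this model, no core is available between step 2 and the end of step 3, which corresponds to the recovery window described above.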
Summary
An apparatus, operating on an advanced multi-core processor architecture, and a corresponding method, are used to enhance recovery from loss of lock step in a computer system. In an embodiment, the apparatus for recovery from loss of lock step comprises a plurality of processor units operating in the computer system, each of the processor units comprising at least two processor units operating in lock step, and at least one idle processor unit operating in lock step; and a controller coupled to at least the at least two processor units operating in lock step and the at least one idle processor unit, the controller comprising means for copying an architected state of each of the at least two processor units to the idle processor unit.
The method comprises receiving a loss of lock step signal from a processor unit; receiving a notice from the processor unit experiencing the loss of lock step to take the processor unit off line; and moving an architected state of the processor unit experiencing the loss of lock step to a spare processor unit, wherein the spare processor unit becomes an active processor unit in the computer system.
Description of the Drawings
The detailed description will refer to the following figures, in which like numbers refer to like elements, and in which:
Figure 1A is a diagram of a prior art dual-core processor;
Figure 1B is a diagram of a prior art dual-core processor employing lock step;
Figure 1C is a diagram illustrating prior art recovery from loss of lock step;
Figure 2 is a diagram of a computer system that uses an improved, multi-core processor employing lock step processing;
Figure 3 illustrates additional architectural features for use in recovery from loss of lock step for the computer system of Figure 2; and
Figure 4 is a flowchart illustrating a process for recovery from loss of lock step in the computer system of Figure 3.
Detailed Description
To improve reliability of processing assets, a computer system employs lock stepped processor cores that operate in a master/checker pair. Each of two processors in the pair processes the same code sequences, and the resulting outputs of the processors are compared by a logic circuit located near external interfaces of the two processors. Any difference in the processor outputs indicates the existence of an error. The logic circuit may then initiate a sequence of steps that halt operation of the two processors. Figure 2 shows a computer system 100 that employs processors 111 (central processor unit (CPU) 0) and 113 (CPU 1), which, in an embodiment, may be located on a common silicon chip or substrate 110. Alternatively, the processors 111 and 113 may be implemented on separate substrates. The processors 111 and 113 may operate in an independent mode, or in a lock step mode. When operating in a lock step mode, the processors 111 and 113 will appear to the computer system 100 to be a single processor core, or a logical CPU 0. The processor 111 may include error detection and signaling logic 112, and the processor 113 may include error detection and signaling logic 114. The error detection and signaling logic will be described later.
External logic circuit 115 monitors outputs of the processors 111 and 113 and may be used to detect any differences in the outputs. As noted above, such differences are indicative of a potential error in at least one of the processors 111 and 113. However, which of the processors 111 and 113 is subject to an error condition may not be known. On rare occasions, both the processors 111 and 113 may be subject to an error condition. Such an error condition may lead to a halt in processing of the processors 111 and 113 until the error can be corrected. In other words, any difference in the outputs causes a loss of lock step, and a halt to processing.
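By way of illustration only, the master/checker comparison described above may be modeled in Python as follows. This is a sketch under stated assumptions, not the circuit of the logic circuit 115: the names LockStepChecker, step, and LossOfLockStep are hypothetical, and the processor outputs are assumed to be comparable values such as integers.

```python
class LossOfLockStep(Exception):
    """Raised when the two lock stepped outputs differ (hypothetical name)."""


class LockStepChecker:
    """Toy comparator sitting at the external interface of a processor pair."""

    def __init__(self):
        self.halted = False

    def step(self, master_out, checker_out):
        # Once lock step is lost, the pair stays halted until it is recovered.
        if self.halted:
            raise LossOfLockStep("pair is halted")
        if master_out != checker_out:
            # The comparator sees only that the outputs differ; it cannot
            # tell which of the two processors is in error.
            self.halted = True
            raise LossOfLockStep(
                f"outputs differ: {master_out!r} != {checker_out!r}")
        # The outputs agree, so a single value is forwarded to the system.
        return master_out
```

Because the comparison alone cannot identify the faulty processor, the model halts the whole pair, which mirrors the halt to processing described above.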
To improve availability of the processor assets of the computer system 100, additional features, such as means for detecting and signaling occurrence of errors, may be incorporated into the computer system 100. For example, the error detection and signaling logic 112 and 114 may be included in the processors 111 and 113, respectively, or in other parts of the computer system 100, to signal an impending loss of lock step. Using the impending loss of lock step signal, the computer system 100 may continue operating (processing) using one of the processors 111 and 113 that did not experience an error. In particular, certain events within either of the lock stepped processors 111 and 113 may be used by the processors 111 and 113, respectively, to indicate the impending loss of lock step. As an example, and possibly due to completely random circumstances, a data cache error for a cache associated with the processor 111 may occur. Such an error can be completely corrected (i.e., the processor 111 does not need to be replaced), but will guarantee that the processors 111 and 113 will break lock step at some future time because the data cache error causes timing differences between the processors 111 and 113. The processor 111 may detect the data cache error, and use the detection of this data cache error to signal the logic circuit 115 that the processor 111 is experiencing an error that will cause a loss of lock step, and that the processor 111 is "bad." The logic circuit 115 may then "turn off," thereby ending lock step operations, and processing may continue using the "good" processor 113. At some future time, recovery from the loss of lock step (and correction of the data cache error) is executed to restore lock step operation of the processors 111 and 113.
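By way of illustration only, the impending loss of lock step signal may be modeled with the Python sketch below. The class LogicCircuit and its methods are hypothetical names introduced for this example, and the processors are identified simply by their reference numerals. The sketch assumes that a processor reporting a correctable error removes itself from the trusted set, after which processing continues on the remaining processor.

```python
class LogicCircuit:
    """Toy model of the lock step logic that receives the early warning."""

    def __init__(self, processor_ids):
        self.active = list(processor_ids)  # processors currently trusted
        self.lock_step = True

    def signal_impending_loss(self, bad_id):
        # The signaling processor declares itself "bad"; the comparison is
        # turned off and processing continues on the remaining processor.
        if bad_id in self.active:
            self.active.remove(bad_id)
        self.lock_step = False

    def restore_lock_step(self, recovered_id):
        # Called at some later time, once the error has been corrected.
        if recovered_id not in self.active:
            self.active.append(recovered_id)
        self.lock_step = len(self.active) >= 2
```

For example, after `LogicCircuit([111, 113]).signal_impending_loss(111)` the model keeps only the processor 113 active, and a later call to `restore_lock_step(111)` returns the pair to lock step operation.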
Figure 3 illustrates further architectural details for recovery from loss of lock step in the computer system 100 of Figure 2. In Figure 3, the computer system 100 is shown with additional processors 121, 123, 125, and 127, as well as the processors 111 and 113. The processors 111, 113, 121, 123, 125, and 127 are coupled to node controller 130. The processors operate as pairs when in lock step (i.e., the processors 111 and 113 are a first pair; the processors 121 and 123 are a second pair; and the processors 125 and 127 are a third pair). From the node controller's perspective, each pair of processors appears as a single (logical) processor. The processor pairs, or processor units, are coupled to a lockstep logic, such as the lockstep logic 115 shown in Figure 2, and the lockstep logic is then connected to the node controller 130. The node controller 130 provides means for copying the architected state of a processor to another processor. In an embodiment, the node controller 130 has available at all times a current architected state of the processors to which the node controller 130 is coupled. In another embodiment, the node controller 130 simply provides means for communication among the processors 111, 113, 121, 123, 125, and 127. For example, the node controller 130 may store the architected state of the processors 111, 113, 121, 123, 125, and 127, either internally in the node controller 130, or in another component of the computer system 100. Alternatively, the node controller 130 may allow one processor (e.g., the processor 111) to copy the architected state of the processor 111 to another processor (e.g., the processor 125). In yet another alternative embodiment, the node controller 130 may allow a processor that has broken lock step to copy, as part of the process for recovering from loss of lock step, the architected state of the processor to the node controller 130, which will in turn copy the architected state to a "hot standby" processor.
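By way of illustration only, the three alternatives for moving an architected state described above may be sketched in Python as follows. The classes Processor and NodeController and all method names are hypothetical, and the architected state is assumed to be a dictionary that can be deep-copied. The sketch is not a description of how the node controller 130 is actually implemented.

```python
import copy


class Processor:
    """Toy processor; `state` stands in for the architected state."""

    def __init__(self, name, state=None):
        self.name = name
        self.state = state or {}


class NodeController:
    """Toy node controller offering three ways to move an architected state."""

    def __init__(self):
        self.stored = {}  # processor name -> last stored architected state

    def track(self, proc):
        # Alternative 1: the controller holds a current copy of the state.
        self.stored[proc.name] = copy.deepcopy(proc.state)

    def restore_from_stored(self, source_name, target):
        target.state = copy.deepcopy(self.stored[source_name])

    def copy_direct(self, source, target):
        # Alternative 2: processor-to-processor copy, no intermediate storage.
        target.state = copy.deepcopy(source.state)

    def copy_via_controller(self, source, target):
        # Alternative 3: the processor that broke lock step hands its state
        # to the controller, which forwards it to the hot standby.
        self.track(source)
        self.restore_from_stored(source.name, target)
```

The stored-state alternative trades storage in the controller for a copy that does not depend on the failing processor at recovery time, while the direct copy avoids the intermediate storage.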
The six processors 111, 113, 121, 123, 125, and 127 operate in lock step (i.e., are processing code sequences). For example, the processor 111 operates in lock step with the processor 113, the processor 121 operates in lock step with the processor 123, and the processor 125 operates in lock step with the processor 127.
The processor 125 may be designated as a "hot standby," and is sitting idle in lock step mode with the processor 127. Should one of the processors 111, 113, 121, and 123 suffer an error, the hot standby processors 125, 127 may be used to speed recovery from the resulting loss of lock step.
Figure 4 is a flow chart illustrating a process 200 for recovery from a loss of lock step using the computer system 100 shown in Figure 3. The process 200 will be shown with an error condition in the first processor pair 111/113. The operation 200 begins in block 205 with the system 100 operating in a normal lock step fashion. In block 210, the processor 111 detects an error event that indicates an impending loss of lock step. In block 215, the processor 111 signals the node controller 130 that the first processor pair 111/113 has broken lock step and that the first processor pair 111/113 should be taken "off-line." In block 220, the node controller 130 copies the architected state of the first processor pair 111/113 to the hot standby processor pair 125/127. In an embodiment, the architected state of the first processor pair 111/113 may be stored in the node controller 130, and to facilitate recovery, the node controller 130 copies the stored state to the third processor pair 125/127. Alternatively, the node controller 130 may copy the state of the first processor pair 111/113 directly from the processors 111 and 113 to the processors 125 and 127 without any intermediate storage of the architected state in the node controller 130, or other component of the computer system 100. The processor pair 125/127 then becomes the logical CPU 0 in the computer system 100, and the computer system 100 operates without a hot standby processor pair. In block 225, recovery actions are executed on the first processor pair 111/113 (e.g., all caches are flushed on the processors 111 and 113). In block 230, the node controller 130 "reboots" the processors 111 and 113, and the processors 111 and 113 become the new "hot standby" processor pair on the system 100. In block 235, the operation 200 ends, with the computer system 100 operating the processors 121, 123, 125, and 127 in lock step, and with the processors 111/113 idle and in hot standby.
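By way of illustration only, the blocks of the process 200 may be traced with the Python sketch below. The class System and its attributes are hypothetical, processor pairs are identified by strings such as "111/113", and the architected state is assumed to be a dictionary. Cache flushing and rebooting (blocks 225 and 230) are reduced to clearing the state of the failed pair.

```python
class System:
    """Toy model of the computer system of Figure 3."""

    def __init__(self, logical_cpus, standby):
        self.logical_cpus = dict(logical_cpus)  # logical CPU id -> pair name
        self.standby = standby                  # idle pair name, or None
        self.state = {pair: {} for pair in self.logical_cpus.values()}
        self.state[standby] = {}

    def recover(self, cpu_id):
        # Blocks 210/215: the pair behind this logical CPU has signaled an
        # impending loss of lock step and asked to be taken off line.
        failed = self.logical_cpus[cpu_id]
        if self.standby is None:
            raise RuntimeError("no hot standby pair available")
        spare, self.standby = self.standby, None
        # Block 220: copy the architected state to the hot standby pair,
        # which then becomes the logical CPU.
        self.state[spare] = dict(self.state[failed])
        self.logical_cpus[cpu_id] = spare
        # Blocks 225/230: recovery actions and reboot of the failed pair,
        # which then becomes the new hot standby.
        self.state[failed] = {}
        self.standby = failed
        return spare
```

For example, with `System({0: "111/113", 1: "121/123"}, "125/127")`, calling `recover(0)` makes the pair 125/127 the logical CPU 0 and leaves the pair 111/113 as the hot standby, matching the end state of block 235.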
The terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the spirit and scope of the invention as defined in the following claims, and their equivalents, in which all terms are to be understood in their broadest possible sense unless otherwise indicated.

Claims (11)

1. An apparatus to recover from a loss of lock step in a multiprocessor computer system, wherein two or more processor units (111, 113; 121, 123; 125, 127) operate in lock step, comprising:
in each of the two or more processor units operating in lock step:
means for detecting a loss of lock step initiating event, and
means for signaling an impending loss of lock step; and
means for moving an architected state of a processor unit having the loss of lock step initiating event to a separate processor unit, the spare processor unit operating idle in lock step.
2. The apparatus of claim 1, further comprising means for taking the loss of lock step processor unit off line.
3. The apparatus of claim 2, further comprising means for rebooting the loss of lock step processor unit, whereby the rebooted processor unit is designated as a new spare processor unit.
4. The apparatus of claim 1, wherein the moving means comprises:
means for copying the architected state; and
means for storing the architected state.
5. The apparatus of claim 4, wherein the means for storing the architected state comprises a node controller (130) coupled to the two or more processor units and the spare processor unit.
6. The apparatus of claim 1, wherein the moving means comprises means for copying the architected state directly from the loss of lock step processor unit to the spare processor unit.
7. A method for recovering from a loss of lock step operation in a multi-processor computer system, comprising:
detecting (210) a loss of lock step initiating event in a processor;
signaling (215) an impending loss of lock step;
moving (220) an architected state of the loss of lock step processor to a spare processor; and
idling the loss of lock step processor.
8. The method of claim 7, further comprising:
correcting (225) the loss of lock step event in the loss of lock step processor; and
rebooting (230) the loss of lock step processor, whereby the rebooted processor becomes a new spare processor.
9. The method of claim 7, wherein moving the architected state comprises copying the architected state from a node controller that couples the loss of lock step processor unit and the spare processor.
10. The method of claim 7, wherein moving the architected state comprises copying the architected state directly from the loss of lock step processor unit to the spare processor.
11. The method of claim 7, wherein moving the architected state comprises:
copying the architected state from the loss of lock step processor unit to a node controller; and
copying the architected state from the node controller to the spare processor.
GB0509528A 2002-07-03 2003-06-30 Method and apparatus for recovery from loss of lock step Expired - Fee Related GB2411270B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/187,833 US7085959B2 (en) 2002-07-03 2002-07-03 Method and apparatus for recovery from loss of lock step
GB0315295A GB2392520B (en) 2002-07-03 2003-06-30 Method and apparatus for recovery from loss of lock step

Publications (3)

Publication Number Publication Date
GB0509528D0 (en) 2005-06-15
GB2411270A (en) 2005-08-24
GB2411270B (en) 2005-12-21

Family

ID=35447633

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0509528A Expired - Fee Related GB2411270B (en) 2002-07-03 2003-06-30 Method and apparatus for recovery from loss of lock step

Country Status (1)

Country Link
GB (1) GB2411270B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4589066A (en) * 1984-05-31 1986-05-13 General Electric Company Fault tolerant, frame synchronization for multiple processor systems
US6263452B1 (en) * 1989-12-22 2001-07-17 Compaq Computer Corporation Fault-tolerant computer system with online recovery and reintegration of redundant components

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IEEE Journal of Solid-State Circuits, Vol 27, No. 1, January 1992 (USA), Y Tamir, "Self-checking self-repairing computer nodes using the Mirror Processor", pages 4 to 16 *

Also Published As

Publication number Publication date
GB2411270B (en) 2005-12-21
GB0509528D0 (en) 2005-06-15

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20080630