WO2002084490A2 - Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step - Google Patents
- Publication number
- WO2002084490A2 (application PCT/US2002/011563)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- redundant processors
- combined
- operations
- store
- system memory
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/18—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
- G06F11/183—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
- G06F11/184—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1641—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1641—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
- G06F11/1645—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components and the comparison itself uses redundant hardware
Definitions
- The present invention relates to the design of multiprocessor systems. More specifically, the present invention relates to a method and an apparatus for facilitating fault-tolerance by comparing addresses and data from redundant processors running in lock-step.
- Error-correcting codes can be employed to correct transient errors that occur when data is stored into memory.
- Error-correcting codes cannot correct all types of errors, and furthermore, the associated circuitry to detect and correct errors is impractical to deploy in extremely time-critical computational circuitry within a microprocessor.
- Transient errors can also be detected and/or corrected by replicating a computer system so that there exist two or more copies of the computer system concurrently executing the same code. This allows transient errors to be detected by periodically comparing results produced by these replicated computer systems.
- Transient errors can be corrected in a replicated computer system by voting. If there are three or more replicated computer systems and an error is detected, the computer systems can vote to determine which result is correct. For example, in a three-computer system, if two of the three computers produce the same result, this result is presumed to be the correct answer if the other computer system produces a different result.
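The voting rule described above can be sketched in a few lines. This is a minimal illustration rather than anything taken from the patent text: the function name `vote` and the choice to return `None` when no strict majority exists are assumptions made for the example.

```python
from collections import Counter


def vote(results):
    """Return the result shared by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # More than half of the replicas must agree for the result to be trusted.
    return value if count * 2 > len(results) else None


# Three replicated computers; the third one suffered a transient error.
print(vote([42, 42, 41]))
# All three disagree, so no result can be presumed correct.
print(vote([1, 2, 3]))
```

With only two replicas, a disagreement can be detected but not corrected, because neither result forms a majority.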
- A common multiprocessor design includes a number of processors 151-154 with a number of level one (L1) caches 161-164 that share a single level two (L2) cache 180 and a memory 183 (see FIG. 1).
- On a miss in L1 cache 161, the system attempts to retrieve the data item from L2 cache 180. If the data item is not present in L2 cache 180, the system first retrieves the data item from memory 183 into L2 cache 180 and then from L2 cache 180 into L1 cache 161.
- A coherency protocol typically ensures that if one copy of a data item is modified in L1 cache 161, other copies of the same data item in L1 caches 162-164, in L2 cache 180 and in memory 183 are updated or invalidated to reflect the modification. This is accomplished by broadcasting an invalidation message across bus 170.
- This type of coherency mechanism can cause replicated processors to have different state in their local L1 caches. For example, if a first replicated processor updates a data item in its L1 cache, it may cause the same data item to be invalidated in the L1 cache of a second replicated processor. In this case, the L1 cache of the first replicated processor ends up in a different state than the L1 cache of the second replicated processor.
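The divergence described above can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: each L1 cache is modeled as a Python dictionary, and the helper `store_with_invalidate` and the replica names are hypothetical, not part of the patent.

```python
def store_with_invalidate(caches, writer, address, value):
    """Write into the writer's L1 and invalidate the line in every other L1."""
    caches[writer][address] = value
    for cpu, cache in caches.items():
        if cpu != writer:
            # Models the invalidation message broadcast by the coherency protocol.
            cache.pop(address, None)


# Two replicated processors start with identical L1 contents.
caches = {"replica_a": {0x40: 1}, "replica_b": {0x40: 1}}

# The first replica's store reaches the coherency mechanism first.
store_with_invalidate(caches, "replica_a", 0x40, 2)

# The replicas now hold different L1 state even though they ran the same code.
diverged = caches["replica_a"] != caches["replica_b"]
print(diverged)
```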
- One embodiment of the present invention provides a system that facilitates fault-tolerance by using redundant processors.
- This system operates by receiving store operations from a plurality of redundant processors running the same code in lockstep. The system compares the store operations to determine if the store operations are identical, thereby indicating that the redundant processors are operating correctly. If the store operations are identical, the system combines the store operations into a combined store operation, and forwards the combined store operation to a system memory that is shared between the redundant processors. If the store operations are not identical, the system indicates an error condition.
- Handling the error condition involves setting an error flag in a data word for the combined store operation, and forwarding the combined store operation to the system memory.
- The system determines whether a majority of store operations are identical. If so, the system combines the majority into a combined store operation, and forwards the combined store operation to the system memory. If no majority exists, the system sets an error flag in a data word for the combined store operation, and forwards the combined store operation to the system memory.
- The system additionally receives load operations from the redundant processors.
- The system compares the load operations to determine if the load operations are identical, thereby indicating that the redundant processors are operating correctly. If the load operations are identical, the system combines the load operations into a combined load operation, and forwards the combined load operation to the system memory that is shared between the redundant processors. Next, the system receives a return data value from the system memory, and broadcasts the return data value to the redundant processors. If the load operations are not identical, the system indicates an error condition.
- The system receives the return data value at one of the redundant processors.
- This redundant processor examines an error flag in the return data value. If the error flag is set, the processor traps to an error handling routine.
- System memory includes a lower-level cache memory.
- The system additionally receives invalidation messages from the plurality of redundant processors. These invalidation messages indicate that a specific cache line should be invalidated. The system combines these invalidation messages into a combined invalidation message, and communicates the combined invalidation message to other processors in the computer system.
- The system additionally receives an invalidation message indicating that a specific cache line should be invalidated.
- This invalidation message is generated as a result of actions of another processor that is not one of the redundant processors.
- The system broadcasts this invalidation message to the redundant processors.
- FIG. 1A illustrates a multiprocessor system.
- FIG. 1B illustrates a multiprocessor system in accordance with an embodiment of the present invention.
- FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1B in accordance with an embodiment of the present invention.
- FIG. 3 illustrates the structure of a switch in accordance with an embodiment of the present invention.
- FIG. 4A illustrates a duplex configuration of a multiprocessor system in accordance with an embodiment of the present invention.
- FIG. 4B illustrates a triple modular redundancy (TMR) configuration of a multiprocessor system in accordance with an embodiment of the present invention.
- FIG. 5 is a flow chart illustrating a store operation in accordance with an embodiment of the present invention.
- FIG. 6 is a flow chart illustrating a load operation in accordance with an embodiment of the present invention.
- FIG. 7 is a flow chart illustrating the process of sending an invalidation operation in accordance with an embodiment of the present invention.
- FIG. 8 is a flow chart illustrating the process of receiving an invalidation operation in accordance with an embodiment of the present invention.
- FIG. 9 is a flow chart illustrating the process of handling a load value with an error set in accordance with an embodiment of the present invention.
- FIG. 1B illustrates a multiprocessor system 100 in accordance with an embodiment of the present invention.
- Semiconductor chip 101 includes a number of processors 110, 120, 130 and 140, which contain level one (L1) caches 112, 122, 132 and 142, respectively.
- L1 caches 112, 122, 132 and 142 may be separate instruction and data caches, or alternatively, unified instruction/data caches.
- L1 caches 112, 122, 132 and 142 are coupled to level two (L2) cache 106.
- L2 cache 106 is coupled to off-chip memory 102 through memory controller 104.
- L1 caches 112, 122, 132 and 142 are write-through caches, which means that all updates to L1 caches 112, 122, 132 and 142 are automatically propagated to L2 cache 106.
- L2 cache 106 is an "inclusive cache", which means that all items in L1 caches 112, 122, 132 and 142 are included in L2 cache 106.
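The write-through and inclusion properties can be illustrated with a small model. This is a sketch under assumptions, not the patented hardware: the class name `CacheHierarchy`, the dictionary-based caches, and the write-allocate behavior on stores are all hypothetical choices, and coherence between the L1 caches is deliberately left out.

```python
class CacheHierarchy:
    """Toy model: several write-through L1 caches in front of an inclusive L2."""

    def __init__(self, num_l1, memory):
        self.l1 = [dict() for _ in range(num_l1)]
        self.l2 = {}
        self.memory = dict(memory)

    def load(self, cpu, address):
        if address not in self.l1[cpu]:
            if address not in self.l2:
                # Fill L2 from memory first, which preserves inclusion.
                self.l2[address] = self.memory[address]
            self.l1[cpu][address] = self.l2[address]
        return self.l1[cpu][address]

    def store(self, cpu, address, value):
        self.l1[cpu][address] = value
        # Write-through: every L1 update is propagated to L2 immediately.
        self.l2[address] = value

    def is_inclusive(self):
        """True if every line held in any L1 is also present in L2."""
        return all(addr in self.l2 for l1 in self.l1 for addr in l1)


hierarchy = CacheHierarchy(num_l1=4, memory={0x10: 5})
hierarchy.load(0, 0x10)      # miss in L1 and L2, filled from memory
hierarchy.store(1, 0x20, 9)  # propagated to L2 by write-through
print(hierarchy.is_inclusive())
```

Because L2 always holds the latest value of every line in any L1, comparing the traffic that flows from the L1 caches into L2 is enough to observe every store a processor makes.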
- FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1B in accordance with an embodiment of the present invention.
- L2 cache 106 is implemented with four banks 202-205, which can be accessed in parallel by processors 110, 120, 130 and 140 through switches 215 and 216.
- Switch 215 handles communications that feed from processors 110, 120, 130 and 140 into L2 banks 202-205, while switch 216 handles communications in the reverse direction from L2 banks 202-205 to processors 110, 120, 130 and 140. Note that only two bits of the address are required to determine which of the four banks 202-205 a memory request is directed to.
- Switch 215 additionally includes an I/O port 150 for receiving communications from I/O devices, and switch 216 includes an I/O port 152 for sending communications to I/O devices. Note that by using this "banked" architecture, it is possible to concurrently connect each L1 cache to its own bank of L2 cache, thereby increasing the bandwidth of L2 cache 106.
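Selecting one of four banks from two address bits can be sketched as follows. The text does not say which two bits are used, so the sketch assumes 64-byte cache lines and takes the two bits immediately above the line offset; `LINE_OFFSET_BITS` and `bank_index` are hypothetical names introduced for the example.

```python
LINE_OFFSET_BITS = 6  # assumption: 64-byte cache lines
NUM_BANKS = 4         # banks 202-205


def bank_index(address):
    """Return which of the four L2 banks (0-3) an address maps to."""
    # Drop the offset within the line, then keep the low two bits.
    return (address >> LINE_OFFSET_BITS) & (NUM_BANKS - 1)


# Consecutive cache lines are spread across consecutive banks.
for address in (0x000, 0x040, 0x080, 0x0C0, 0x100):
    print(hex(address), bank_index(address))
```

Interleaving at line granularity is one common choice because it lets requests to neighboring lines proceed in parallel on different banks.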
- FIG. 3 illustrates the structure of switch 215 in accordance with an embodiment of the present invention.
- Switch 215 includes a number of inputs 301-304, which are coupled to processors 110, 120, 130 and 140, respectively.
- Switch 215 also includes a number of outputs 311-314, which are coupled to L2 banks 202-205, respectively. Note that each of these inputs 301-304 and outputs 311-314 represents multiple data lines.
- Switch 215 contains a number of multiplexers 321-324. Each of these multiplexers 321-324 has an input queue for each of the inputs 301-304. For example, multiplexer 321 is coupled to four queues, each one of which is coupled to one of the inputs 301-304. Comparator 331 performs comparison operations between values stored in the input queues to facilitate fault tolerance. In the system configuration illustrated in FIG. 4A, comparator 331 compares pairs of inputs 301-302 and 303-304. In another configuration illustrated in FIG. 4B, the comparator circuit facilitates voting between three or more inputs 301-304 to determine if a majority of the inputs match. The output of comparator 331 feeds into arbitration circuit 341. Arbitration circuit 341 causes an entry from one of the input queues to be routed to output 311 through multiplexer 321.
- Broadcast switch 350 includes a number of pass gates 351-354, which can selectively couple an output of a multiplexer to a neighboring output. For example, if pass gate 352 is transparent and the output of multiplexer 322 is disabled, the output of multiplexer 321 is broadcast onto outputs 311 and 312. Note that in general there are many possible ways to implement broadcast switch 350. The only requirement is that broadcast switch 350 should be able to broadcast the output of any one of multiplexers 321-324 to multiple outputs 311-314.
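One way to picture the pass gates is as a chain in which a disabled output copies its left-hand neighbor. The sketch below is a behavioral model under assumptions, not a circuit description: the function `broadcast` is hypothetical, list index 0 stands for multiplexer 321 and output 311, `gate_transparent[i]` stands for the gate coupling output `i` to output `i - 1`, and the role of pass gate 351 is not modeled because the text does not describe it.

```python
def broadcast(mux_outputs, gate_transparent):
    """Resolve the four switch outputs.

    mux_outputs[i] is the value driven by multiplexer i, or None if that
    multiplexer's output is disabled. gate_transparent[i] couples output i
    to output i - 1 (index 0 is unused in this model).
    """
    outputs = []
    for i, value in enumerate(mux_outputs):
        if value is None and i > 0 and gate_transparent[i]:
            # A disabled output takes whatever its neighbor carries.
            value = outputs[i - 1]
        outputs.append(value)
    return outputs


# Example from the text: gate 352 transparent, multiplexer 322 disabled.
print(broadcast(["A", None, "C", "D"], [False, True, False, False]))

# Two transparent gates in a row carry one value across three outputs.
print(broadcast(["A", None, None, "D"], [False, True, True, False]))
```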
- The structure of switch 216 is identical to that of switch 215, except that inputs 301-304 are coupled to L2 banks 202-205, respectively, and outputs 311-314 are coupled to processors 110, 120, 130 and 140, respectively.
- FIG. 4A illustrates a duplex configuration of a multiprocessor system in accordance with an embodiment of the present invention.
- Processors 110 and 120 form a pair of redundant processors that execute the same code in lockstep.
- Store operations through switch 215 are compared to ensure that data values and store addresses from processors 110 and 120 agree.
- Load operations through switch 215 are similarly compared to ensure that the load addresses agree. If not, an error is indicated.
- Processors 130 and 140 form another pair of redundant processors, and switch 215 compares store operations from these processors.
- the configuration illustrated in FIG. 4A is achieved by initializing processors
- Comparators 331-334 are configured so that inputs 301 and 302 are always compared against each other and inputs 303 and 304 are always compared against each other.
- The outputs of comparators 332 and 334 are disabled and pass gates 352 and 354 are made transparent. This ensures that the output of multiplexer 321 is broadcast to outputs 311 and 312, which are coupled to processors 110 and 120. It also ensures that the output of multiplexer 323 is broadcast to outputs 313 and 314, which are coupled to processors 130 and 140.
- In the configuration illustrated in FIG. 4B, processors 110, 120 and 130 execute the same code in lockstep.
- Processor 140 is a spare processor (which may also be executing the same code in lockstep to facilitate rapid replacement).
- Store operations generated by processors 110, 120 and 130 are compared at switch 215. If they do not agree, the system performs a voting operation to determine if two of the three store operations agree. If so, the store operations that agree are taken to be the correct store operation, and the other store operation is presumed to be erroneous and is ignored.
- the configuration illustrated in FIG. 4B is achieved by initializing processors
- Comparators 331-334 are configured so that inputs 301-303 are always compared against each other, and so that the majority wins.
- In switch 216, the outputs of comparators 332-334 are disabled and pass gates 352 and 353 are made transparent. This ensures that the output of multiplexer 321 is broadcast to outputs 311-313, which are coupled to processors 110, 120 and 130, respectively.
- One embodiment of the present invention can be selectively reconfigured between the configuration illustrated in FIG. 4A and the configuration illustrated in FIG. 4B during a system boot operation.
- FIG. 5 is a flow chart illustrating a store operation in accordance with an embodiment of the present invention.
- The system starts when switch 215 receives store operations from redundant processors running in lock step (step 502).
- The system compares these store operations by using one of comparators 331-334 (step 504). If these store operations are identical, the processors are presumably operating properly.
- In this case, the system combines the store operations into a single store operation (step 507), and forwards the combined store operation to system memory (or L2 cache) (step 509). Note that combining store operations involves passing only one instance of the store operation to switch 215 and ignoring the other instances.
- If the store operations are not identical, the system sets an error flag in the data word for the store operation (step 508), and forwards the store operation to system memory (step 510). Note that if there is an error, it does not matter what the data value of the store operation is set to. The system only has to ensure that the error flag is set.
- Where three or more redundant processors are compared, the system determines whether a majority of the store operations are identical (step 512). If not, the system sets an error flag in the data word for the store operation (step 508), and forwards the store operation to memory (step 510).
- If a majority exists, the system combines the majority into a combined store operation (by ignoring all but one instance of the identical store operations) (step 514), and forwards the combined store operation to memory (step 509).
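The store path of FIG. 5 can be sketched as a single function. This is an illustrative model only: the `Store` and `CombinedStore` types and the function `combine_stores` are hypothetical names, and using the first processor's address and data when the error flag is set is an assumption (the text says only that the data value does not matter in that case).

```python
from collections import Counter
from typing import NamedTuple


class Store(NamedTuple):
    address: int
    data: int


class CombinedStore(NamedTuple):
    address: int
    data: int
    error: bool  # models the error flag carried in the data word


def combine_stores(stores):
    """Compare redundant stores and produce the one forwarded to memory."""
    winner, count = Counter(stores).most_common(1)[0]
    if count * 2 > len(stores):
        # Covers both cases: all stores identical, or a strict majority agrees.
        return CombinedStore(winner.address, winner.data, False)
    # No majority: forward a store with the error flag set (steps 508, 510).
    first = stores[0]
    return CombinedStore(first.address, first.data, True)


# Duplex pair in agreement.
print(combine_stores([Store(0x100, 7), Store(0x100, 7)]))
# Duplex pair in disagreement: flagged, since two inputs cannot form a majority.
print(combine_stores([Store(0x100, 7), Store(0x100, 8)]))
# Three processors with one faulty address: the majority is forwarded.
print(combine_stores([Store(0x100, 7), Store(0x100, 7), Store(0x104, 7)]))
```

Note that both the address and the data take part in the comparison, so a processor that computes the right value but the wrong address is still outvoted.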
- FIG. 6 is a flow chart illustrating a load operation in accordance with an embodiment of the present invention.
- The system starts when switch 215 receives load operations from redundant processors running in lock step (step 602).
- The system compares these load operations by using one of comparators 331-334 (step 604). If these load operations are identical, the processors are presumably operating properly.
- In this case, the system combines the load operations into a single load operation (step 607), and forwards the combined load operation to system memory (or L2 cache) (step 609). Note that combining load operations involves passing only one instance of the load operation to switch 215 and ignoring the other instances.
- Where three or more redundant processors are compared, the system determines whether a majority of the load operations are identical (step 612). If not, the system generates an error condition (step 608).
- If a majority exists, the system combines the majority into a combined load operation (by ignoring all but one instance of the identical load operations) (step 614), and forwards the combined load operation to memory (step 609).
- Switch 216 receives a return value for the load operation from system memory (step 618).
- Switch 216 uses pass gates 351-354 to broadcast the return value to the redundant processors (step 620).
- The receiving processor checks the error flag in the return value (step 904). If the error flag is set, the processor traps to an error handling routine (step 906). Otherwise, the processor handles the load normally (step 908).
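The load path, together with the error-flag check on the returned word, can be sketched in the same style. Everything named here is hypothetical: `combine_loads`, `redundant_load` and `handle_load` are illustrative functions, memory is modeled as a dictionary of `(data, error_flag)` pairs, and the two exception classes stand in for the unspecified "error condition" and for the trap to an error handling routine.

```python
from collections import Counter


class LoadMismatch(Exception):
    """Stands in for the error condition raised when load addresses disagree."""


class ErrorFlagTrap(Exception):
    """Stands in for the trap taken when a loaded word has its error flag set."""


def combine_loads(addresses):
    """Return the address agreed on by a strict majority of processors."""
    winner, count = Counter(addresses).most_common(1)[0]
    if count * 2 <= len(addresses):
        raise LoadMismatch(f"no majority among {addresses}")
    return winner


def redundant_load(addresses, memory):
    """Issue one combined load and broadcast the returned word to every processor."""
    word = memory[combine_loads(addresses)]
    return [word] * len(addresses)


def handle_load(word):
    """Processor-side check of the error flag carried with the data."""
    data, error_flag = word
    if error_flag:
        raise ErrorFlagTrap("loaded word was written by a failed store comparison")
    return data


memory = {0x100: (7, False), 0x104: (0, True)}
words = redundant_load([0x100, 0x100], memory)
print([handle_load(word) for word in words])
```

Carrying the flag in the stored word means a store mismatch is reported only when some processor later consumes the suspect data.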
- FIG. 7 is a flow chart illustrating the process of sending an invalidation operation to support cache coherence in accordance with an embodiment of the present invention.
- Switch 215 receives multiple invalidation messages from redundant processors executing the same code (step 702). These invalidation messages are identical, and they indicate that a data item is updated in the local caches of the redundant processors.
- Switch 215 combines these invalidation messages into a combined invalidation message (step 704), which can be accomplished by ignoring all but one of the invalidation messages.
- FIG. 8 is a flow chart illustrating the process of receiving an invalidation operation in accordance with an embodiment of the present invention.
- Switch 216 receives an invalidation message caused by a processor that is not part of the set of redundant processors (step 802).
- Switch 216 then broadcasts the invalidation message to the set of redundant processors (step 804).
- For example, in the duplex configuration described above, when switch 216 receives a single invalidation message generated by redundant processors 110 and 120, switch 216 broadcasts this invalidation message to processors 130 and 140. In this case, the same cache line is invalidated in both processor caches, thereby keeping the state within processors 130 and 140 identical.
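The sending and receiving sides of invalidation handling can be sketched together. This is a toy model under assumptions: the function names are hypothetical, an invalidation message is reduced to the address of the affected cache line, and each L1 cache is a dictionary.

```python
def combine_invalidations(messages):
    """Step 704: the messages are identical, so keep one and ignore the rest."""
    return messages[0]


def broadcast_invalidation(line, redundant_l1_caches):
    """Step 804: invalidate the same line in every redundant L1 cache."""
    for cache in redundant_l1_caches:
        cache.pop(line, None)


# Processors 110 and 120 (one redundant pair) both announce a write to line 0x40.
outgoing = combine_invalidations([0x40, 0x40])

# Processors 130 and 140 (the other pair) start with identical L1 contents.
l1_130 = {0x40: 1, 0x80: 2}
l1_140 = {0x40: 1, 0x80: 2}
broadcast_invalidation(outgoing, [l1_130, l1_140])

# Both caches lost the same line, so the pair stays in identical state.
print(l1_130 == l1_140)
```

Delivering the message to every member of a redundant set at once is what avoids the L1 state divergence that an ordinary coherency mechanism can introduce between replicas.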
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Hardware Redundancy (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2002582363A JP3972983B2 (en) | 2001-04-13 | 2002-04-11 | Method for providing fault tolerance by comparing addresses and data from redundant processors operating in lockstep |
DE60212115T DE60212115D1 (en) | 2001-04-13 | 2002-04-11 | PROVIDE ERROR TOLERANCE FOR ADDRESS AND DATA COMPARISON OF LOCK-STEP REDUNDANT PROCESSORS |
AU2002252647A AU2002252647A1 (en) | 2001-04-13 | 2002-04-11 | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
KR1020037013188A KR100842637B1 (en) | 2001-04-13 | 2002-04-11 | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
EP02721731A EP1379951B1 (en) | 2001-04-13 | 2002-04-11 | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US28359801P | 2001-04-13 | 2001-04-13 | |
US60/283,598 | 2001-04-13 | ||
US10/061,522 | 2002-01-31 | ||
US10/061,522 US6862693B2 (en) | 2001-04-13 | 2002-01-31 | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2002084490A2 (en) | 2002-10-24 |
WO2002084490A3 (en) | 2003-08-14 |
Family
ID=26741162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2002/011563 WO2002084490A2 (en) | 2001-04-13 | 2002-04-11 | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
Country Status (7)
Country | Link |
---|---|
US (1) | US6862693B2 (en) |
EP (1) | EP1379951B1 (en) |
JP (1) | JP3972983B2 (en) |
KR (1) | KR100842637B1 (en) |
AU (1) | AU2002252647A1 (en) |
DE (1) | DE60212115D1 (en) |
WO (1) | WO2002084490A2 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4456952A (en) * | 1977-03-17 | 1984-06-26 | Honeywell Information Systems Inc. | Data processing system having redundant control processors for fault detection |
US5901281A (en) * | 1991-01-25 | 1999-05-04 | Hitachi, Ltd. | Processing unit for a computer and a computer system incorporating such a processing unit |
US5903717A (en) * | 1997-04-02 | 1999-05-11 | General Dynamics Information Systems, Inc. | Fault tolerant computer system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4868851A (en) * | 1988-01-26 | 1989-09-19 | Harris Corporation | Signal processing apparatus and method |
US5058053A (en) * | 1988-03-31 | 1991-10-15 | International Business Machines Corporation | High performance computer system with unidirectional information flow |
US4965717A (en) * | 1988-12-09 | 1990-10-23 | Tandem Computers Incorporated | Multiple processor system having shared memory with private-write capability |
US5226152A (en) * | 1990-12-07 | 1993-07-06 | Motorola, Inc. | Functional lockstep arrangement for redundant processors |
US5623449A (en) * | 1995-08-11 | 1997-04-22 | Lucent Technologies Inc. | Flag detection for first-in-first-out memories |
US6854075B2 (en) * | 2000-04-19 | 2005-02-08 | Hewlett-Packard Development Company, L.P. | Simultaneous and redundantly threaded processor store instruction comparator |
2002
- 2002-01-31 US US10/061,522 patent/US6862693B2/en not_active Expired - Lifetime
- 2002-04-11 WO PCT/US2002/011563 patent/WO2002084490A2/en active IP Right Grant
- 2002-04-11 JP JP2002582363A patent/JP3972983B2/en not_active Expired - Lifetime
- 2002-04-11 KR KR1020037013188A patent/KR100842637B1/en active IP Right Grant
- 2002-04-11 AU AU2002252647A patent/AU2002252647A1/en not_active Abandoned
- 2002-04-11 DE DE60212115T patent/DE60212115D1/en not_active Expired - Lifetime
- 2002-04-11 EP EP02721731A patent/EP1379951B1/en not_active Expired - Lifetime
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7237144B2 (en) | 2004-04-06 | 2007-06-26 | Hewlett-Packard Development Company, L.P. | Off-chip lockstep checking |
US7287185B2 (en) | 2004-04-06 | 2007-10-23 | Hewlett-Packard Development Company, L.P. | Architectural support for selective use of high-reliability mode in a computer system |
US7290169B2 (en) | 2004-04-06 | 2007-10-30 | Hewlett-Packard Development Company, L.P. | Core-level processor lockstepping |
US7296181B2 (en) | 2004-04-06 | 2007-11-13 | Hewlett-Packard Development Company, L.P. | Lockstep error signaling |
Also Published As
Publication number | Publication date |
---|---|
AU2002252647A1 (en) | 2002-10-28 |
US20020152420A1 (en) | 2002-10-17 |
EP1379951B1 (en) | 2006-06-07 |
WO2002084490A3 (en) | 2003-08-14 |
EP1379951A2 (en) | 2004-01-14 |
JP2005512162A (en) | 2005-04-28 |
US6862693B2 (en) | 2005-03-01 |
KR100842637B1 (en) | 2008-06-30 |
JP3972983B2 (en) | 2007-09-05 |
DE60212115D1 (en) | 2006-07-20 |
KR20040063794A (en) | 2004-07-14 |
Similar Documents
Publication | Title |
---|---|
EP1379951B1 (en) | Providing fault-tolerance by comparing addresses and data from redundant processors running in lock-step |
US7055060B2 (en) | On-die mechanism for high-reliability processor |
EP2425330B1 (en) | Reliable execution using compare and transfer instruction on an SMT machine |
US7145837B2 (en) | Global recovery for time of day synchronization | |
US7668923B2 (en) | Master-slave adapter | |
US7415630B2 (en) | Cache coherency during resynchronization of self-correcting computer | |
JP2500038B2 (en) | Multiprocessor computer system, fault tolerant processing method and data processing system | |
US20050091383A1 (en) | Efficient zero copy transfer of messages between nodes in a data processing system | |
US8671311B2 (en) | Multiprocessor switch with selective pairing | |
US20050080869A1 (en) | Transferring message packets from a first node to a plurality of nodes in broadcast fashion via direct memory to memory transfer | |
US8196027B2 (en) | Method and device for comparing data in a computer system having at least two execution units | |
US20050080920A1 (en) | Interpartition control facility for processing commands that effectuate direct memory to memory information transfer | |
US20050080945A1 (en) | Transferring message packets from data continued in disparate areas of source memory via preloading | |
US7194671B2 (en) | Mechanism handling race conditions in FRC-enabled processors | |
US5276862A (en) | Safestore frame implementation in a central processor | |
US8201067B2 (en) | Processor error checking for instruction data | |
US5557737A (en) | Automated safestore stack generation and recovery in a fault tolerant central processor | |
US20050078708A1 (en) | Formatting packet headers in a communications adapter | |
US20200348985A1 (en) | Cluster of processing elements having split mode and lock mode | |
US6915450B2 (en) | Method and apparatus for arbitrating transactions between domains in a computer system | |
Falih Mahmood | A Pipelined Fault Tolerant Architecture for Real time DSP Applications | |
JPH0458329A (en) | Arithmetic processor | |
Rim et al. | An architecture for high availability multi-user systems | |
Kato | A high dependability computer attainable by a tracking redundancy scheme | |
JPH01147755A (en) | Memory access control device |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A2. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
WWE | WIPO information: entry into national phase | Ref document number: 1020037013188. Country of ref document: KR |
WWE | WIPO information: entry into national phase | Ref document number: 2002582363. Country of ref document: JP |
WWE | WIPO information: entry into national phase | Ref document number: 2002721731. Country of ref document: EP |
WWP | WIPO information: published in national office | Ref document number: 2002721731. Country of ref document: EP |
REG | Reference to national code | Ref country code: DE. Ref legal event code: 8642 |
WWG | WIPO information: grant in national office | Ref document number: 2002721731. Country of ref document: EP |