US20050240806A1 - Diagnostic memory dump method in a redundant processor - Google Patents

Diagnostic memory dump method in a redundant processor

Info

Publication number
US20050240806A1
Authority
US
United States
Prior art keywords
processor
memory
processor element
dump
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/953,242
Other languages
English (en)
Inventor
William Bruckert
James Klecka
James Smullen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP
Priority to US10/953,242
Assigned to HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. reassignment HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KLECKA, JAMES S., BRUCKERT, WILLIAM F., SMULLEN, JAMES R.
Priority to CN 200510107155 (patent CN1755660B)
Publication of US20050240806A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1658Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1629Error detection by comparing the output of redundant processing systems
    • G06F11/165Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1675Temporal synchronisation or re-synchronisation of redundant processing components
    • G06F11/1687Temporal synchronisation or re-synchronisation of redundant processing components at event level, e.g. by interrupt or result of polling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3404Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for parallel or distributed programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3495Performance evaluation by tracing or monitoring for systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/362Software debugging
    • G06F11/3636Software debugging by tracing the execution of the program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/362Software debugging
    • G06F11/366Software debugging using diagnostics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1629Error detection by comparing the output of redundant processing systems
    • G06F11/1641Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1629Error detection by comparing the output of redundant processing systems
    • G06F11/1641Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • G06F11/1645Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components and the comparison itself uses redundant hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/1675Temporal synchronisation or re-synchronisation of redundant processing components
    • G06F11/1683Temporal synchronisation or re-synchronisation of redundant processing components at instruction level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/18Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G06F11/183Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
    • G06F11/184Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/18Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G06F11/183Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
    • G06F11/184Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality
    • G06F11/185Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality and the voting is itself performed redundantly
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/88Monitoring involving counting

Definitions

  • a useful tool for diagnosing system difficulties in a computing system is a memory dump, an output file generated by an operating system during a failure for usage in determining the cause of the failure.
  • the stored debug information may be reduced to cover only the operating system or kernel level memory, allowing analysis of nearly all kernel-level system errors.
  • the kernel-level system dump remains large enough to compromise availability.
  • An even smaller memory dump may be acquired to cover only the smallest amount of base-level debugging information, typically sufficient only to identify a problem.
  • a plurality of redundant, loosely-coupled processor elements are operational as a logical processor.
  • a logic detects a halt condition of the logical processor and, in response to the halt condition, reintegrates and commences operation in less than all of the processor elements leaving at least one processor element nonoperational.
  • the logic also buffers data from the nonoperational processor element in the reloaded operational processor elements and writes the buffered data to storage for analysis.
  • FIG. 1 is a schematic block diagram depicting an embodiment of a computing system that includes a plurality of redundant, loosely-coupled processor elements arranged and operational as a logical processor;
  • FIG. 2 is a flow chart that illustrates an embodiment of a sequence of operations for performing an asymmetric memory dump operation
  • FIG. 3 is a flow chart showing an alternative embodiment of a method for performing a diagnostic memory dump operation
  • FIGS. 4A, 4B, and 4C are schematic block diagrams respectively showing an embodiment of a computer system
  • FIG. 5 is a schematic block diagram depicting another embodiment of a synchronization unit
  • FIG. 6 is a schematic block diagram showing a functional view of three processor slices operating in duplex mode with one processor slice omitted from running the operating system;
  • FIG. 7 is a block timing diagram that illustrates an embodiment of the technique for performing a diagnostic memory dump
  • FIG. 8 is a schematic block diagram illustrating an embodiment of a processor complex in a processor node
  • FIG. 9 is a schematic block diagram showing an embodiment of a processor complex that includes three processor slices.
  • FIGS. 10A and 10B are schematic block diagrams depicting an embodiment of a processor complex and logical processor.
  • a computing system composed of multiple redundant processors can capture a memory dump by having a central processing unit, for example running an operating system that supports multiple redundant processors such as the NonStop Kernel™ made available by Hewlett Packard Company of Palo Alto, Calif., copy the memory of a non-executing central processing unit.
  • the non-executing or “down” processor can run a Halted State Services (HSS) operating system.
  • HSS Halted State Services
  • pre-Fast-Memory-Dump pre-FMD
  • the running processor copies raw data from the down processor memory and compresses the raw data in the running processor.
  • the raw data is compressed in the down processor, for example under HSS, and the compressed data is moved via network communications, such as ServerNet™, to the running processor, which writes the compressed data to storage, such as a memory dump disk.
  • a further alternative Fast-Memory-Dump enhancement involves copying only part of the memory, either compressed or noncompressed data, from the down processor to the running processor, then reloading the operating environment to the memory part to begin execution, then copying the remainder of the memory after completion of the reload that returns the memory to the running system.
  • the described dump techniques use capture of memory contents prior to reloading the processor, for example by copying the memory to the system swap files or special dump files.
  • the copying time may be mitigated, for example by copying part of memory and reloading into only that copied part, then recopying the remainder of the data after the processor is reloaded, returning memory to normal usage after the copy is complete.
  • the method reduces the time expenditure, and thus system down-time, for capturing and storing the memory dump data, but significantly impacts processor performance since only a subset of the memory is available for normal operation.
  • the techniques further involve a significant delay before normal operation can be started.
  • a logical processor may include two or three processor elements running the same logical instruction stream.
  • a dual-modular redundant (DMR) logical processor and/or a tri-modular redundant (TMR) logical processor can capture and save a logical processor memory dump while concurrently running an operating system on the same logical processor.
  • DMR dual-modular redundant
  • TMR tri-modular redundant
  • Memory dump data is copied from the down processor element, for example using a dissimilar data exchange direct memory access (DMA) transfer.
  • DMA dissimilar data exchange direct memory access
  • a schematic block diagram depicts an embodiment of a computing system 100 that includes a plurality of redundant, loosely-coupled processor elements 102 that are arranged and operational as a logical processor.
  • a logic, for example executable in the processor elements 102, detects a halt condition of the logical processor and, in response to the halt condition, reloads and commences operation in less than all of the processor elements, leaving at least one processor element nonoperational.
  • the logic also buffers data from the nonoperational processor element in the reloaded operational processor elements and writes the buffered data to storage for analysis.
  • the loosely-coupled processors 102 form a combined system of multiple logical processors, with the individual processors 102 assuring data integrity of a computation.
  • the illustrative computing system 100 operates as a node that can be connected to a network 104.
  • a suitable network 104 is a dual-fabric Hewlett Packard ServerNet cluster™.
  • the computing system 100 is typically configured for fault isolation, small fault domains, and a massively parallel architecture.
  • the computing system 100 is a triplexed server which maintains no single point of hardware failure, even under processor failure conditions.
  • the illustrative configuration has three instances of a processor slice 106A, 106B, 106C connected to four logical synchronization units 108.
  • a logical synchronization unit 108 is shown connected to two network fabrics 104.
  • Individual logical computations in a logical processor are executed separately three times in the three physical processors. Individual copies of computation results eventually produce an output message for input/output or interprocess communication, which is forwarded to a logical synchronization unit 108 and mutually checked for agreement. If any one of the three copies of the output message is different from the others, the differing copy of the computation is “voted out” of future computations, and computations on the remaining instances of the logical processor continue. Accordingly, even after a processor failure, no single point of failure proceeds to further computations. At a convenient time, the errant processing element can be replaced online, reintegrating with the remaining processing elements, restoring the computing system 100 to fully triplexed computation.
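The voting step described above can be illustrated with a minimal Python sketch. It is not the patented voter logic: the function name `vote`, the slice labels, and the message bytes are all hypothetical, and the sketch assumes that a strict majority of identical output messages decides which copy is forwarded.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def vote(outputs: Dict[str, bytes]) -> Tuple[Optional[bytes], Set[str]]:
    """Return (agreed message, names of slices voted out).

    `outputs` maps a hypothetical slice name to the output message that
    slice produced for one logical computation.
    """
    counts = Counter(outputs.values())
    message, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        # No strict majority: no copy can be trusted, so nothing is forwarded.
        return None, set(outputs)
    # Every slice whose copy differs from the majority is voted out.
    voted_out = {name for name, msg in outputs.items() if msg != message}
    return message, voted_out


agreed, dropped = vote(
    {"A": b"write blk 7", "B": b"write blk 7", "C": b"write blk 9"}
)
```

Here slices A and B agree, so their message is forwarded and slice C is voted out of future computations, while A and B continue.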
  • the individual processor slices 106A, 106B, 106C are each associated respectively with a memory 110A, 110B, 110C and a reintegration element 112A, 112B, 112C.
  • the data can be buffered or temporarily stored, typically in the memory associated with the running processor slices.
  • the individual processor slices 106A, 106B, 106C also can include a logic, for example a program capable of execution on the processor elements 102 or other type of control logic, that reloads and reintegrates the nonoperational processor element or slice into the logical processor after the data is buffered.
  • An asymmetric data dump is desirable to enable analysis into causes of a failure condition. Diagnostic information is more likely to be collected if acquisition can be made without compromising performance and availability.
  • the computing system 100 can activate capture of diagnostic memory dump information upon a logical processor halt condition.
  • Diagnostic memory dump collection logic reloads the logical processor but does not include all processor slices 106A, 106B, 106C in the reload.
  • the processor slice to omit from the reload is selected arbitrarily.
  • the omitted processor slice can be selected based on particular criteria, such as measured performance of the individual slices, variation in capability and/or functionality of the particular processor slices, or the like.
  • the processor slice omitted from the reload is maintained in the stopped condition with network traffic neither allowed into the stopped processor elements 102 nor allowed as output.
  • the processor element may remain stopped and the associated memory may be dumped prior to reintegrating the element.
  • a flow chart illustrates an embodiment of a sequence of operations for performing an asymmetric memory dump operation 200 .
  • a response logic or program reloads the operating system 204 in less than all of the processor slices.
  • one processor slice is omitted from reload while the operating system is reloaded in the other two processor slices.
  • One technique for reloading less than all processor slices is performed by issuing a command to place the logical processor in a “ready for reload” state that denotes which processor slice and/or processor element is to be omitted from the reload.
  • the omitted processor slice or processor element is voted out and the remaining two processor elements in a tri-modular redundant (TMR) logical processor, or the remaining single processor element in a dual-modular redundant (DMR) logical processor, are reloaded. Criteria for selection of the omitted processor may be arbitrary or based on various conditions and circumstances.
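As a rough illustration of the selection step, the sketch below picks one slice to omit and returns the slices to reload. The name `plan_reload`, the `scores` mapping, and the rule "omit the lowest-scoring slice" are assumptions made for the example; the text says only that selection may be arbitrary or based on criteria such as measured performance.

```python
import random
from typing import Dict, List, Optional, Tuple


def plan_reload(
    slices: List[str],
    scores: Optional[Dict[str, float]] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Choose one slice to omit; return (omitted, slices to reload)."""
    if len(slices) < 2:
        raise ValueError("need at least two redundant slices")
    if scores is not None:
        # Assumed criterion: omit the slice with the lowest measured score.
        omitted = min(slices, key=lambda name: scores[name])
    else:
        # Arbitrary selection when no criteria are supplied.
        omitted = (rng or random.Random()).choice(slices)
    reloaded = [name for name in slices if name != omitted]
    return omitted, reloaded


tmr_plan = plan_reload(["A", "B", "C"], scores={"A": 1.0, "B": 0.4, "C": 0.9})
dmr_plan = plan_reload(["A", "B"])
```

For a TMR logical processor two slices remain to be reloaded; for a DMR logical processor a single slice remains.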
  • a parallel receive dump (PRD) program is automatically initiated 206 .
  • PRD parallel receive dump
  • a separate instruction stream of the PRD program typically executes in all reloaded processor slices, although other implementations may execute from logic located in alternative locations.
  • the parallel receive dump (PRD) program creates 208 a dump file and allocates 210 buffers, typically in memory associated with the processor slices. All architectural state of the processor elements is typically saved in the memory so that separate collection of the information is superfluous.
  • the parallel receive dump (PRD) program opens 212 a memory window on a physical partition of memory associated with the particular processor slice executing the PRD program.
  • the parallel receive dump (PRD) program moves 214 the window over the entire partition.
  • all processor elements of the logical processor generally have identical partitions so that the window also describes the partition of the omitted processor element.
  • the parallel receive dump (PRD) program performs a divergent data direct memory access (DMA) 216 operation that transfers data from one processor slice to another via DMA.
  • DMA divergent data direct memory access
  • a specific embodiment can use high order address bits of the source memory address to identify the particular omitted processor slice and/or processor element. For example, four memory views are available using ServerNet™ including the logical processor as a whole, processor slice A, processor slice B, and processor slice C.
  • a direct memory access device reads data from the omitted processor slice and copies the data to a buffer in memory of the two executing processor slices, for example running the NonStop Kernel™.
  • the source physical address high-order bits denote the one specific processor slice omitted from the reload.
  • the target buffer for the DMA operation is in memory associated with the reloaded processor slices running the operating system and not the memory associated with the stopped processor slice.
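The idea of selecting a memory view through high-order address bits can be sketched as follows. The address width, the bit position of the selector, and the names `View`, `encode`, and `decode` are hypothetical; the text states only that high-order bits of the source address identify the omitted slice and that four views exist.

```python
from enum import IntEnum
from typing import Tuple

ADDRESS_BITS = 40           # hypothetical width of the in-slice physical address
VIEW_SHIFT = ADDRESS_BITS   # assumed: the view selector sits just above it


class View(IntEnum):
    LOGICAL = 0   # the logical processor as a whole
    SLICE_A = 1
    SLICE_B = 2
    SLICE_C = 3


def encode(view: View, offset: int) -> int:
    """Build a source address whose high-order bits name the memory view."""
    if not 0 <= offset < (1 << ADDRESS_BITS):
        raise ValueError("offset does not fit in the address field")
    return (int(view) << VIEW_SHIFT) | offset


def decode(address: int) -> Tuple[View, int]:
    """Split a source address back into (view, offset)."""
    return View(address >> VIEW_SHIFT), address & ((1 << ADDRESS_BITS) - 1)


source = encode(View.SLICE_C, 0x1000)   # read from the omitted slice C
```

In this sketch a DMA descriptor whose source is `encode(View.SLICE_C, ...)` would read from slice C only, while the target buffer would use an address in the memory of the reloaded slices.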
  • the computing system that executes the asymmetric memory dump operation 200 may further include a reintegration logic that restarts and resynchronizes the plurality of processor elements following a failure or service condition.
  • a reintegration process is executable on at least one of the operating processor elements and delays reintegration of the nonoperational processor element until dump processing is complete.
  • the parallel receive dump program compresses 218 the data into a compressed dump format, and writes 220 the compressed data to a storage device, for example a dump file on an external disk storage. Transfer of the compressed dump data is similar to the dump operation of the pre-Fast-Memory-Dump (pre-FMD) technique receive dump operation.
  • pre-FMD pre-Fast-Memory-Dump
  • When the parallel receive dump program has completed copying, compressing, and writing the dump data to storage, the program closes 222 the window to physical memory, closes 224 the storage file, and initiates 226 reintegration of the dumped processor slice.
  • the omitted processor slice is reintegrated 228 and the dump operation completes.
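The copy, compress, and write steps of the parallel receive dump sequence can be sketched in a few lines. This is an illustration under stated assumptions, not the PRD program itself: the window size, the function name, and the use of Python's `zlib` are all choices made for the example (the text does not name a compression format), and a slice of a `bytes` object stands in for the divergent data DMA transfer.

```python
import io
import zlib

WINDOW = 4096  # hypothetical window size in bytes


def parallel_receive_dump(down_memory: bytes, dump_file, window: int = WINDOW) -> int:
    """Copy `down_memory` into `dump_file` one window at a time.

    Returns the number of windows copied.
    """
    compressor = zlib.compressobj()
    windows = 0
    for offset in range(0, len(down_memory), window):
        # Stand-in for the DMA: copy one window of the down slice's
        # partition into a buffer owned by the running slices.
        buffer = bytes(down_memory[offset:offset + window])
        # Compress the buffered data and append it to the dump file.
        dump_file.write(compressor.compress(buffer))
        windows += 1
    dump_file.write(compressor.flush())
    return windows


down = bytes(range(256)) * 64   # 16 KiB of sample "down slice" memory
out = io.BytesIO()              # stands in for the dump file on disk
copied = parallel_receive_dump(down, out)
```

Only after the last window has been written would the program close the window and the file and start reintegration of the omitted slice.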
  • the diagnostic memory dump method 300 can be implemented in an executable logic such as a computer program or other operational code that executes in the processor elements or in other control elements.
  • the operation begins on detection 302 of a halt condition of at least one processor element of multiple redundant processor elements.
  • a system may implement a pointer to a linked list of various fault and operating conditions that causes execution to begin at a servicing logic for a particular condition.
  • Various conditions may evoke a response that generates a diagnostic memory dump.
  • One processor element, termed a “down” processor element, is maintained 304 in the state existing at the halt condition, and the other processor elements are reloaded 306 and enabled to commence execution, for example by restarting an operating system such as Hewlett Packard's NonStop Kernel™.
  • One technique for initiating a response to the halt condition issues a command that places the logical processor in a “ready for reload” state, in which the command designates the processor element to be omitted from the reload.
  • the command causes “voting-out” of the omitted processor element and reloads the remaining processor elements for execution.
  • the state of the down processor maintained at the halt condition is copied 308 to a storage while the reloaded other processors continue executing.
  • the reloaded processor elements can automatically initiate a parallel receive dump program that creates a dump file, allocates buffers in memory associated with the reloaded processors for usage in temporarily storing the memory dump data, and saves the architectural state of all processor elements in memory buffers.
  • a special direct memory access (DMA) operation can be started that copies memory of the down processor element to buffers in the running processor elements.
  • the divergent data Direct Memory Access (DMA) operation uses a self-directed write whereby a designated source memory address identifies the one processor element maintained in the halt condition.
  • the parallel receive dump program compresses data into dump format, writes 312 the compressed memory dump data to a storage device, and closes the memory window.
  • the divergent data DMA operation can be used to write 312 the memory dump from the buffers to a storage device, for example an external disk storage device for subsequent analysis.
  • the down processor element is reintegrated 310 into the logical processor.
  • Referring to FIGS. 4A, 4B, and 4C, schematic block diagrams respectively show an embodiment of a computer system 400, for example a fault-tolerant NonStop™ architecture computer system from Hewlett-Packard Company of Palo Alto, Calif., and two views of an individual processor slice 402.
  • the illustrative processor slice 402 is an N-way computer with dedicated memory and clock oscillator.
  • the processor slice 402 has multiple microprocessors 404 , a memory controller/IO interface 408 , and a memory subsystem 406 .
  • the processor slice 402 further includes reintegration logic 410 and an interface to voter logic.
  • a logical processor halts.
  • the computer system 400 is an entire process made up of multiple logical processors.
  • the halted logical processor ceases functioning under the normal operating system, for example the NonStop Kernel system, and enters a state that allows only abbreviated functionality.
  • the halt condition causes the system to function under halted state services (HSS).
  • a failure monitoring logic such as software or firmware operating in control logic, detects the halt condition, and selects from among the processor slices 402 for a processor slice to omit from reloading.
  • the omitted processor may be selected arbitrarily or based on functionality or operating characteristics, such as performance capability considerations associated with the different processor slices 402 .
  • the control logic reloads the operating system into memory for processor slices that are not selected for omission so that the processor slices return to execution.
  • the omitted processor slice remains inactive or “down”, continuing operations under halted state services (HSS).
  • the reloaded processor slices request capture of a memory dump from the omitted processor memory while the omitted processor slice remains functionally isolated from the operating processor slices.
  • the reloaded processors begin a copy process, for example that executes on the reloaded processors and stores the memory dump data from the omitted processor to memory 406 associated with the operating processor slices 402 .
  • the diagnostic dump data passes from the omitted processor slice to the reloaded and operating processor slices via a pathway through a logical synchronization unit 414 .
  • the logical synchronization unit 414 is an input/output interface and synchronization unit that can be operated to extract the memory dump data from the omitted processor slice including a copy of the data and a copy of a control descriptor associated with the data.
  • the reintegration logic 410 generally operates in conditions of processor slice failure to reintegrate the operations of the failed slice into the group of redundant slices, for example by replicating write operations of memory for one processor slice to memory of other processor slices in a redundant combination of processor slices.
  • software executing in one or more of the reloaded and running processor slices performs asymmetric input/output operations that copy memory dump data from the omitted processor slice to memory buffers in both of the operating, reloaded processor slices. Accordingly, the two reloaded processor slices 402 operate temporarily in duplex mode while acquiring and storing the diagnostic dump information before returning to triplex operation.
  • the microprocessors 404 may be standard Intel Itanium Processor Family multiprocessors that share a partitioned memory system. Each microprocessor may have one or more cores per die.
  • a processor slice 402 with an N-Way Symmetrical Multi-Processor (SMP) supports N logical processors. Each logical processor has an individual system image and does not share memory with any other processor.
  • SMP Symmetrical Multi-Processor
  • Reintegration logic 410 can replicate memory write operations to the local memory and send the operations across a reintegration link 412 to another slice.
  • the reintegration logic 410 is configurable to accept memory write operations from the reintegration link 412 .
  • the reintegration logic 410 can be interfaced between the I/O bridge/memory controller 408 and memory 406 , for example Dual In-line Memory Modules (DIMMs).
  • DIMMs Dual In-line Memory Modules
  • the reintegration logic 410 may be integrated into the I/O bridge/memory controller 408 .
  • Reintegration logic 410 is used to bring a new processor slice 402 online by bringing memory state in line with other processor slices.
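A minimal sketch of write replication over a reintegration link follows. The class `SliceMemory`, the method names, and in particular the `bring_online` sweep (rewriting every block so that it is replicated to the new slice) are assumptions for illustration; the text says only that writes are replicated across the link and that the new slice's memory state is brought in line with the others.

```python
from typing import Optional


class SliceMemory:
    """Toy model of one slice's memory behind hypothetical reintegration logic."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray(size)
        self.link: Optional["SliceMemory"] = None  # peer reached over the link

    def _store(self, addr: int, data: bytes) -> None:
        if addr < 0 or addr + len(data) > len(self.cells):
            raise IndexError("write outside slice memory")
        self.cells[addr:addr + len(data)] = data

    def write(self, addr: int, data: bytes) -> None:
        # Local write, replicated across the link when one is attached.
        self._store(addr, data)
        if self.link is not None:
            self.link.accept(addr, data)

    def accept(self, addr: int, data: bytes) -> None:
        # Write received from the reintegration link; not forwarded further.
        self._store(addr, data)


def bring_online(source: SliceMemory, target: SliceMemory, block: int = 64) -> None:
    """Assumed sweep: rewrite each block so replication copies it to `target`."""
    source.link = target
    for addr in range(0, len(source.cells), block):
        source.write(addr, bytes(source.cells[addr:addr + block]))


running = SliceMemory(256)
running.write(0, b"state before reintegration")
new_slice = SliceMemory(256)
bring_online(running, new_slice)
running.write(100, b"later write")   # replicated because the link stays attached
```

Because ordinary writes keep flowing through the same replicating path during the sweep, the two memories stay identical once the sweep finishes.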
  • the computer system 400 uses loosely lock-stepped multiprocessor boxes called slices 402, each a fully functional computer with a combination of microprocessors 404, cache, memory 406, and interfacing 408 to input/output lines. All output paths from the multiprocessor slices 402 are compared for data integrity. A failure in one slice 402 is handled transparently by continuing operation with the other slices 402.
  • the computer system 400 executes in a “loose lock-stepping” manner in which redundant microprocessors 404 run the same instruction stream and compare results intermittently, not on a cycle-by-cycle basis, but rather when the processor slice 402 performs an output operation. Loose-lockstep operation prevents error recovery routines and minor non-determinism conditions in the microprocessor 404 from causing lock-step comparison errors.
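The difference between cycle-by-cycle comparison and comparison at output operations can be shown with a small sketch. The trace format (a list of `(kind, payload)` tuples) and the function names are hypothetical and exist only for this example.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, str]  # (kind, payload); kind is "internal" or "output"


def outputs_only(trace: Sequence[Event]) -> List[str]:
    """Keep only the payloads of output operations."""
    return [payload for kind, payload in trace if kind == "output"]


def loosely_in_step(traces: Sequence[Sequence[Event]]) -> bool:
    """True when all slices produced the same sequence of outputs."""
    streams = [outputs_only(trace) for trace in traces]
    return all(stream == streams[0] for stream in streams)


slice_a = [("internal", "load"), ("output", "msg-1"), ("output", "msg-2")]
# Slice B ran an extra error recovery step but produced identical outputs.
slice_b = [("internal", "load"), ("internal", "ecc-retry"),
           ("output", "msg-1"), ("output", "msg-2")]
slice_c = [("internal", "load"), ("output", "msg-1"), ("output", "msg-X")]

same_outputs = loosely_in_step([slice_a, slice_b])
diverged = not loosely_in_step([slice_a, slice_b, slice_c])
```

A strict cycle-by-cycle comparison would flag slice B because of its extra internal step; comparing only at output operations does not, while slice C's differing output is still detected.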
  • a schematic block diagram depicts an embodiment of a synchronization unit 500 including a logical gateway 514 that prevents divergent operations from propagating to the input/output stream.
  • the synchronization unit 500 is capable of connecting to one, two, or three processor slices 504 through a serialized input/output bus.
  • the synchronization unit 500 performs transaction-level checking on input/output transactions and forwards data to a host-side input/output device 522 , for example a host bus adapter or storage array network (SAN) controller.
  • the synchronization engine 520 enables the multiple processor slices 504 to synchronize and exchange asynchronous data such as interrupts and can also control exchange of private and dissimilar data among slices.
  • Logical gateway 514 has two independent voter subunits, one for voting Programmed Input/Output (PIO) read and write transactions 516 and a second for voting Direct Memory Access (DMA) read responses 518 .
  • the Direct Memory Access (DMA) read response subunit 518 verifies input/output controller-initiated DMA operations or responses from memory and performs checks on read data with processors performing voted-write operations.
  • DMA write traffic is originated by the input/output controller 522 and is replicated to all participating processor slices 504 .
  • DMA read traffic is originated by the input/output controller 522 .
  • DMA read requests are replicated to all participating slices 504 .
  • a block timing diagram illustrates an embodiment of the technique for performing a diagnostic memory dump 700 .
  • the computing system runs in a triplex mode 702 with the three individual processor slices executing a common instruction stream redundantly.
  • a halt condition 704 terminates execution of the processor slices and causes all slices to enter halted state services (HSS).
  • a processor slice is selected for omission from duplex operation, and the two processor slices not selected for omission are reloaded 706 . After reload, the two processor slices run the operating system in duplex 708 .
  • a copy process starts 710 , for example in a logical synchronization unit that operates as a pathway for copying data from the omitted processor slice memory to a buffer memory in one or both of the running processor slices under management of control information received from the running processor slices.
  • Data can be copied through a scan of the entire memory partition of the omitted processor slice memory.
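  • The dump sequence above (halt, omit one slice, reload the other two, copy the omitted memory) can be sketched as follows. The dictionary-of-lists memory model, the chunk size, and the function name diagnostic_memory_dump are hypothetical; the sketch only shows the ordering of the steps.

```python
def diagnostic_memory_dump(memories, omitted, chunk=4):
    """Reload every slice except `omitted`, then copy the omitted memory.

    memories: dict mapping slice name to a list of memory words (assumed model).
    Returns the names of the running slices and the dumped memory image.
    """
    preserved = memories[omitted]  # left untouched so the halt-time state survives
    running = sorted(name for name in memories if name != omitted)

    # Reload: the slices that continue in duplex start from fresh memory.
    for name in running:
        memories[name] = [0] * len(memories[name])

    # Copy: scan the entire omitted memory partition, one chunk at a time,
    # into a buffer that stands in for buffer memory on a running slice.
    dump = []
    for start in range(0, len(preserved), chunk):
        dump.extend(preserved[start:start + chunk])
    return running, dump
```

  • The point of the ordering is that the omitted slice is never reloaded, so its memory still holds the state present at the halt condition while the other two slices resume service.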
  • a schematic block diagram illustrates an embodiment of a processor complex 900 that includes three processor slices, slice A 902 , slice B 904 , and slice C 906 , and N voting blocks 908 and System Area Network (SAN) interfaces 910 .
  • the number of voting blocks N is the number of logical processors supported in the processor complex 900 .
  • a processor slice 902 , 904 , 906 is illustratively a multiprocessor computer with contained caches, a memory system, a clock oscillator, and the like. Each microprocessor is capable of running a different instruction stream from a different logical processor.
  • N voting blocks 908 and N SAN interfaces 910 are mutually paired and included within N respective logical synchronization units (LSUs) 912 .
  • An illustrative processor complex 900 has one to two logical synchronization blocks 912 , with associated voting block unit 908 and SAN interface 910 , per logical processor.
  • processor complex 900 is merely a descriptive term and does not necessarily define a system enclosed within a single housing.
  • a processor complex is generally not a single field-replaceable unit.
  • a field-replaceable unit can include one LSU and one slice.
  • the voter units 908 are logical gateways of operation and data crossing from the logical synchronization blocks 912 unchecked domain to a self-checked domain.
  • DMA read data and PIO read and write requests that address the self-checked domain are checked by the voter unit 908 in order of receipt. Operations are not allowed to pass one another, and each operation completes before the next is allowed to start.
  • DMA read response data are also checked in the order received and then forwarded to the system area network interface 910 , for example a Peripheral Component Interconnect Extended (PCI-X) interface.
  • a processor complex includes reintegration links 412 that copy memory contents from a functioning slice or slices to a non-operating or newly added slice. Reintegration is used after some errors or repair operations, in conditions where recovery is served by resetting a slice and returning to operation with other running slices.
  • the reintegration link 412 may copy over the memory of a single processor element, multiple processor elements, or all processor elements within a processor slice.
  • processor element refers to a single core.
  • FIG. 10B a schematic block diagram illustrates the embodiment of the processor complex 1000 , depicting a logical processor 1006 .
  • in a processor slice 1004 , for example the N-way SMP processor slice, each instruction stream is associated with a different logical processor 1006 .
  • Each logical processor 1006 can execute a processor-dedicated copy of an operating system, for example the NonStop Kernel (NSK)TM operating system from Hewlett-Packard Company of Palo Alto, Calif.
  • N logical processors 1006 that mutually share neither private memory nor peripheral storage, but otherwise all run out of the same physically-shared memory. Except for a small amount of initialization code that segments the processor slice memory, each logical processor runs independently of the others from different regions of the same memory.
  • the logical processor 1006 is formed from one or more processor elements 1002 , for example three in the illustrative embodiment, depending on the number of processor slices 1004 available.
  • a simplex logical processor has only one processor element (PE) per logical processor.
  • a dual-modular redundant (DMR) logical processor has two processor elements (PEs) per logical processor.
  • a tri-modular redundant (TMR) logical processor has three.
  • Each processor element 1002 in a logical processor 1006 runs the same instruction stream, in loosely lock-stepped operation, and output data from multiple processor elements is compared during data input/output (I/O) operations.
  • the logical synchronization unit (LSU) 500 functions as part of a logical processor 1006 in a fault tolerant interface to a system area network 914 and performs voting and synchronization of the processor elements 1002 of the logical processor 1006 .
  • each logical synchronization unit 500 is controlled and used by only a single logical processor 1006 .
  • one or two logical synchronization units 500 are combined with one, two, or three processor elements 1002 to create varying degrees of fault tolerance in the logical processors 1006 .
  • a system may optionally be configured with a second logical synchronization unit 500 per logical processor 1006 .
  • the voter logic 908 connects the processor slices 902 , 904 , 906 to the SAN interface 910 and supplies synchronization functionality for the logical processor. More specifically, the voter logic 908 compares data from programmed input/output (PIO) reads and writes to registers in the logical synchronization unit 912 from each of the processor elements. The comparison is called voting and ensures that only correct commands are sent to logical synchronization unit logic. Voter logic 908 also reads outbound data from processor slice memories and compares the results before sending the data to the system area network (SAN), ensuring that outbound SAN traffic only contains data computed, or agreed-upon by voting, by all processor elements in the logical processor.
  • the voter logic 908 also replicates and distributes programmed input/output (PIO) data read from the system area network and registers in the logical synchronization unit 912 to each of the processor elements.
  • the voter logic 908 further replicates and distributes inbound data from the system area network to each of the processor elements.
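  • The voting and replication roles of the voter logic described above can be sketched in a few lines. The strict-majority rule, the function names vote and replicate, and the use of plain Python values for outbound data are assumptions for illustration; the text above states only that data from the processor elements is compared and that inbound data is replicated.

```python
from collections import Counter


def vote(values):
    """Return (agreed_value, dissenting_indices) for one outbound operation.

    values: the copy of the operation produced by each processor element.
    Raises ValueError when no strict majority exists (an assumed policy).
    """
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority among processor elements")
    dissenting = [i for i, v in enumerate(values) if v != value]
    return value, dissenting


def replicate(data, element_count):
    """Give every participating processor element its own copy of inbound data."""
    return [data for _ in range(element_count)]
```

  • With three processor elements a single divergent element is outvoted and identified; with two elements a mismatch is detected but cannot be resolved by majority, which the sketch reports as an error.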
  • the voter logic 908 can supply time-of-day support to enable processor slices to simultaneously read the same time-of-day value.
  • the voter logic 908 supports a rendezvous operation so that all processor slices can periodically check for mutual synchrony, and cause one or more processor elements to wait to attain synchrony.
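  • A rendezvous of the kind described above can be modeled as a simple arrival set. The class name Rendezvous and its arrive method are hypothetical, and the sketch is software-only; it does not represent the rendezvous registers themselves.

```python
class Rendezvous:
    """Track which processor elements have reached the rendezvous point."""

    def __init__(self, participants):
        self.participants = set(participants)
        self.arrived = set()

    def arrive(self, element):
        """Record an arrival; return True once every participant has arrived.

        A False result means the caller must keep waiting for the others.
        """
        if element not in self.participants:
            raise KeyError(element)
        self.arrived.add(element)
        if self.arrived == self.participants:
            self.arrived = set()  # reset for the next periodic check
            return True
        return False
```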
  • the voter logic 908 also supports asymmetric data exchange buffers and inter-slice interrupt capability.
  • the voter logic shown as the logical gateway 514 in FIG. 5 , includes interface logic, for example programmed input/output (PIO) 516 and direct memory access (DMA) read interface 518 , and state logic.
  • the state logic designates either an asymmetric state or a symmetric state.
  • the asymmetric state is specific to one processor element 504 .
  • the symmetric state is common to the entire logical processor. Examples of processor element-specific logic and data are the rendezvous registers and logic shown as synchronization engine 520 , dissimilar data exchange buffers, and the inter-slice interrupts.
  • Parallel read and write operations to the asymmetric logic are from a single processor element 504 .
  • Processor element-initiated read and write operations to the asymmetric registers are not voted or compared. Data is sent back only to the specific processor element requesting the operation.
  • the logical gateway 514 forwards data to the processor element memories 506 at approximately the same time.
  • the processor elements 504 do not execute in perfect lockstep so that data may arrive in memory 506 early or late relative to program execution of the particular processor element 504 .
  • the system area network (SAN) interface 522 is generally used for all input/output, storage, and interprocessor communications.
  • the SAN interface 522 communicates with the three processor slices through the logical gateway 514 .
  • System area network (SAN) traffic passes to and from a logical processor and not individual processor elements.
  • the logical gateway 514 replicates data from the system area network to the memory of all processor elements 504 participating in the logical processor.
  • the logical gateway 514 also performs the voting operation, comparing data from the slices before passing the data to the SAN interface.
  • each logical processor has a dedicated SAN interface.
  • redundant execution paths can be implemented to prevent a single failure from disabling multiple logical processors.
  • the system performs according to a loose lock-step fault tolerance model that enables high availability.
  • the system is tolerant of hardware and many software faults via loosely-coupled clustering software that can shift workload from a failing processor to the other processors in the cluster.
  • the model tolerates single hardware faults as well as software faults that affect only a single processor.
  • the model uses processor self-checking and immediately stopping before faulty data is written to persistent storage or propagated to other processors.
  • the new processor or processors are reintegrated, including restarting and resynchronizing with the existing running processors. The steps that restore the processor memory state and return to loose lock-step operation with the existing processor or processors are called reintegration.
  • Unlike a logical processor failure, multiple-redundant hardware failures and the subsequent reintegration of the replacement or restarted hardware are not detectable to application software.
  • Reintegration is an application-transparent action that incorporates an additional processor slice into one or more redundant slices in an operating logical processor.
  • a reintegration link is used to copy memory state from the reintegration source to the target. Since the processor on the reintegration source continues executing application code and updating memory, the reintegration link allows normal operations to modify memory and still have modifications reflected to the reintegration target memory.
  • Reintegration link hardware can be implemented so that reintegration occurs in groups of one or more logical processors. For example, an implementation can reintegrate an entire slice even if only one processor element of one logical processor is restarted. A reintegration scheme that affects only a single logical processor reduces or minimizes the amount of time a system runs on less than full capability.
  • Reintegration is triggered by a condition such as a processor slice replacement, receipt of a command for a system management facility, and/or occurrence of an input/output voting error or other error detected by other processor elements in the logical processor.
  • for processor slice replacement, each new processor element is reintegrated by the currently running logical processors.
  • remaining executing processor elements can reset the faulty processor element and reintegrate.
  • the processor element can be brought back to a fully functional state.
  • the logical processor that can reintegrate a new processor element determines whether to begin the reintegration process. For reintegration of an entire slice, control resides in the processor complex. In both cases reintegration control is below the system level function. If reintegration fails, the logical processor simply logs the error and continues to attempt the reintegration process. The logical processor may reduce the frequency of reintegration, but continues to try until success.
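  • The retry behavior described above (log the failure, keep trying, optionally less often) can be sketched as a loop. The doubling of the retry interval, the max_interval cap, and the function name reintegrate_until_success are assumptions; the text states only that the frequency of attempts may be reduced.

```python
def reintegrate_until_success(attempt, max_interval=8):
    """Call `attempt` until it returns True.

    attempt: a zero-argument callable standing in for one reintegration try.
    Returns the error log and the wait intervals chosen between attempts.
    The intervals are recorded rather than slept so the sketch runs instantly.
    """
    interval = 1
    log = []
    waits = []
    while not attempt():
        log.append("reintegration failed")   # log the error and carry on
        waits.append(interval)
        interval = min(interval * 2, max_interval)  # retry less frequently
    return log, waits
```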
  • reintegration control is performed by the reintegration logic 410 that is positioned between the processors 404 and memory 406 .
  • a reintegration link 412 connects the multiple processor slices 402 and can reflect memory writes from one slice to an adjacent neighbor slice.
  • the reintegration logic 410 can be a double data rate synchronous dynamic random access memory (DRAM) interface.
  • Reintegration logic 410 transparently passes memory operations between the microprocessors 404 and local memory 406 .
  • Reintegration link 412 usage is generally limited to scrubbing of latent faults.
  • reintegration logic 410 at the source processor slice duplicates all main memory write operations, sending the operation both to the local memory 406 and across the reintegration link 412 .
  • reintegration logic 410 accepts incoming writes from the reintegration link 412 and writes them to target local memory 406 .
  • the target does not execute application programs but rather executes a tight cache resident loop with no reads or writes to target local memory 406 .
  • the reintegration link 412 is a one-way connection from one processor slice to one adjacent neighbor processor slice. For a system that includes three processor slices A, B, and C, only slice A can reintegrate slice B, only slice B can reintegrate slice C, and only slice C can reintegrate slice A. Starting from one processor slice with two new processor slices, reintegration can be done in two steps. The first reintegration cycle brings the second processor slice online and the second reintegration cycle brings the third online.
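  • The one-way ring of reintegration links described above determines the order in which slices can be brought online. The sketch below derives that order; the RING table mirrors the A, B, C example in the text, while the function name reintegration_plan is hypothetical.

```python
# One-way links from the example above: A reintegrates B, B reintegrates C,
# and C reintegrates A.
RING = {"A": "B", "B": "C", "C": "A"}


def reintegration_plan(running, ring=RING):
    """Return the (source, target) reintegration cycles needed to bring
    every slice in the ring online, starting from the `running` slices."""
    online = set(running)
    plan = []
    while len(online) < len(ring):
        for source in sorted(online):
            target = ring[source]
            if target not in online:
                plan.append((source, target))
                online.add(target)
                break
        else:
            break  # no running slice can act as a source
    return plan
```

  • Starting from slice A alone, the plan is two cycles, A to B and then B to C, matching the two-step reintegration described above.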
  • reintegration logic 410 on the source and target slices is initialized.
  • the logical synchronization unit 414 is set unresponsive to the target: no interrupts are delivered to the target, and input/output operations do not include the target.
  • the target is set to accept writes from the reintegration link 412 .
  • the target executes an in-cache loop waiting for reintegration to complete.
  • the source reads and writes back all memory local to the source, in an atomic operation since the SAN interface may simultaneously be updating source memory.
  • a single pass operation reads each cache block from memory and then tags the cache block as dirty with an atomic operation without changing contents. Then, target memory is updated except for the state contained in the remaining dirty cache blocks on the source cache.
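  • The copy mechanism described above, in which the source duplicates every memory write across the reintegration link and then reads and writes back all of its memory, can be sketched as follows. The class SliceMemory, its attribute names, and the function reintegrate are hypothetical; caches and atomicity are deliberately left out of this simplified model.

```python
class SliceMemory:
    """Local memory of one slice with a simplified reintegration-logic front end."""

    def __init__(self, contents):
        self.memory = list(contents)
        self.link_target = None           # slice at the far end of the link
        self.accept_link_writes = False   # set on the target during reintegration

    def write(self, addr, value):
        # Source side: every write goes to local memory and across the link.
        self.memory[addr] = value
        if self.link_target is not None:
            self.link_target.receive(addr, value)

    def receive(self, addr, value):
        # Target side: incoming link writes are applied only when enabled.
        if self.accept_link_writes:
            self.memory[addr] = value


def reintegrate(source, target):
    """Copy source memory to target by reading and writing back every word."""
    target.accept_link_writes = True
    source.link_target = target
    for addr in range(len(source.memory)):
        source.write(addr, source.memory[addr])  # contents are unchanged
```

  • Because ordinary writes are also duplicated while the link is active, a write issued by the source after the pass still reaches the target, which is how normal operation can continue during reintegration.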
  • reintegration affects all processor elements in a slice.
  • reintegration is performed to the entire memory of the affected processor slice.
  • the processor complex executes applications with two processor slices active during reintegration so that no loss of data integrity occurs.
  • reintegration is performed on an entire processor slice. In other implementations, a single processor element of the processor slice may be reintegrated.

US10/953,242 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor Abandoned US20050240806A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/953,242 US20050240806A1 (en) 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor
CN 200510107155 CN1755660B (zh) 2004-09-28 2005-09-28 冗余处理器中的诊断存储器转储方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US55781204P 2004-03-30 2004-03-30
US10/953,242 US20050240806A1 (en) 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor

Publications (1)

Publication Number Publication Date
US20050240806A1 true US20050240806A1 (en) 2005-10-27

Family

ID=35346428

Family Applications (5)

Application Number Title Priority Date Filing Date
US10/953,242 Abandoned US20050240806A1 (en) 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor
US10/990,151 Active 2025-09-20 US7890706B2 (en) 2004-03-30 2004-11-16 Delegated write for race avoidance in a processor
US11/042,981 Expired - Fee Related US7434098B2 (en) 2004-03-30 2005-01-25 Method and system of determining whether a user program has made a system level call
US11/045,401 Abandoned US20050246581A1 (en) 2004-03-30 2005-01-27 Error handling system in a redundant processor
US11/071,944 Abandoned US20050223275A1 (en) 2004-03-30 2005-03-04 Performance data access

Country Status (2)

Country Link
US (5) US20050240806A1 (zh)
CN (2) CN1690970A (zh)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4481578A (en) * 1982-05-21 1984-11-06 Pitney Bowes Inc. Direct memory access data transfer system for use with plural processors
US5111384A (en) * 1990-02-16 1992-05-05 Bull Hn Information Systems Inc. System for performing dump analysis
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5295259A (en) * 1991-02-05 1994-03-15 Advanced Micro Devices, Inc. Data cache and method for handling memory errors during copy-back
US5317752A (en) * 1989-12-22 1994-05-31 Tandem Computers Incorporated Fault-tolerant computer system with auto-restart after power-fall
US5781558A (en) * 1996-08-14 1998-07-14 International Computers Limited Diagnostic memory access
US5884019A (en) * 1995-08-07 1999-03-16 Fujitsu Limited System and method for collecting dump information in a multi-processor data processing system
US5999933A (en) * 1995-12-14 1999-12-07 Compaq Computer Corporation Process and apparatus for collecting a data structure of a memory dump into a logical table
US6141635A (en) * 1998-06-12 2000-10-31 Unisys Corporation Method of diagnosing faults in an emulated computer system via a heterogeneous diagnostic program
US6263373B1 (en) * 1998-12-04 2001-07-17 International Business Machines Corporation Data processing system and method for remotely controlling execution of a processor utilizing a test access port
US6314501B1 (en) * 1998-07-23 2001-11-06 Unisys Corporation Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory
US20020144175A1 (en) * 2001-03-28 2002-10-03 Long Finbarr Denis Apparatus and methods for fault-tolerant computing using a switching fabric
US20030014521A1 (en) * 2001-06-28 2003-01-16 Jeremy Elson Open platform architecture for shared resource access management
US6543010B1 (en) * 1999-02-24 2003-04-01 Hewlett-Packard Development Company, L.P. Method and apparatus for accelerating a memory dump
US20030145157A1 (en) * 2002-01-31 2003-07-31 Smullen James R. Expedited memory dumping and reloading of computer processors
US20060136784A1 (en) * 2004-12-06 2006-06-22 Microsoft Corporation Controlling software failure data reporting and responses

Family Cites Families (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3665404A (en) * 1970-04-09 1972-05-23 Burroughs Corp Multi-processor processing system having interprocessor interrupt apparatus
US4228496A (en) * 1976-09-07 1980-10-14 Tandem Computers Incorporated Multiprocessor system
US4293921A (en) * 1979-06-15 1981-10-06 Martin Marietta Corporation Method and signal processor for frequency analysis of time domain signals
JPS61253572A (ja) 1985-05-02 1986-11-11 Hitachi Ltd Load distribution method for a loosely coupled multiprocessor system
US4733353A (en) * 1985-12-13 1988-03-22 General Electric Company Frame synchronization of multiply redundant computers
JP2695157B2 (ja) 1986-12-29 1997-12-24 Matsushita Electric Industrial Co., Ltd. Variable pipeline processor
EP0306211A3 (en) * 1987-09-04 1990-09-26 Digital Equipment Corporation Synchronized twin computer system
CA2003338A1 (en) * 1987-11-09 1990-06-09 Richard W. Cutts, Jr. Synchronization of fault-tolerant computer system having multiple processors
AU616213B2 (en) * 1987-11-09 1991-10-24 Tandem Computers Incorporated Method and apparatus for synchronizing a plurality of processors
JP2644780B2 (ja) 1987-11-18 1997-08-25 Hitachi, Ltd. Parallel computer with a processing request function
GB8729901D0 (en) * 1987-12-22 1988-02-03 Lucas Ind Plc Dual computer cross-checking system
JPH0797328B2 (ja) * 1988-10-25 1995-10-18 International Business Machines Corporation Fault-tolerant synchronization system
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US5369767A (en) * 1989-05-17 1994-11-29 International Business Machines Corp. Servicing interrupt requests in a data processing system without using the services of an operating system
US5291608A (en) * 1990-02-13 1994-03-01 International Business Machines Corporation Display adapter event handler with rendering context manager
JP2833717B2 (ja) * 1990-06-01 1998-12-09 E. I. du Pont de Nemours and Company Composite orthopedic implant with varying modulus of elasticity
US5226152A (en) * 1990-12-07 1993-07-06 Motorola, Inc. Functional lockstep arrangement for redundant processors
CA2068048A1 (en) * 1991-05-06 1992-11-07 Douglas D. Cheung Fault tolerant processing section with dynamically reconfigurable voting
US5339404A (en) * 1991-05-28 1994-08-16 International Business Machines Corporation Asynchronous TMR processing system
JPH05128080A (ja) * 1991-10-14 1993-05-25 Mitsubishi Electric Corp Information processing device
US5613127A (en) * 1992-08-17 1997-03-18 Honeywell Inc. Separately clocked processor synchronization improvement
US5790776A (en) * 1992-12-17 1998-08-04 Tandem Computers Incorporated Apparatus for detecting divergence between a pair of duplexed, synchronized processor elements
US5535397A (en) * 1993-06-30 1996-07-09 Intel Corporation Method and apparatus for providing a context switch in response to an interrupt in a computer process
US5572620A (en) * 1993-07-29 1996-11-05 Honeywell Inc. Fault-tolerant voter system for output data from a plurality of non-synchronized redundant processors
US5504859A (en) * 1993-11-09 1996-04-02 International Business Machines Corporation Data processor with enhanced error recovery
DE69435090T2 (de) * 1993-12-01 2009-06-10 Marathon Technologies Corp., Stow Computer system with control units and computer elements
US6449730B2 (en) * 1995-10-24 2002-09-10 Seachange Technology, Inc. Loosely coupled mass storage computer cluster
US5850555A (en) * 1995-12-19 1998-12-15 Advanced Micro Devices, Inc. System and method for validating interrupts before presentation to a CPU
US6141769A (en) * 1996-05-16 2000-10-31 Resilience Corporation Triple modular redundant computer system and associated method
US5790397A (en) * 1996-09-17 1998-08-04 Marathon Technologies Corporation Fault resilient/fault tolerant computing
US5796939A (en) * 1997-03-10 1998-08-18 Digital Equipment Corporation High frequency sampling of processor performance counters
US5903717A (en) * 1997-04-02 1999-05-11 General Dynamics Information Systems, Inc. Fault tolerant computer system
US5896523A (en) * 1997-06-04 1999-04-20 Marathon Technologies Corporation Loosely-coupled, synchronized execution
EP1029267B1 (en) * 1997-11-14 2002-03-27 Marathon Technologies Corporation Method for maintaining the synchronized execution in fault resilient/fault tolerant computer systems
US6173356B1 (en) * 1998-02-20 2001-01-09 Silicon Aquarius, Inc. Multi-port DRAM with integrated SRAM and systems and methods using the same
US5991900A (en) * 1998-06-15 1999-11-23 Sun Microsystems, Inc. Bus controller
US6223304B1 (en) * 1998-06-18 2001-04-24 Telefonaktiebolaget Lm Ericsson (Publ) Synchronization of processors in a fault tolerant multi-processor system
US6199171B1 (en) * 1998-06-26 2001-03-06 International Business Machines Corporation Time-lag duplexing techniques
US6256753B1 (en) * 1998-06-30 2001-07-03 Sun Microsystems, Inc. Bus error handling in a computer system
US6195715B1 (en) * 1998-11-13 2001-02-27 Creative Technology Ltd. Interrupt control for multiple programs communicating with a common interrupt by associating programs to GP registers, defining interrupt register, polling GP registers, and invoking callback routine associated with defined interrupt register
US6948092B2 (en) * 1998-12-10 2005-09-20 Hewlett-Packard Development Company, L.P. System recovery from errors for processor and associated components
US6393582B1 (en) * 1998-12-10 2002-05-21 Compaq Computer Corporation Error self-checking and recovery using lock-step processor pair architecture
EP1157324A4 (en) * 1998-12-18 2009-06-17 Triconex Corp Process and device for processing control using a multiple redundant process control system
US6397365B1 (en) * 1999-05-18 2002-05-28 Hewlett-Packard Company Memory error correction using redundant sliced memory and standard ECC mechanisms
US6820213B1 (en) * 2000-04-13 2004-11-16 Stratus Technologies Bermuda, Ltd. Fault-tolerant computer system with voter delay buffer
US6658654B1 (en) * 2000-07-06 2003-12-02 International Business Machines Corporation Method and system for low-overhead measurement of per-thread performance information in a multithreaded environment
EP1213650A3 (en) * 2000-08-21 2006-08-30 Texas Instruments France Priority arbitration based on current task and MMU
US6604177B1 (en) * 2000-09-29 2003-08-05 Hewlett-Packard Development Company, L.P. Communication of dissimilar data between lock-stepped processors
US6604717B2 (en) * 2000-11-15 2003-08-12 Stanfield Mccoy J. Bag holder
US7017073B2 (en) * 2001-02-28 2006-03-21 International Business Machines Corporation Method and apparatus for fault-tolerance via dual thread crosschecking
US6704887B2 (en) * 2001-03-08 2004-03-09 The United States Of America As Represented By The Secretary Of The Air Force Method and apparatus for improved security in distributed-environment voting
US6971043B2 (en) * 2001-04-11 2005-11-29 Stratus Technologies Bermuda Ltd Apparatus and method for accessing a mass storage device in a fault-tolerant server
US6928583B2 (en) * 2001-04-11 2005-08-09 Stratus Technologies Bermuda Ltd. Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
US7076510B2 (en) * 2001-07-12 2006-07-11 Brown William P Software raid methods and apparatuses including server usage based write delegation
US6754763B2 (en) * 2001-07-30 2004-06-22 Axis Systems, Inc. Multi-board connection system for use in electronic design automation
US6859866B2 (en) * 2001-10-01 2005-02-22 International Business Machines Corporation Synchronizing processing of commands invoked against duplexed coupling facility structures
US7194671B2 (en) * 2001-12-31 2007-03-20 Intel Corporation Mechanism handling race conditions in FRC-enabled processors
US7076397B2 (en) * 2002-10-17 2006-07-11 Bmc Software, Inc. System and method for statistical performance monitoring
US6983337B2 (en) * 2002-12-18 2006-01-03 Intel Corporation Method, system, and program for handling device interrupts
US7526757B2 (en) * 2004-01-14 2009-04-28 International Business Machines Corporation Method and apparatus for maintaining performance monitoring structures in a page table for use in monitoring performance of a computer program
US7231543B2 (en) * 2004-01-14 2007-06-12 Hewlett-Packard Development Company, L.P. Systems and methods for fault-tolerant processing with processor regrouping based on connectivity conditions
JP2005259030A (ja) * 2004-03-15 2005-09-22 Sharp Corp Performance evaluation device, performance evaluation method, program, and computer-readable recording medium
US7162666B2 (en) * 2004-03-26 2007-01-09 Emc Corporation Multi-processor system having a watchdog for interrupting the multiple processors and deferring preemption until release of spinlocks
US20050240806A1 (en) * 2004-03-30 2005-10-27 Hewlett-Packard Development Company, L.P. Diagnostic memory dump method in a redundant processor
US7308605B2 (en) * 2004-07-20 2007-12-11 Hewlett-Packard Development Company, L.P. Latent error detection
US7328331B2 (en) * 2005-01-25 2008-02-05 Hewlett-Packard Development Company, L.P. Method and system of aligning execution point of duplicate copies of a user program by copying memory stores

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4481578A (en) * 1982-05-21 1984-11-06 Pitney Bowes Inc. Direct memory access data transfer system for use with plural processors
US6073251A (en) * 1989-12-22 2000-06-06 Compaq Computer Corporation Fault-tolerant computer system with online recovery and reintegration of redundant components
US6263452B1 (en) * 1989-12-22 2001-07-17 Compaq Computer Corporation Fault-tolerant computer system with online recovery and reintegration of redundant components
US5317752A (en) * 1989-12-22 1994-05-31 Tandem Computers Incorporated Fault-tolerant computer system with auto-restart after power-fall
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5111384A (en) * 1990-02-16 1992-05-05 Bull Hn Information Systems Inc. System for performing dump analysis
US5295259A (en) * 1991-02-05 1994-03-15 Advanced Micro Devices, Inc. Data cache and method for handling memory errors during copy-back
US5884019A (en) * 1995-08-07 1999-03-16 Fujitsu Limited System and method for collecting dump information in a multi-processor data processing system
US5999933A (en) * 1995-12-14 1999-12-07 Compaq Computer Corporation Process and apparatus for collecting a data structure of a memory dump into a logical table
US5781558A (en) * 1996-08-14 1998-07-14 International Computers Limited Diagnostic memory access
US6141635A (en) * 1998-06-12 2000-10-31 Unisys Corporation Method of diagnosing faults in an emulated computer system via a heterogeneous diagnostic program
US6314501B1 (en) * 1998-07-23 2001-11-06 Unisys Corporation Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory
US20030037178A1 (en) * 1998-07-23 2003-02-20 Vessey Bruce Alan System and method for emulating network communications between partitions of a computer system
US6263373B1 (en) * 1998-12-04 2001-07-17 International Business Machines Corporation Data processing system and method for remotely controlling execution of a processor utilizing a test access port
US6543010B1 (en) * 1999-02-24 2003-04-01 Hewlett-Packard Development Company, L.P. Method and apparatus for accelerating a memory dump
US20020144175A1 (en) * 2001-03-28 2002-10-03 Long Finbarr Denis Apparatus and methods for fault-tolerant computing using a switching fabric
US20030014521A1 (en) * 2001-06-28 2003-01-16 Jeremy Elson Open platform architecture for shared resource access management
US20030145157A1 (en) * 2002-01-31 2003-07-31 Smullen James R. Expedited memory dumping and reloading of computer processors
US20060136784A1 (en) * 2004-12-06 2006-06-22 Microsoft Corporation Controlling software failure data reporting and responses

Cited By (81)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050223275A1 (en) * 2004-03-30 2005-10-06 Jardine Robert L Performance data access
US20050246587A1 (en) * 2004-03-30 2005-11-03 Bernick David L Method and system of determining whether a user program has made a system level call
US20060020852A1 (en) * 2004-03-30 2006-01-26 Bernick David L Method and system of servicing asynchronous interrupts in multiple processors executing a user program
US7434098B2 (en) 2004-03-30 2008-10-07 Hewlett-Packard Development Company, L.P. Method and system of determining whether a user program has made a system level call
US20080178191A1 (en) * 2004-07-22 2008-07-24 International Business Machines Corporation Updating i/o capability of a logically-partitioned computer system
US20080104606A1 (en) * 2004-07-22 2008-05-01 International Business Machines Corporation Apparatus and method for updating i/o capability of a logically-partitioned computer system
US8131891B2 (en) * 2004-07-22 2012-03-06 International Business Machines Corporation Updating I/O capability of a logically-partitioned computer system
US8112561B2 (en) 2004-07-22 2012-02-07 International Business Machines Corporation Updating I/O capability of a logically-partitioned computer system
US20060107111A1 (en) * 2004-10-25 2006-05-18 Michaelis Scott L System and method for reintroducing a processor module to an operating system after lockstep recovery
US7818614B2 (en) * 2004-10-25 2010-10-19 Hewlett-Packard Development Company, L.P. System and method for reintroducing a processor module to an operating system after lockstep recovery
US20060107112A1 (en) * 2004-10-25 2006-05-18 Michaelis Scott L System and method for establishing a spare processor for recovering from loss of lockstep in a boot processor
US7627781B2 (en) 2004-10-25 2009-12-01 Hewlett-Packard Development Company, L.P. System and method for establishing a spare processor for recovering from loss of lockstep in a boot processor
US7624302B2 (en) 2004-10-25 2009-11-24 Hewlett-Packard Development Company, L.P. System and method for switching the role of boot processor to a spare processor responsive to detection of loss of lockstep in a boot processor
US20060107114A1 (en) * 2004-10-25 2006-05-18 Michaelis Scott L System and method for using information relating to a detected loss of lockstep for determining a responsive action
US7516359B2 (en) 2004-10-25 2009-04-07 Hewlett-Packard Development Company, L.P. System and method for using information relating to a detected loss of lockstep for determining a responsive action
US7502958B2 (en) * 2004-10-25 2009-03-10 Hewlett-Packard Development Company, L.P. System and method for providing firmware recoverable lockstep protection
US20060107107A1 (en) * 2004-10-25 2006-05-18 Michaelis Scott L System and method for providing firmware recoverable lockstep protection
US20060090064A1 (en) * 2004-10-25 2006-04-27 Michaelis Scott L System and method for switching the role of boot processor to a spare processor responsive to detection of loss of lockstep in a boot processor
US20060143534A1 (en) * 2004-12-28 2006-06-29 Dall Elizabeth J Diagnostic memory dumping
US7383471B2 (en) * 2004-12-28 2008-06-03 Hewlett-Packard Development Company, L.P. Diagnostic memory dumping
US20060168439A1 (en) * 2005-01-26 2006-07-27 Fujitsu Limited Memory dump program boot method and mechanism, and computer-readable storage medium
US7302559B2 (en) * 2005-01-26 2007-11-27 Fujitsu Limited Memory dump program boot method and mechanism, and computer-readable storage medium
US20060190700A1 (en) * 2005-02-22 2006-08-24 International Business Machines Corporation Handling permanent and transient errors using a SIMD unit
US20080195836A1 (en) * 2005-02-23 2008-08-14 Hewlett-Packard Development Company, L.P. Method or Apparatus for Storing Data in a Computer System
US20060242456A1 (en) * 2005-04-26 2006-10-26 Kondo Thomas J Method and system of copying memory from a source processor to a target processor by duplicating memory writes
US7590885B2 (en) * 2005-04-26 2009-09-15 Hewlett-Packard Development Company, L.P. Method and system of copying memory from a source processor to a target processor by duplicating memory writes
US20090217092A1 (en) * 2005-08-08 2009-08-27 Reinhard Weiberle Method and Device for Controlling a Computer System Having At Least Two Execution Units and One Comparator Unit
US20070101191A1 (en) * 2005-10-31 2007-05-03 Nec Corporation Memory dump method, computer system, and memory dump program
US20070124347A1 (en) * 2005-11-30 2007-05-31 Oracle International Corporation Database system configured for automatic failover with no data loss
US20070124522A1 (en) * 2005-11-30 2007-05-31 Ellison Brandon J Node detach in multi-node system
US7627584B2 (en) * 2005-11-30 2009-12-01 Oracle International Corporation Database system configured for automatic failover with no data loss
US7668879B2 (en) * 2005-11-30 2010-02-23 Oracle International Corporation Database system configured for automatic failover with no data loss
US20070124348A1 (en) * 2005-11-30 2007-05-31 Oracle International Corporation Database system configured for automatic failover with no data loss
US8127099B2 (en) * 2006-12-26 2012-02-28 International Business Machines Corporation Resource recovery using borrowed blocks of memory
US7743285B1 (en) * 2007-04-17 2010-06-22 Hewlett-Packard Development Company, L.P. Chip multiprocessor with configurable fault isolation
US20080263391A1 (en) * 2007-04-20 2008-10-23 International Business Machines Corporation Apparatus, System, and Method For Adapter Card Failover
US8010506B2 (en) * 2007-11-20 2011-08-30 Fujitsu Limited Information processing system and network logging information processing method
US20090132565A1 (en) * 2007-11-20 2009-05-21 Fujitsu Limited Information processing system and network logging information processing method
US20100318325A1 (en) * 2007-12-21 2010-12-16 Phoenix Contact Gmbh & Co. Kg Signal processing device
WO2009083116A1 (de) 2007-12-21 2009-07-09 Phoenix Contact Gmbh & Co. Kg Signal processing device
US8965735B2 (en) 2007-12-21 2015-02-24 Phoenix Contact Gmbh & Co. Kg Signal processing device
US20100131721A1 (en) * 2008-11-21 2010-05-27 Richard Title Managing memory to support large-scale interprocedural static analysis for security problems
US8429633B2 (en) * 2008-11-21 2013-04-23 International Business Machines Corporation Managing memory to support large-scale interprocedural static analysis for security problems
US20100185838A1 (en) * 2009-01-16 2010-07-22 Foxnum Technology Co., Ltd. Processor assigning control system and method
TWI448847B (zh) * 2009-02-27 2014-08-11 Foxnum Technology Co Ltd Processor assigning control system and control method thereof
US20100275065A1 (en) * 2009-04-27 2010-10-28 Honeywell International Inc. Dual-dual lockstep processor assemblies and modules
US7979746B2 (en) 2009-04-27 2011-07-12 Honeywell International Inc. Dual-dual lockstep processor assemblies and modules
US8719622B2 (en) * 2010-12-27 2014-05-06 International Business Machines Corporation Recording and preventing crash in an appliance
US20120166893A1 (en) * 2010-12-27 2012-06-28 International Business Machines Corporation Recording and Preventing Crash in an Appliance
US8635492B2 (en) * 2011-02-15 2014-01-21 International Business Machines Corporation State recovery and lockstep execution restart in a system with multiprocessor pairing
US8671311B2 (en) 2011-02-15 2014-03-11 International Business Machines Corporation Multiprocessor switch with selective pairing
US20120210162A1 (en) * 2011-02-15 2012-08-16 International Business Machines Corporation State recovery and lockstep execution restart in a system with multiprocessor pairing
US8930752B2 (en) 2011-02-15 2015-01-06 International Business Machines Corporation Scheduler for multiprocessor system switch with selective pairing
US20140040670A1 (en) * 2011-04-22 2014-02-06 Fujitsu Limited Information processing device and processing method for information processing device
EP2701063A1 (en) * 2011-04-22 2014-02-26 Fujitsu Limited Information processing device and information processing device processing method
EP2701063A4 (en) * 2011-04-22 2014-05-07 Fujitsu Ltd Information processing device, method of processing information processing device
US9448871B2 (en) * 2011-04-22 2016-09-20 Fujitsu Limited Information processing device and method for selecting processor for memory dump processing
WO2013174490A1 (de) 2012-05-24 2013-11-28 Phoenix Contact Gmbh & Co. Kg Analog signal input circuit with a number of analog signal acquisition channels
US9658267B2 (en) 2012-05-24 2017-05-23 Phoenix Contact Gmbh & Co. Kg Analog signal input circuit to process analog input signals for the safety of a process
US9298536B2 (en) 2012-11-28 2016-03-29 International Business Machines Corporation Creating an operating system dump
US9436536B2 (en) 2013-07-26 2016-09-06 Fujitsu Limited Memory dump method, information processing apparatus, and non-transitory computer-readable storage medium
EP2829974A3 (en) * 2013-07-26 2015-12-23 Fujitsu Limited Memory dump method, information processing apparatus and program
US20150161032A1 (en) * 2013-12-05 2015-06-11 Fujitsu Limited Information processing apparatus, information processing method, and storage medium
US9519534B2 (en) * 2013-12-05 2016-12-13 Fujitsu Limited Information processing in response to failure of apparatus, method, and storage medium
US10102052B2 (en) 2014-01-29 2018-10-16 Hewlett Packard Enterprise Development Lp Dumping resources
US11263012B2 (en) 2014-11-21 2022-03-01 Oracle International Corporation Method for migrating CPU state from an inoperable core to a spare core
US10528351B2 (en) 2014-11-21 2020-01-07 Oracle International Corporation Method for migrating CPU state from an inoperable core to a spare core
US11709742B2 (en) 2014-11-21 2023-07-25 Oracle International Corporation Method for migrating CPU state from an inoperable core to a spare core
US20160147534A1 (en) * 2014-11-21 2016-05-26 Oracle International Corporation Method for migrating cpu state from an inoperable core to a spare core
US9710273B2 (en) * 2014-11-21 2017-07-18 Oracle International Corporation Method for migrating CPU state from an inoperable core to a spare core
US20160266960A1 (en) * 2015-03-11 2016-09-15 Fujitsu Limited Information processing apparatus and kernel dump method
US10089195B2 (en) * 2015-09-30 2018-10-02 Robert Bosch Gmbh Method for redundant processing of data
US10565056B2 (en) 2016-06-06 2020-02-18 International Business Machines Corporation Parallel data collection and recovery for failing virtual computer processing system
US9971650B2 (en) 2016-06-06 2018-05-15 International Business Machines Corporation Parallel data collection and recovery for failing virtual computer processing system
US10521327B2 (en) 2016-09-29 2019-12-31 2236008 Ontario Inc. Non-coupled software lockstep
US11474711B2 (en) * 2018-05-29 2022-10-18 Seiko Epson Corporation Circuit device, electronic device, and mobile body
US11221899B2 (en) * 2019-09-24 2022-01-11 Arm Limited Efficient memory utilisation in a processing cluster having a split mode and a lock mode
US20210224184A1 (en) * 2019-12-26 2021-07-22 Anthem, Inc. Automation Testing Tool Framework
US11615018B2 (en) * 2019-12-26 2023-03-28 Anthem, Inc. Automation testing tool framework
US20230074108A1 (en) * 2021-09-03 2023-03-09 SK Hynix Inc. Memory system and operating method thereof
US12014074B2 (en) * 2021-09-03 2024-06-18 SK Hynix Inc. System and method for storing dump data

Also Published As

Publication number Publication date
CN1696903A (zh) 2005-11-16
US20050246581A1 (en) 2005-11-03
US20050246587A1 (en) 2005-11-03
US20050223275A1 (en) 2005-10-06
US7890706B2 (en) 2011-02-15
CN100472456C (zh) 2009-03-25
CN1690970A (zh) 2005-11-02
US20050223178A1 (en) 2005-10-06
US7434098B2 (en) 2008-10-07

Similar Documents

Publication Publication Date Title
US20050240806A1 (en) Diagnostic memory dump method in a redundant processor
Bernick et al. NonStop® advanced architecture
US5317726A (en) Multiple-processor computer system with asynchronous execution of identical code streams
Jewett Integrity S2: A fault-tolerant Unix platform
EP0372579B1 (en) High-performance computer system with fault-tolerant capability; method for operating such a system
US5384906A (en) Method and apparatus for synchronizing a plurality of processors
US5890003A (en) Interrupts between asynchronously operating CPUs in fault tolerant computer system
Bernstein Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
US7496786B2 (en) Systems and methods for maintaining lock step operation
US5958070A (en) Remote checkpoint memory system and protocol for fault-tolerant computer system
AU753120B2 (en) Fault resilient/fault tolerant computing
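JP2500038B2 (ja) Multiprocessor computer system, fault-tolerant processing method, and data processing system
EP0433979A2 (en) Fault-tolerant computer system with /config filesystem
CN1755660B (zh) Diagnostic memory dump method in a redundant processor
JP3030658B2 (ja) Computer system with power failure countermeasures and method of operating the same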
US5473770A (en) Fault-tolerant computer system with hidden local memory refresh
US20040193735A1 (en) Method and circuit arrangement for synchronization of synchronously or asynchronously clocked processor units
CN100442248C (zh) Computer system synchronization unit for avoiding races
Tamir Self-checking self-repairing computer nodes using the Mirror Processor
Gold et al. Tolerating Processor Failures in a Distributed Shared-Memory Multiprocessor
Jeffery Virtual lockstep for fault tolerance and architectural vulnerability analysis
Siewiorek et al. C. vmp: the analysis, architecture and implementation of a fault tolerant multiprocessor
Zhonghong et al. Research on architecture and design principles of COTS components based generic fault-tolerant computer

Legal Events

Date Code Title Description
AS Assignment

Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BRUCKERT, WILLIAM F.;KLECKA, JAMES S.;SMULLEN, JAMES R.;REEL/FRAME:015850/0070;SIGNING DATES FROM 20040803 TO 20040901

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION