WO2001024007A2 - Method and apparatus for processing errors in a computer system - Google Patents

Method and apparatus for processing errors in a computer system

Info

Publication number
WO2001024007A2
Authority
WO
WIPO (PCT)
Prior art keywords
error
module
packet
request
registers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2000/025845
Other languages
English (en)
French (fr)
Other versions
WO2001024007A3 (en)
Inventor
John S. Keen
Azmeer Salleh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Graphics Properties Holdings Inc
Original Assignee
Silicon Graphics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Silicon Graphics Inc
Priority to EP00963680A (EP1221229A2)
Priority to JP2001526709A (JP2003524225A)
Publication of WO2001024007A2
Publication of WO2001024007A3
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/24 Handling requests for interconnection or transfer for access to input/output bus using interrupt

Definitions

  • The present invention relates in general to computer system signal processing and more particularly to a method and apparatus for processing errors in a computer system.
  • The design of a computer system includes mechanisms for detecting and responding to any errors that may occur during operation. After a computer system's hardware detects the presence of an error, the computer system's software is often notified of the occurrence of the error and instructed to take appropriate action. Software development on prototype chips during laboratory testing is hampered if some errors cannot be induced, because the code developed to handle these errors cannot be tested. For example, there may be no capability to generate incoming packets that have an invalid command encoding. Therefore, it is desirable to provide an efficient technique to identify and capture errors that occur during computer system operation. It is also desirable to provide a capability to induce errors into a computer system in order to test error handling software.
  • A need has arisen for a technique to identify errors, capture information about them, and provide a capability to induce the occurrence of errors in a computer system.
  • A method and apparatus for processing errors in a computer system is provided that substantially eliminates or reduces disadvantages and problems associated with conventional error processing techniques.
  • An apparatus for processing errors in a computer system is provided that includes a request module that can receive incoming packets.
  • A processor module can identify a write operation specified by an incoming request packet.
  • The processor module determines a register specified by the incoming request packet upon which to perform the operation.
  • A registers module maintains registers within which the write operation is performed.
  • The incoming request packet specifies instructions for how to inject an error into the computer system.
  • The processor module performs a write operation by writing information from the incoming request packet into one of the header and data registers of the registers module.
  • The processor module sets an error bit to trigger processing of the injected error.
  • The request module receives a request packet and determines whether the request packet has an error.
  • If the request module determines that there is no error in the request packet, it transfers the request packet to the processor module for processing. If the request module identifies an error in the request packet, it instead stores header and data information associated with the request packet in the header and data registers of the registers module.
  • The request module sets an error bit in an error register of the registers module to indicate that an error has been identified in the request packet.
  • FIGURE 1 illustrates a block diagram of a computer system;
  • FIGURE 2 illustrates a block diagram of a node controller in the computer system;
  • FIGURE 3 illustrates a block diagram of a local block unit in the node controller;
  • FIGURE 4 illustrates an example of an error register used in a registers module of the local block unit;
  • FIGURE 5 illustrates an example of a mask register used in the registers module of the local block unit;
  • FIGURES 6A and 6B illustrate an example of header registers used in the registers module of the local block unit;
  • FIGURE 7 illustrates an example of a data register used in the registers module of the local block unit.
  • FIGURE 1 is a block diagram of a computer system 10.
  • Computer system 10 includes a plurality of node controllers 12 interconnected by a network 14. Each node controller 12 processes data and traffic both internally and with other node controllers 12 within computer system 10 over network 14. Each node controller 12 may communicate with a local processor 16, a local memory device 17, and a local input/output device 18.
  • FIGURE 2 is a block diagram of node controller 12 used in a multi-processor computer system 10.
  • Node controller 12 includes a network interface unit 20, a memory directory interface unit 22, a processor interface unit 24, an input/output interface unit 26, a local block unit 28, and a crossbar unit 30.
  • Network interface unit 20 may provide a communication link to network 14 in order to transfer data, messages, and other traffic to other node controllers 12 in computer system 10.
  • Processor interface unit 24 may provide a communication link with one or more local processors 16.
  • Memory directory interface unit 22 may provide a communication link with one or more local memory devices 17.
  • Input/output interface unit 26 may provide a communication link with one or more local input/output devices 18.
  • Local block unit 28 is dedicated to processing invalidation requests and PIO requests from local processor 16 or from a remote processor associated with a remote node controller 12.
  • Crossbar unit 30 arbitrates the transfer of data, messages, and other traffic for node controller 12.
  • FIGURE 3 is a block diagram of local block unit 28.
  • Local block unit 28 handles the error processing for node controller 12.
  • Local block unit 28 includes a request module 30, an invalidation module 32, a processor module 34, an output module 36, a vector module 38, a registers module 40, a reply module 42, and a clock module 44.
  • Request module 30 receives incoming request packets and determines what is to be done with the received request packets.
  • Incoming request packets may include various types of requests such as a normal programmed input/output (PIO) write request, a normal PIO read request, a vector PIO write request, a vector PIO read request, and a local invalidation request.
  • After receiving an entire incoming request packet, request module 30 identifies the type of request packet that has been received. For request packets requiring a PIO read or write operation, request module 30 activates processor module 34, which is responsible for servicing the PIO request. For a local invalidation request, request module 30 activates invalidation module 32, which is responsible for servicing the local invalidation request.
  • If request module 30 does not identify the request packet as a PIO or local invalidation request, the received request packet is considered to be an error. In the case of an error, request module 30 activates registers module 40 for error notification and capture of the packet's contents.
  • Invalidation module 32 services local invalidation requests identified by request module 30. Upon receiving a local invalidation request, invalidation module 32 checks for a legal encoding in the local invalidation request. If the encoding is illegal, invalidation module 32 notifies registers module 40 of the error so that the error may be captured. If the encoding is legal, invalidation module 32 generates an invalidation request packet or an invalidation acknowledgment reply packet on behalf of every processor interface unit 24 indicated in the local invalidation request.
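
The routing performed by request module 30 can be illustrated with a minimal Python sketch. The command encodings and the module names returned as strings are hypothetical, since the description gives no actual encodings. The sketch only shows the three-way decision: PIO requests go to the processor module, local invalidation requests go to the invalidation module, and anything else is treated as an error and handed to the registers module.

```python
# Hypothetical command encodings: the description gives no actual values.
PIO_COMMANDS = {0x01: "pio_read", 0x02: "pio_write"}
INVALIDATION_COMMANDS = {0x10: "local_invalidation"}


def route_request(command: int) -> str:
    """Return the name of the module that would be activated for a command."""
    if command in PIO_COMMANDS:
        return "processor_module"
    if command in INVALIDATION_COMMANDS:
        return "invalidation_module"
    # Any other encoding is considered an error; the registers module is
    # activated for error notification and capture of the packet contents.
    return "registers_module"
```

Under these assumed encodings, `route_request(0xFF)` falls through to the registers module, which corresponds to the invalid command encoding case mentioned in the background.
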
  • Processor module 34 services PIO read and write requests, which may target local registers in any of the memory directory interface unit 22, network interface unit 20, crossbar unit 30, and local block unit 28.
  • Processor module 34 decodes a destination address from within the PIO request to determine the particular unit in which the register specified in the request resides and ensures that the source of the request has authority to perform the operation. If the source of the request has authority to perform the operation, processor module 34 coordinates with the particular unit in which the specified register resides in order to carry out the operation. If not, then the operation is not performed.
  • Processor module 34 is responsible for returning an appropriate reply in response to the PIO request.
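
The decode, authorize, and reply sequence of processor module 34 might look like the following Python sketch. The address ranges, the set of authorized sources, and the reply names are all assumptions made for illustration. The description only states that the destination address selects a unit, that the source must have authority, and that an appropriate reply is returned.

```python
# Hypothetical address map: (low, high, unit name). Not from the description.
UNIT_RANGES = [
    (0x0000, 0x0FFF, "memory_directory_interface_unit"),
    (0x1000, 0x1FFF, "network_interface_unit"),
    (0x2000, 0x2FFF, "crossbar_unit"),
    (0x3000, 0x3FFF, "local_block_unit"),
]


def decode_unit(address):
    """Return the unit whose register range contains the address, or None."""
    for low, high, unit in UNIT_RANGES:
        if low <= address <= high:
            return unit
    return None


def service_pio_write(registers, address, value, source, authorized_sources):
    """Perform the write only for a known unit and an authorized source."""
    unit = decode_unit(address)
    if unit is None or source not in authorized_sources:
        # The operation is not performed; the reply name is an assumption.
        return "error_reply"
    registers[address] = value
    return "ack_reply"
```

A reply is produced on both paths, which matches the statement that the processor module is responsible for replying to every PIO request.
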
  • Output module 36 is the passageway for outgoing request and reply packets from local block unit 28.
  • Output module 36 coordinates traffic from the modules of local block unit 28 that generate outgoing packets so that only one is able to transmit a request packet at a time and only one is able to transmit a reply packet at a time.
  • Outgoing reply and request packets leave output module 36 on separate virtual channels multiplexed on a common physical channel so that flits within these outgoing request and reply packets can be interleaved.
  • Vector module 38 formats and transmits vector PIO read or write requests according to contents of associated registers within registers module 40.
  • Registers module 40 maintains the state of local registers in local block unit 28.
  • Registers module 40 provides values of various local registers to other modules within local block unit 28.
  • Registers module 40 updates local registers in response to PIO write requests or other activity within local block unit 28 such as error capture and injection.
  • Registers module 40 also includes control parameters to assist clock module 44 to drive real time clock output signals from local block unit 28.
  • Reply module 42 handles incoming vector reply packets. After receiving a vector reply packet, reply module 42 notifies registers module 40 so that the information within the vector reply packet can be retained in associated local registers. If reply module 42 receives a reply packet that is not encoded as a vector reply packet, reply module 42 informs registers module 40 that an error has occurred so that registers module 40 can capture the error.
  • Registers module 40 includes several registers to identify and handle an error.
  • FIGURE 4 shows an example of an error register 50 in registers module 40.
  • Error register 50 provides a one-bit field for each of, in this example, eleven types of errors. Ten of these may occur as a result of an incorrect request or reply packet received by local block unit 28. The other error does not involve receipt of a packet but occurs as a result of unexpected behavior of the incoming real time clock signal received at clock module 44.
  • When an error of a given type is identified, registers module 40 sets the corresponding bit in error register 50.
  • System software can read the value of error register 50 through a normal PIO read operation and obtain information about what particular types of errors have occurred.
  • Upon setting a bit in error register 50, registers module 40 generates an interrupt signal 52 to drive an input to processor interface unit 24. Interrupt signal 52 indicates that an error has occurred and prompts system software to take appropriate action. Processor interface unit 24 selects a processor 16 to handle the error and causes the selected processor to interrupt its operation in order to invoke error handling software. Although the processor interrupt could have been triggered by transmitting a PIO write request packet to processor interface unit 24 that targets a local register in processor interface unit 24, several advantages are achieved by directly generating a dedicated interrupt signal 52 from registers module 40. Complications such as preparing and sending PIO write requests and receiving subsequent replies through the request and reply scheme of local block unit 28 are avoided. Moreover, the identified error may make it impossible for a PIO write request to be conveyed to processor interface unit 24. By providing a direct dedicated interrupt signal from registers module 40 to processor interface unit 24, a simpler and more reliable technique is employed to initiate an interrupt for error handling.
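
As a rough illustration of the behavior described for error register 50 and interrupt signal 52, the following Python sketch models the register as an integer with eleven one-bit fields and models the dedicated interrupt as a callback. The class name, the callback, and the bit numbering are hypothetical. The description does not assign bit positions to error types.

```python
NUM_ERROR_TYPES = 11  # eleven one-bit fields in the example of FIGURE 4


class ErrorRegister:
    """Toy model: setting any error bit also raises the interrupt."""

    def __init__(self, raise_interrupt):
        self.value = 0
        # Stand-in for the dedicated interrupt signal; a plain callable here.
        self._raise_interrupt = raise_interrupt

    def set_error(self, bit):
        if not 0 <= bit < NUM_ERROR_TYPES:
            raise ValueError("no such error field")
        self.value |= 1 << bit
        self._raise_interrupt()

    def read(self):
        """What system software would see through a normal PIO read."""
        return self.value
```

The callback is invoked directly from `set_error`, mirroring the point that the interrupt does not travel through the request and reply packet path.
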
  • FIGURE 5 shows an example of a mask register 60.
  • Mask register 60 allows software to clear out some error bits in error register 50 without affecting other error bits within error register 50.
  • Mask register 60 includes one-bit fields corresponding to each of the error types of error register 50.
  • System software may, through normal PIO write operations, set a field in mask register 60, causing registers module 40 to clear the corresponding field in error register 50. If a bit is not set in mask register 60, then registers module 40 leaves the corresponding field in error register 50 unchanged.
  • Through mask register 60, software can thus clear an individual bit in error register 50 without affecting any of its other bits.
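
The selective clearing performed through mask register 60 amounts to a bitwise AND with the complement of the mask. The sketch below is a minimal illustration under the assumption that both registers are eleven bits wide, as in the example above. The function and constant names are hypothetical.

```python
ERROR_FIELD_COUNT = 11
ALL_FIELDS = (1 << ERROR_FIELD_COUNT) - 1


def apply_mask(error_bits, mask_bits):
    """Clear every error field whose mask bit is set; keep the others."""
    return error_bits & ~mask_bits & ALL_FIELDS
```

For example, with error fields 0, 1, and 3 set, writing a mask with only bit 1 set leaves fields 0 and 3 untouched.
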
  • FIGURES 6A and 6B show examples of header registers 70 and 72 in registers module 40.
  • FIGURE 7 shows an example of a data register 80 in registers module 40.
  • Upon receiving an initial error, registers module 40 saves the contents of the offending packet's header in header registers 70 and 72 and the contents of the offending packet's data (if any) in data register 80.
  • A valid bit 74 in header register 70 is set and an overrun bit 76 is cleared.
  • A value is assigned to the type of error that occurred and is stored in an error type field 78.
  • The bit associated with the identified error type is set in error register 50.
  • Valid bit 74 indicates that header registers 70 and 72 and data register 80 contain information with respect to a packet that has caused an error. If a subsequent error occurs while valid bit 74 is set, overrun bit 76 is set, the appropriate bit in error register 50 is set, but the contents of the packet causing the subsequent error are discarded and not kept. Overrun bit 76 identifies that subsequent errors were received but associated packet contents were not captured. Though shown to capture and store header and data information from only a single error packet as a design choice, the system may be designed to capture and store header and data information for multiple error packets.
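
The first-error capture and overrun behavior can be sketched in Python as follows. This is an illustrative model only: the field names and the representation of the two header registers as a tuple are assumptions, and the error type value is reused as the bit index in the error register purely for simplicity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CaptureRegisters:
    error_bits: int = 0                # models the error register
    valid: bool = False                # models the valid bit
    overrun: bool = False              # models the overrun bit
    error_type: Optional[int] = None   # models the error type field
    header: Tuple[int, int] = (0, 0)   # models the two header registers
    data: int = 0                      # models the data register

    def record_error(self, error_type, header, data=0):
        # The error bit is set for every error, captured or not.
        self.error_bits |= 1 << error_type
        if self.valid:
            # A capture is already held: flag the overrun, drop the packet.
            self.overrun = True
            return
        self.valid = True
        self.overrun = False
        self.error_type = error_type
        self.header = header
        self.data = data
```

After two calls to `record_error`, the registers still hold the first packet's header and data, both error bits are set, and `overrun` is true.
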
  • All registers related to error processing remain intact despite an occurrence of a reset operation across node controller 12. This ensures that error states are not lost due to a system reset. Some errors may cause portions of node controller 12 to become inoperative so that error handling cannot proceed without a system reset. In this instance, system software will still have the opportunity to analyze the cause of the problem after a system reset has occurred.
  • Although local block unit 28 identifies and handles errors due to receipt of reply and request packets, it may also be used to inject errors for handling by system software.
  • To inject an error, one or more PIO write operations may be initiated by a processor 16. These PIO write operations are used to write desired test header and data information into header registers 70 and 72 and data register 80.
  • Either the same or another PIO write operation is generated to set a desired error bit in error register 50.
  • Setting of the error bit triggers activation of interrupt signal 52.
  • The appropriate processor then has its operations interrupted to handle the error by analyzing the header and data information injected into the header registers 70 and 72 and the data register 80. In this manner, any of the errors specified in error register 50 may be induced in a known circumstance in order to test the system's error handling software without forcing errors during normal operation, which may be difficult to induce.
  • For error injection, the software preferably performs one or more PIO write operations on registers in registers module 40.
  • Each PIO write operation preferably modifies the state of only a single register, as each PIO write operation preferably specifies exactly one unique address. Since error injection may require setting up several different registers (e.g., header registers 70 and 72 for header information and data register 80 for data information), several separate PIO write requests may be issued by the software.
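
The error injection sequence described above can be sketched as a series of single-register PIO writes followed by a write that sets the error bit. The register addresses, class, and function names below are hypothetical. It is also an assumption that a write to the error register ORs the written bits into the register, since the description does not specify this.

```python
# Hypothetical register addresses; the description does not give them.
HEADER_A, HEADER_B, DATA, ERROR = 0x00, 0x08, 0x10, 0x18


class RegistersModule:
    def __init__(self):
        self.regs = {HEADER_A: 0, HEADER_B: 0, DATA: 0, ERROR: 0}
        self.interrupt_count = 0  # stands in for the interrupt signal

    def pio_write(self, address, value):
        """Each PIO write names one address and changes one register."""
        if address not in self.regs:
            raise KeyError(address)
        if address == ERROR:
            self.regs[ERROR] |= value  # assumed OR semantics
            if value:
                self.interrupt_count += 1  # setting a bit raises the interrupt
        else:
            self.regs[address] = value


def inject_error(module, header_a, header_b, data, error_bit):
    """Load test header and data, then set the error bit to trigger handling."""
    module.pio_write(HEADER_A, header_a)
    module.pio_write(HEADER_B, header_b)
    module.pio_write(DATA, data)
    module.pio_write(ERROR, 1 << error_bit)
```

Writing the error bit last ensures that the header and data registers already hold the test values when the interrupt handler reads them.
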

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Multi Processors (AREA)
  • Communication Control (AREA)
PCT/US2000/025845 1999-09-30 2000-09-20 Method and apparatus for processing errors in a computer system Ceased WO2001024007A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP00963680A EP1221229A2 (en) 1999-09-30 2000-09-20 Method and apparatus for processing errors in a computer system
JP2001526709A JP2003524225A (ja) 1999-09-30 2000-09-20 コンピュータシステムのエラーを処理する方法及び装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/409,764 US6457146B1 (en) 1999-09-30 1999-09-30 Method and apparatus for processing errors in a computer system
US09/409,764 1999-09-30

Publications (2)

Publication Number Publication Date
WO2001024007A2 (en) 2001-04-05
WO2001024007A3 (en) 2002-01-10

Family

ID=23621860

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/025845 Ceased WO2001024007A2 (en) 1999-09-30 2000-09-20 Method and apparatus for processing errors in a computer system

Country Status (4)

Country Link
US (1) US6457146B1 (en)
EP (1) EP1221229A2 (en)
JP (1) JP2003524225A (en)
WO (1) WO2001024007A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108632142A (zh) * 2018-03-28 2018-10-09 华为技术有限公司 节点控制器的路由管理方法和装置

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751698B1 (en) * 1999-09-29 2004-06-15 Silicon Graphics, Inc. Multiprocessor node controller circuit and method
US7539134B1 (en) * 1999-11-16 2009-05-26 Broadcom Corporation High speed flow control methodology
US8024639B2 (en) 2006-06-23 2011-09-20 Schweitzer Engineering Laboratories, Inc. Software and methods to detect and correct data structure
US20080155293A1 (en) * 2006-09-29 2008-06-26 Schweitzer Engineering Laboratories, Inc. Apparatus, systems and methods for reliably detecting faults within a power distribution system
US20080080114A1 (en) * 2006-09-29 2008-04-03 Schweitzer Engineering Laboratories, Inc. Apparatus, systems and methods for reliably detecting faults within a power distribution system
US7900093B2 (en) * 2007-02-13 2011-03-01 Siemens Aktiengesellschaft Electronic data processing system and method for monitoring the functionality thereof
US8441768B2 (en) 2010-09-08 2013-05-14 Schweitzer Engineering Laboratories Inc Systems and methods for independent self-monitoring
US9007731B2 (en) 2012-03-26 2015-04-14 Schweitzer Engineering Laboratories, Inc. Leveraging inherent redundancy in a multifunction IED
US11323362B2 (en) 2020-08-07 2022-05-03 Schweitzer Engineering Laboratories, Inc. Resilience to single event upsets in software defined networks

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5121342A (en) 1989-08-28 1992-06-09 Network Communications Corporation Apparatus for analyzing communication networks
US5414713A (en) 1990-02-05 1995-05-09 Synthesis Research, Inc. Apparatus for testing digital electronic channels
JPH05241985A (ja) * 1992-03-03 1993-09-21 Mitsubishi Electric Corp 入出力制御装置
US5465250A (en) 1993-06-24 1995-11-07 National Semiconductor Corporation Hybrid loopback for FDDI-II slave stations
US5446726A (en) * 1993-10-20 1995-08-29 Lsi Logic Corporation Error detection and correction apparatus for an asynchronous transfer mode (ATM) network device
US5581705A (en) * 1993-12-13 1996-12-03 Cray Research, Inc. Messaging facility with hardware tail pointer and software implemented head pointer message queue for distributed memory massively parallel processing system
JP3164996B2 (ja) * 1995-03-15 2001-05-14 日本電気株式会社 シリアルデータ受信装置
JPH08272719A (ja) * 1995-03-30 1996-10-18 Mitsubishi Electric Corp 通信インタフェース回路
JP3936408B2 (ja) * 1995-03-31 2007-06-27 富士通株式会社 情報処理方法及び情報処理装置
US6012148A (en) * 1997-01-29 2000-01-04 Unisys Corporation Programmable error detect/mask utilizing bus history stack
US20010042176A1 (en) * 1997-09-05 2001-11-15 Erik E. Hagersten Skewed finite hashing function

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108632142A (zh) * 2018-03-28 2018-10-09 华为技术有限公司 节点控制器的路由管理方法和装置
CN108632142B (zh) * 2018-03-28 2021-02-12 华为技术有限公司 节点控制器的路由管理方法和装置

Also Published As

Publication number Publication date
US6457146B1 (en) 2002-09-24
EP1221229A2 (en) 2002-07-10
WO2001024007A3 (en) 2002-01-10
JP2003524225A (ja) 2003-08-12

Legal Events

Date Code Title Description
AK Designated states. Kind code of ref document: A2. Designated state(s): JP
AL Designated countries for regional patents. Kind code of ref document: A2. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE
121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
AK Designated states. Kind code of ref document: A3. Designated state(s): JP
AL Designated countries for regional patents. Kind code of ref document: A3. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE
ENP Entry into the national phase. Ref country code: JP. Ref document number: 2001 526709. Kind code of ref document: A. Format of ref document f/p: F
WWE WIPO information: entry into national phase. Ref document number: 2000963680. Country of ref document: EP
WWP WIPO information: published in national office. Ref document number: 2000963680. Country of ref document: EP