GB2456618A

GB2456618A - Delaying the stop-clock signal of a chip by a set amount of time so that error handling and recovery can be performed before the clock is stopped

Info

Publication number: GB2456618A
Application number: GB0822285A
Authority: GB
Inventors: Andreas Koenig; Matthias Klein; Manfred Walz; Thomas Buechner
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-01-15
Filing date: 2008-12-08
Publication date: 2009-07-22
Also published as: GB0822285D0

Abstract

Disclosed is a method and circuit for operating self-checking logic 16, 18, 28 in a computer processing chip 10. The chip has functional units for detecting errors 28, for tracing the errors 18, and for controlling the processor clock 16, such that a clock-stop signal is generated by the self-checking logic which is used for error management and recovery. When a stop-clock signal is generated the signal is intercepted 440, a delay 445 is defined during which error-related, chip internal error handling and/ or recovery preparation actions are processed 470. At the end of the predetermined delay 460 the clock-stop action is performed 490, 495. A warning message to firmware may be sent to help in error and recovery management. The delay may be configured according to the location of the failure, the time needed to communicate with the stop-clock signal to the clock mechanism on the chip and/or the time needed to collect and store debug data.

Description

Method and apparatus for handling, propagation and communication of clockstop situations in an I/O sub-system

1. BACKGROUND OF THE INVENTION

1.1 FIELD OF THE INVENTION

The present invention relates to the area of computer chip technology, and particularly the field of chip error management. In particular, it relates to a method and respective circuit and system for operating self-checking logic in a computer processing chip, wherein a clock-stop signal is generated by the self-checking logic which is used for error management and recovery.

1.2 DESCRIPTION AND DISADVANTAGES OF PRIOR ART

The present invention is applicable preferably in a multiple-chip processor cluster, particularly in an I/O subsystem of high performing server computers and in particular in mainframe computers. I/O subsystem is usually understood as a term describing system hardware that connects main memory and CPUs to controllers of peripheral devices over various interfaces, preferably standardized interfaces.

Most of today's computer chips contain self-checking logic and/or logic that observes the behaviour of the chip as well as its correct handling and processing of commands and data.

Figure 1 illustrates the most basic structural components of a prior art hardware and software environment used for a prior art self-checking logic in a computer chip.

The shown computer chip represents a state of the art I/O chip in a tree structure of mainframe computers. Unit 14 represents a link unit using a standardized interface like InfiniBand (ILU: InfiniBand Link Unit) to a next higher hierarchy of the I/O tree, while units 30, 32, 34 and 36 represent link units that provide the interfaces to the I/O chips on the next lower level of the I/O tree. They are implemented by way of example as STI Link Units, abbreviated as SLUs. STI is an abbreviation for "Self-Timed Interface" and denotes a prior art IBM proprietary interface for connecting an I/O Hub to I/O Bridge chips in IBM mainframe computers.

The 1:4 fan-out depicted in the drawing is typical for an I/O chip in an I/O tree structure.

The functional computation on the packets streaming through the shown I/O chip is done in various functional units between the north interface (ILU 14) and the south interfaces (SLUs 30 to 36).

Functional units (FU) 20, 22, 24 and 26 represent an exemplary arrangement of those functional units.

Distributed trace facilities that are located in a subset or even in all of the described link and functional units, are controlled by a centralized trace control unit abbreviated as TU, and represented by box 18.

Errors that are detected in the various functional units and link units by internal error detection mechanisms are reported to a central error handling unit (EHU) 28, which analyses the error situation that has occurred and triggers the required recovery and/or error isolation steps.

Step 110 represents an error condition that is detected by the internal error detection mechanisms in the functional unit 26.

This error condition is reported to the error handling unit 28 using dedicated wires as shown in step 120.

If the reported error condition is too severe to be handled within the functional unit 26 or by any other recovery mechanism within the chip, a clockstop request is issued to the central clock control unit (CCU) 16. This clockstop request is shown in step 130.

Keeping in mind this basic error management configuration on a prior art chip as depicted in figure 1, a person skilled in the art may appreciate the following: if such a chip detects an error or false behaviour that cannot be fixed by hardware or software mechanisms, or that might result in unpredictable behaviour and corruption of data or actions, these checking mechanisms are usually able to shut down the chip by stopping its internal clocks that drive the functional logic, and thereby prevent the failure from spreading within a system.

An article "IBM S/390 Parallel Enterprise Server G5 fault tolerance: A Historical Perspective", by Spainhower, L. et al, published in IBM Journal of Research and Development, Vol. 43, No. 5/6, September / November 1999, pp 863 to 873 summarizes chip error management in a historical overview and also describes the prior art management of fault tolerances of I/O subsystems.

A further article "Enhanced I/O subsystem recovery and availability on the IBM System z9", by Oakes, K.J. et al, published in IBM Journal of Research and Development, Vol. 51, No. 1/2, Jan/March 2007, pp 131 to 144, focuses on prior art problems of error recovery implemented by firmware. In a respective error recovery strategy a sequence of steps is performed including error detection, data capture, scheduling and executing recovery actions, software notifications, middleware notification, etc.

The main goal in this context is to preserve the integrity of customer data in order to avoid business critical situations. Severe hardware failures or firmware errors require the clock to be stopped in order to prevent errors from propagating within the I/O subsystem from one initially affected I/O chip. This is referred to as a clockstop error.

In such prior art implementations, the clockstop is executed unconditionally and immediately when such a critical situation is detected. The system usually detects the shutdown of the chip only when the chip has become inaccessible and the interfaces to the stopped chip no longer respond to requests. In many hardware/software environments and in many business-critical situations this kind of chip error management is not tolerable, because the unforeseen loss of components has a tremendous impact on the system. The time and actions needed within the system to analyse and react to such an unexpected outage can lead to a temporary, significant decrease of performance, or may even create a situation that leads to a total failure of the whole system. The system cannot execute controlled actions in advance to limit the impact of the outage, and chip internal mechanisms that could collect data about the root cause of the problem can no longer act once the clockstop has been executed.

1.3 OBJECTIVES OF THE INVENTION

The objective of the present invention is thus to provide an improved method and a respective circuit and system for operating self-checking logic in a computer processing chip.

2. SUMMARY AND ADVANTAGES OF THE INVENTION

This objective of the invention is achieved by the features stated in enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims. Reference should now be made to the appended claims.

According to the broadest aspect of the invention a method for operating self-checking logic in a computer processing chip is disclosed, preferably applicable in a multiple chip, tree-like processor cluster, particularly in an I/O subsystem, wherein a clock-stop signal is generated by the self-checking logic which is used for error management and recovery, which method is characterized by the steps of:

a) trapping, i.e. intercepting the clockstop signal in a first cycle of the processing unit or on the clockstop request path to the clock controlling unit,

b) defining or using a pre-defined delay time - preferably a number of cycles - during which recovery and error handling preparation actions - in particular also tracestop actions - are allowed to be processed,

c) performing preparation and execution of error-related, chip-internal actions, as well as signalling the forthcoming clockstop to system mechanisms which are external to the affected chip, during the delay, and

d) performing the clockstop only after the end of the predetermined delay.
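Steps a) to d) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit of the invention: the class name `DelayedClockstop`, the rule of one preparation action per cycle, and the log strings are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DelayedClockstop:
    """Toy model of steps a) to d): trap the request, wait, then stop."""

    delay_cycles: int                       # step b: pre-defined delay in cycles
    prep_actions: List[Callable[[], None]]  # step c: assumed one action per cycle
    log: List[str] = field(default_factory=list)

    def on_clockstop_request(self) -> None:
        # Step a: the request is trapped instead of reaching the clock control unit.
        self.log.append("intercepted")
        # Step c: announce the forthcoming clockstop to the outside world ...
        self.log.append("warning sent")
        # ... and run preparation actions while delay cycles remain.
        pending = list(self.prep_actions)
        for cycle in range(self.delay_cycles):
            if pending:
                pending.pop(0)()
                self.log.append(f"cycle {cycle}: action done")
        # Step d: only after the delay has elapsed is the clock stopped.
        self.log.append("clock stopped")
```

In this model, actions that do not fit into the delay are simply dropped, which mirrors the requirement that the clockstop is postponed but never cancelled.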

When, in a multiple processor cluster, a further step of communicating a warning message to neighbouring processors is performed, it is possible to set up a synchronization of error management between said multiple processor units. The error management of a processor cluster can thus be improved, and uncontrolled error spreading along the cluster is avoided.

The inventive method of communicating the forthcoming clockstop situation, and therefore the coming outage of a component within the system, allows the centralized system wide control mechanisms to act proactively to the new situation. Time-consuming investigations to analyse the unexpected outage are no longer required since the information about failing components is available before the outage becomes visible to the system. This also allows a system reconfiguration to bypass the failing components and/or to limit the impact on the whole system.

When the delay is made configurable according to at least one of the following criteria, it is possible to provide a customizable error management adjustable to the individual needs of any respective hardware/software environment:

- Depending on the location of the failure within the chip, relative to the boundaries of the affected chip, the delay needs to be short enough to make sure that the error does not spread across additional components.

- Subject to the above criterion, the delay should be at least as long as needed to guarantee the communication of the forthcoming clockstop to neighbouring components/chips or to a centralized system mechanism.

- Subject to the above criteria, the delay should be at least as long as chip internal mechanisms require to perform any desirable action within the chip, e.g. collecting and storing debug data.

An example of an adequate clockstop delay is 15 to 25 cycles for a unit located in the center region of the tree as shown in figure 3, whereas a more remote unit located near the chip boundary needs a delay of e.g. only 3 to 7 cycles.
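One possible reading of the three criteria is that the containment limit is an upper bound, while the notification time and the debug collection time are lower bounds that yield to it. The following sketch encodes that reading; the function name `choose_delay` and its parameters are hypothetical, and the numbers in the usage note are merely taken from the example ranges above.

```python
def choose_delay(spread_limit: int, notify_cycles: int, debug_cycles: int) -> int:
    """Pick a clockstop delay in cycles.

    spread_limit  -- longest delay before the error could spread further
    notify_cycles -- cycles needed to announce the forthcoming clockstop
    debug_cycles  -- cycles needed to collect and store debug data
    """
    # Lower bounds: long enough for both notification and debug collection.
    wanted = max(notify_cycles, debug_cycles)
    # Assumed priority: the containment bound overrides the lower bounds.
    return min(wanted, spread_limit)
```

For a unit in the center of the chip, `choose_delay(25, 12, 20)` gives 20 cycles, while for a unit near the boundary `choose_delay(7, 12, 20)` is capped at 7 cycles, so some preparation work may have to be skipped there.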

So, by means of the above features it is possible to delay the clockstop. Thus, it is possible to prepare the clockstop situation by executing or completing certain preparation actions before the clocks are stopped.

According to a preferred embodiment a new apparatus is provided, ready to be implemented on any given chip, that intercepts clockstop requests before they reach the clock control logic of the chip. It is able to delay the request to stop the clocks for a certain amount of time, either unconditionally or depending on pre-definable conditions. The time gained by delaying the clockstop can be used by the inventive apparatus to execute certain required actions within the chip, e.g. to capture and store debug data.

Furthermore, the apparatus can use special communication mechanisms, e.g. InfiniBand special flow control packets as described in the following subsection 2.1, to communicate the forthcoming clockstop to neighbouring components like attached chips via its interfaces or dedicated wires. This communication of the clockstop can solve certain problems of today's systems.

2.1 Special communication mechanisms:

It is disclosed to provide advantageously a communication protocol for inter-chip communication, wherein the protocol is based on a standard interface protocol, which is adapted to incorporate control, configuration and/or recovery information for computer chips, and the information is encapsulated within communication packets of a communication layer above the physical layer of the interface protocol.

One essential point of the new control traffic dedicated communication protocol according to the invention is that such a protocol allows reliable communication while requiring only a basically initialized connection of a main communication path. This communication bypasses all critical macros since the chip related information, for example control, configuration and/or recovery information, is encapsulated in a low layer of a standard communication protocol. Such communication enables recovery from an error, which is typically a deadlock, and re-establishment of the traffic. However, the new protocol may also be used during hardware initialization, in order to go around not yet sufficiently initialized hardware components for error recovery. Furthermore, during hardware initialization, the new protocol may be used to regularly/initially set up or configure non-initialized or not completely initialized hardware components. In any case an additional interface becomes superfluous, i.e. extra pins and wires are saved.

According to one preferred embodiment, the communication packets are manufacturer specific flow control packets defined by Opcodes (Operation Codes), which are not used by the standard interface protocol. That is, the basic structure of the standard communication protocol need not be changed. Proprietary enhancements may be introduced using open resources, which is easy and cost effective.

A variety of error cases may be handled if the Opcodes defining the communication packets each indicate different kinds of information. This extends the protocol to cover any failure occurring in control, configuration and/or recovery of chips, which makes it highly reliable. Furthermore, using different Opcodes, the information carried by the protocol is not restricted to mere failure management. For example, recovery of a system may require initialization of components. However, such a mechanism may also be employed in regular control of a system, e.g. for initially preparing macros in routing, credits etc. before they are able to take up operation.

The amount of information to be transferred may be increased if such information is split up into several data packets, each having a header and a sequence number field. This allows restoring a full message that exceeds the defined length of a single manufacturer specific flow control packet.
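The splitting and restoring of a longer message can be sketched as follows. The fragment layout used here, a tuple of sequence number, total fragment count and chunk, is an assumption made for illustration; the text only states that each packet has a header and a sequence number field.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (sequence number, total fragments, chunk).
Fragment = Tuple[int, int, bytes]


def split_message(message: bytes, chunk_size: int) -> List[Fragment]:
    """Split a message into numbered fragments of at most chunk_size bytes."""
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)] or [b""]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(fragments: List[Fragment]) -> bytes:
    """Restore the message; fragments may arrive in any order."""
    ordered = sorted(fragments, key=lambda f: f[0])
    total = ordered[0][1]
    # Every sequence number from 0 to total - 1 must be present exactly once.
    if [f[0] for f in ordered] != list(range(total)):
        raise ValueError("missing or duplicate fragment")
    return b"".join(f[2] for f in ordered)
```

The sequence number lets the receiver detect a lost fragment instead of silently acting on a truncated message.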

Preferably, the inventive enhancement of a standard interface protocol is made to an InfiniBand protocol, preferably of Version 1.2. InfiniBand (also named IB in the following) is a switched fabric communications link primarily used in high performance computing. Its features include quality of service and fail over, and it is designed to be scalable. The IB architecture specification defines a connection between processor nodes and high performance I/O nodes such as storage devices. Since this standard is widely used, application of present invention may be easily implemented.

In particular, the networking layer of the InfiniBand protocol may be used as the communication layer for transferring the chip related information. Such layer allows definition of a manufacturer specific subtype of flow control packets specified by the IB standard, which are also transferred on a very low layer of the IB communication protocol.

The IB standard defines a 32-bit flow control packet used to control the traffic flow on the link level. These packets contain a 4-bit OpCode field. However, only OpCode 0x0 and 0x1 are used by the IB standard. If Opcodes defining the communication packets are different from 0x0 and 0x1, the content of the remaining 28 bits is not defined by the IB specification, i.e. open for proprietary enhancement according to the invention.
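The use of the free OpCode values can be sketched as simple bit packing. The sketch assumes that the 4-bit OpCode occupies the most significant bits of the 32-bit word; the function names are hypothetical, and the link-level CRC that accompanies a real flow control packet is not modelled.

```python
from typing import Tuple

# The only OpCode values the text above says are used by the IB standard.
STANDARD_OPCODES = {0x0, 0x1}


def pack_vendor_packet(opcode: int, payload: int) -> int:
    """Build a 32-bit word from a non-standard OpCode and a 28-bit payload."""
    if not 0 <= opcode <= 0xF:
        raise ValueError("opcode must fit in 4 bits")
    if opcode in STANDARD_OPCODES:
        raise ValueError("opcode is reserved for standard flow control")
    if not 0 <= payload < (1 << 28):
        raise ValueError("payload must fit in 28 bits")
    # Assumption: the OpCode sits in the top 4 bits of the word.
    return (opcode << 28) | payload


def unpack_packet(word: int) -> Tuple[int, int, bool]:
    """Return (opcode, remaining 28 bits, is_manufacturer_specific)."""
    opcode = (word >> 28) & 0xF
    payload = word & 0x0FFF_FFFF
    return opcode, payload, opcode not in STANDARD_OPCODES
```

A receiver that only understands standard flow control can ignore words flagged as manufacturer specific, while an enhanced receiver interprets the 28 bits according to the OpCode.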

The communication protocol is implemented by a method for interchip communication, wherein a communication protocol (CP) based on a standard interface protocol (SIP) is used, the method comprising the steps of:

- determining chip related information comprising data relevant for at least one of the following: control, configuration, recovery;

- encapsulating the information within communication packets of a communication layer above the physical layer of the interface protocol;

- inserting the communication packets into a regular traffic flow of the sending chip;

- extracting the information from an incoming data stream on the receiving chip.
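The four steps listed above can be sketched on a list of packets. The tags `"data"` and `"ctrl"`, the tuple layout and both function names are hypothetical; a real link protocol engine would work on the packet formats of the interface protocol rather than on Python tuples.

```python
from typing import Iterable, List, Tuple

# Hypothetical packet model: (kind, body), where kind is "data" or "ctrl".
Packet = Tuple[str, bytes]


def insert_control(traffic: Iterable[Packet], info: bytes,
                   position: int) -> List[Packet]:
    """Encapsulate chip related info and insert it into the regular flow."""
    stream = list(traffic)
    stream.insert(position, ("ctrl", info))
    return stream


def extract_control(stream: Iterable[Packet]) -> Tuple[List[bytes], List[Packet]]:
    """Separate control info from the untouched data packets on the receiver."""
    info: List[bytes] = []
    data: List[Packet] = []
    for kind, body in stream:
        if kind == "ctrl":
            info.append(body)
        else:
            data.append((kind, body))
    return info, data
```

The data packets leave `extract_control` unchanged and in their original order, which reflects that the control information travels alongside the main data path without altering it.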

One essential point of the communication method according to the invention is that the advantages of both current access methods for control, configuration and recovery functions, i.e. dedicated interfaces and special command types, are combined. The encapsulated low-level communication can transfer all kinds of required messages and commands in both directions. It further allows a pretty reliable and direct access for nearly no additional costs, using the existing pins and wires. Moreover, it is not exposed to any kind of communication problem on the main data path, due to the fact that - besides the link protocol engine and the physical layer - no additional logical units involved in the main data communication are used, like routing, translation, buffering, checking etc.

The preferred embodiment of the invention is a processing unit for inter-chip communication, wherein on the one hand the unit is connected to a link protocol engine of a main interface to a neighboring chip, and on the other hand the unit is connected to control and configuration mechanisms of the own chip.

One essential point of the processing unit according to the invention is that architectured manufacturer specific flow control packets or any other comparable low level communication packets of the used interface protocol can be employed for the mentioned kind of control, configuration and recovery communication. This solution neither needs separate pins or wiring, nor is it exposed to most of the misconfiguration, failure or traffic backing problems in the main data path. Preferably, in order to save space and costs, the unit is integrally formed with a processing unit and/or a control unit of a computer chip.

2.2.

According to the present invention, system-wide recovery mechanisms, mostly realized in firmware, that are informed of a forthcoming clockstop for a certain chip do not need to execute long-lasting and "surgical" investigations when a chip becomes inaccessible. This allows the system recovery mechanisms to react more quickly to an error situation which, by use of the invention, is now known in advance.

3. BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and is not limited by the shape of the figures of the drawings in which:

Figure 1 illustrates the most basic structural components of a prior art hardware and software environment used for a prior art method,

Figure 2 illustrates the most basic structural components of an inventive hardware and software environment used for a preferred embodiment of the inventive method,

Figure 3 illustrates the control flow of the most important steps of a preferred embodiment of the inventive method.

4. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference to figure 2 illustrating the most basic structural components of an inventive hardware and software environment used for a preferred embodiment of the inventive method, a Clockstop/Tracestop Preparation Logic (CTPL) 40 is implemented in a set of I/O chips that are connected for example in a tree hierarchy.

With additional reference to figure 3, after an error has occurred in unit 26, see step 410, and has been communicated in a step 420 to EHU 28 as described above with reference to figure 1, the EHU will issue a clockstop request as known from prior art to the CCU 16.

The CTPL logic 40 provided according to the invention intercepts clockstop requests issued by the central error handling mechanism implemented in the error handling unit 28 (EHU) of these chips on their way to the clock control logic, see step 440 and 210 in figure 3. A counter within this logic 40 is started by the CTPL 40, step 450, as soon as the clockstop request is detected. When the clocks are later stopped, this counter contains the number of cycles that elapsed between the clockstop request and the point in time at which the clocks are actually stopped by the clock control unit 16 (CCU).

If enabled, the CTPL 40 thus allows the clockstop request to be delayed for a pre-definable number of cycles, or until a desired action has completed, see step 240 in figure 2.

Within this delay, a catalogue of small tasks can be completed advantageously as follows:

a) CTPL 40 will invoke the Trace Control Logic 18, in order to make it collect debug data from several buffer locations of the chip,

b) CTPL 40 will order the collected data according to the needs of a debug user according to any pre-configured scheme, and

c) CTPL 40 will store the data at any pre-configured storage location, preferably outside the defective chip, in order to enable a quick debug process.

Those actions will be done within the available delay time, sufficiently early before the clock is stopped and actions to access the respective chip components are no longer possible. Depending on the prevailing hardware and software environment and the business situation, further emergency tasks can be advantageously performed, such as properly shutting down interfaces of the affected chip to neighbouring chips or adjacently connected storage devices; selecting and activating a certain pre-configured emergency plan, by which pre-selected resources residing on different, unaffected systems can be allocated for such emergency use; and sending an error management report to any pre-configured storage in order to provide full, complete and clear information about the error to respective error management tools or error management users. The skilled reader will appreciate that further measures, here referred to as "emergency actions", can be taken in order to track and save the data which is known to be definitely required for the respective business situation. All these tasks or actions are symbolically represented by step 470 in figure 3.

So the loop is run multiple times, and respective actions can be completed. The end of the loop is reached when the counter reaches the predefined maximum value. Then, in a step 490, the CTPL 40 releases its interception and stops holding back the clockstop request. Consequently, this request is sent to the CCU 16, which in turn stops the clock, step 495.
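The counter loop can be modelled cycle by cycle. This is a behavioural sketch only: the class name `ClockstopCounter`, its attributes, and the choice to release the request on the cycle after the counter has reached its maximum are assumptions, not details given in the description.

```python
class ClockstopCounter:
    """Cycle-stepped toy model of the interception counter."""

    def __init__(self, max_cycles: int) -> None:
        self.max_cycles = max_cycles
        self.count = 0
        self.request_pending = False
        self.clock_stopped = False

    def request_clockstop(self) -> None:
        # Interception: hold the request and start counting (steps 440, 450).
        if not self.request_pending and not self.clock_stopped:
            self.request_pending = True
            self.count = 0

    def tick(self) -> None:
        """Advance the model by one clock cycle."""
        if self.clock_stopped or not self.request_pending:
            return
        if self.count >= self.max_cycles:
            # Loop exit: release the held request so the clock is stopped
            # (steps 490, 495).
            self.request_pending = False
            self.clock_stopped = True
        else:
            # One more cycle is available for preparation work (step 470).
            self.count += 1
```

After the clock has stopped, `count` still holds the number of cycles spent between the request and the stop, which corresponds to the counter value described for later inspection.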

In an alternate embodiment of the invention, the loop is exited only after a confirmatory message has been received by the EHU 28 and forwarded to the CTPL 40, saying that the - configurably defined most important - actions have successfully been completed.

In a preferred variation of the inventive method, each functional unit 20, 24, 26, but optionally also the management units 16, 18, 14, 28, stores an individual delay value which is reported together with an error code when the error is initially detected. This value is then used to determine the delay relevant for the loop condition 460. In this way, any desired error handling can be focused and individually adapted to the actual need of the business in which the chip is used.
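The per-unit delay value can be pictured as part of the error report itself. In the sketch below the record type `ErrorReport`, the table `UNIT_DELAYS` with its values, and the default delay are all hypothetical; only the idea that the reported delay feeds the loop condition comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    unit_id: int       # reference numeral of the reporting unit, e.g. 26
    error_code: int
    delay_cycles: int  # unit-specific delay reported with the error code


# Hypothetical per-unit delay values, in cycles.
UNIT_DELAYS = {20: 20, 24: 18, 26: 5}


def report_error(unit_id: int, error_code: int,
                 default_delay: int = 10) -> ErrorReport:
    """Build the report a unit sends when it first detects an error."""
    return ErrorReport(unit_id, error_code,
                       UNIT_DELAYS.get(unit_id, default_delay))


def loop_should_continue(counter: int, report: ErrorReport) -> bool:
    """Loop condition: keep delaying while the reported delay is not used up."""
    return counter < report.delay_cycles
```

With these values, an error reported by unit 26 leaves the loop after 5 cycles, whereas one reported by unit 20 is given 20 cycles.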

The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

The circuit as described above is part of the design for an integrated circuit chip. The chip design is created in a graphical computer programming language, and stored in a computer storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication of photolithographic masks, which typically include multiple copies of the chip design in question that are to be formed on a wafer. The photolithographic masks are utilized to define areas of the wafer (and/or the layers thereon) to be etched or otherwise processed.

Claims

1. A method for operating self-checking logic (16, 18, 28) in a computer processing chip (10) comprising respective functional units for detecting errors (28), for tracing said errors (18), for controlling the processor clock (16), wherein a clock-stop signal is generated by said self-checking logic which is used for error management and recovery, characterized by the steps of:

a. intercepting (440) said clock-stop signal,

b. defining (445) a delay during which error-related, chip internal error handling and/ or recovery preparation actions are allowed to be processed,

c. performing (470) the execution of said actions during said delay, and

d. performing (490, 495) the clockstop only after (460) the end of said predetermined delay.

2. The method according to claim 1, wherein said clock-stop signal is trapped in the path to the clock control unit (16).

3. The method according to claim 1, wherein said delay is determined by a predetermined number of clock cycles.

4. The method according to claim 1, wherein in a multiple processor cluster a further step of communicating a warning message to a firmware component is performed in order to further improve the error and recovery management.

5. The method according to claim 1 wherein said delay is configurable according to at least one of the following criteria:

a) the location of the failure within the error-affected chip relative to the boundaries of said chip, wherein said delay is pre-configured short enough to make sure that the error does not spread across additional components resident on said chip,

b) wherein said delay is selected at least as long as needed to guarantee the communication of the forthcoming clockstop to a system mechanism implemented centralized on said chip,

c) wherein said delay is selected to be at least as long as a chip-internal implemented mechanism requires to perform a respective one of a predetermined plurality of pre-configured actions.

6. The method according to the preceding claim, wherein one of said preconfigured actions is to collect and store debug data from diverse components implemented on said affected chip.

7. An electronic data processing system including self-checking logic (16, 18, 28) in a computer processing chip (10) comprising respective functional units for detecting errors (28), for tracing said errors (18), for controlling the processor clock (16), wherein a clock-stop signal is generated by said self-checking logic which is used for error management and recovery, having a clockstop/tracestop preparation logic (40) for performing the steps of:

a. intercepting (440) said clock-stop signal,

8. A computer program product for operating self-checking logic (16, 18, 28) in a computer processing chip (10) comprising respective functional units for detecting errors (28), for tracing said errors (18), for controlling the processor clock (16), wherein a clock-stop signal is generated by said self-checking logic which is used for error management and recovery, comprising a computer useable medium including a computer readable program, wherein the computer readable program includes a functional component that when executed on a computer causes the computer to perform the steps of:

a. intercepting (440) said clock-stop signal,