JPH04321140A

JPH04321140A - Method and device for pointing error occurring part

Info

Publication number: JPH04321140A
Application number: JP3090636A
Authority: JP
Inventors: Takanori Kinoshita; 孝徳木下; Kenji Korekata; 研二是方
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1991-04-22
Filing date: 1991-04-22
Publication date: 1992-11-11

Abstract

PURPOSE: To point out the replacement unit in which an error occurred by identifying the unit that detected the error earliest, that is, the unit whose counter holds the largest count value. CONSTITUTION: The parts of a computer system are organized into replaceable replacement units 1. For each unit 1, an error detecting means 2 detects errors in the unit's components, and an error signal holding means 3 holds the error detection signal. A counter 4 starts counting in response to the error signal output from the means 3. An error control means 5 transmits count stop signals simultaneously to the counters 4 based on the first error signal transmitted to it. A deciding means 6 checks the count value of each counter 4 and decides that the unit 1 whose counter 4 holds the largest count value has an error.

Description

[Detailed description of the invention]

[0001]

[Industrial Application Field] The present invention relates to a method and a device for pointing out an error occurrence part, which indicate the part to be replaced when an error occurs in a device composed of multiple LSIs (large-scale integrated circuits), printed circuit boards, and the like.

[0002]

[Description of the Related Art] When an error occurs in a computer system, the fault repair time must be kept short. To achieve this, the part where the error occurred must first be identified. There are several ways to do so, one of which is a program that locates the error occurrence part. In this approach, each replacement part is provided with a checker that detects the occurrence of an error and a region code (Region Code), which is a latch that self-holds the checker's error detection signal. When a fault occurs, the SCI, started from the machine check process, collects the region codes. (The SCI, or System Console Interface, is located between the SVP (Service Processor) and the devices that make up the system and controls the interface between the SVP and each device. The SVP reads and sets the state of each device via the SCI.) The program checks whether error information is latched in any of the collected region codes. If so, it finds the region code that was latched first and identifies the replacement part containing that region code as the part where the error occurred.

[0003] Link information is used for this identification. It describes how an error propagates to related parts and associates the cause and effect of the individual RCs (region codes). FIG. 8 shows an example of link information; (a) shows the link nodes of the RCs provided for each replacement part. RCA is the RC provided in replacement part A. The figure shows that replacement part A is connected to replacement parts B and G, and that replacement part B is connected to replacement parts C and D.

[0004] If an error occurs in replacement part A, the error detection signal is naturally latched in RCA. Because the error also propagates to the connected RCB and RCG, error signals are latched in RCB and RCG as well. Since errors propagate in this way, the replacement part where the error originated can be traced back by examining the RCs in which error signals are latched and applying the link information shown in FIG. 8(b).
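The trace-back described in [0003] and [0004] can be sketched as follows. This is a minimal illustration in Python, not the program used in the prior art: the dictionary `LINKS` encodes only the connections named for FIG. 8 (A to B and G, B to C and D), and the function name `find_origin` and the rule "a latched part with no latched upstream part is the origin" are assumptions made for this example.

```python
# Hypothetical link information: each replacement part maps to the parts
# its errors can propagate to (only the connections named for FIG. 8).
LINKS = {"A": ["B", "G"], "B": ["C", "D"]}


def find_origin(latched):
    """Return the latched parts that have no latched upstream part."""
    latched = set(latched)
    # A part counts as "downstream" if some latched part links to it.
    downstream = {
        child
        for parent, children in LINKS.items()
        if parent in latched
        for child in children
    }
    return sorted(latched - downstream)


print(find_origin({"A", "B", "G"}))  # ['A']
```

If an entry is missing from `LINKS` (for example, the link from A to G), the same call returns more than one candidate, which shows how strongly the result depends on the completeness of the hand-written link information.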

[0005]

[Problems to be Solved by the Invention] This link information is created by hand and must anticipate every possible error pattern, so it is difficult to make it complete. Because it is created manually, human error leads to omissions and mistakes in the link information, which often caused the error occurrence part to be pointed out incorrectly and prolonged the fault repair time.

[0006] Moreover, to check the created link information for errors and omissions, faults had to be generated artificially to verify that the error occurrence part was pointed out correctly, and this work took an enormous amount of time. Furthermore, the larger the system, the more link information must be created. Omissions and errors in the link information increase accordingly, which further increases the verification time and creates a vicious cycle.

[0007] The present invention has been made in view of the above problems. Its object is to provide a method and a device for pointing out an error occurrence part that point out, as the error occurrence part, the replacement unit that detected the error earliest and whose counter, started by that detection, therefore holds the largest count value.

[0008]

[Means for Solving the Problems] FIG. 1 shows the principle of the error occurrence part pointing device of the present invention. In the figure, 1 is a replacement unit, a replaceable unit of the parts constituting a computer system. Each replacement unit 1 is provided with an error detection means 2, which detects an error in a component and outputs an error detection signal; an error signal holding means 3, which holds the error detection signal; and a counter 4, which starts counting on a common clock in response to the error signal output by the error signal holding means 3. 5 is an error control means that simultaneously transmits a count stop signal to each counter 4 based on the error signal transmitted to it first. 6 is a determining means that determines that an error has occurred in the replacement unit 1 whose counter 4 holds the largest count value.

[0009] In addition, a plurality of the replacement units 1 are mounted on the same panel board, and the error control means 5 is provided for each panel board.

[0010]

[Operation] When the error detection means 2 detects an error in a component, it outputs an error detection signal. The error signal holding means 3 holds this error detection signal and then outputs an error signal. In response to the error signal, the counter 4 starts counting on the clock common to the error occurrence part pointing device. The error signal also travels along a predetermined communication path and arrives at the error control means 5 after a delay. The communication paths from the error control means 5 to the counters 4 have their delay times set so that the counter stop signal reaches every counter 4 at the same time. Therefore, when the error control means 5 transmits the counter stop signal, all counters 4 stop counting simultaneously. As a result, each counter 4 counts for the time it takes the error signal to travel from the error signal holding means 3 of its replacement unit 1 to the error control means 5, plus the time it takes the counter stop signal to travel from the error control means 5 to the counter 4, and then stops. Consequently, the counter that was started by the first error signal to be issued holds the largest count value. The determining means 6 examines the count value of each counter 4 and identifies the replacement unit 1 to which the counter 4 with the largest count value belongs as the error occurrence part.
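The reasoning of this paragraph can be written as a short relation. The symbols are introduced here for illustration and do not appear in the original: $t_i$ is the time at which the counter 4 of replacement unit $i$ starts, $T_{\mathrm{stop}}$ is the common time at which all counters stop, $f$ is the frequency of the common clock, and $c_i$ is the resulting count value. Rounding to whole clock cycles is ignored.

```latex
c_i = f\,\bigl(T_{\mathrm{stop}} - t_i\bigr),
\qquad
t_k = \min_i t_i \;\Longrightarrow\; c_k = \max_i c_i
```

Because $T_{\mathrm{stop}}$ and $f$ are the same for every counter, the ordering of the count values is simply the reverse of the ordering of the start times, so the unit that latched its error first holds the largest count.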

[0011] Mounting a plurality of replacement units 1 on the same panel board and providing the error control means 5 for each panel board makes it easy to detect the replacement unit 1 in which an error has occurred.

[0012]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 2 is an example block diagram showing the configuration of a computer that implements an embodiment of the present invention. The system consists of a main unit and input/output devices. The main unit comprises main storage MSU, extended storage SSU, a memory control unit MCU that controls them, a vector unit VU that performs vector operations, a scalar unit SU that performs scalar operations, an input/output processor IOP, and channels CH. The service processor SVP is a computer subsystem independent of the main unit and provides an operator console function, a system control function, and the like. The interface control SCI is located between the SVP and the devices that make up the computer system and controls the interface between the SVP and each device. The SVP reads and sets the state of each device via the SCI. The input/output device consists of an input/output controller IOC and an input/output unit IO.

[0013] FIG. 3 shows an example in which the component devices are mounted on separate panel boards. The CPU is mounted on board A, the memory control unit MCU on board B, the input/output processor IOP on board C, and the main storage MSU on board D. Each board carries a plurality of LSIs, which are the replacement units when an error is found, and one error control unit EC. Some LSIs on a board are connected to LSIs on other boards, and when an error occurs it propagates to the connected LSIs. The asterisk-like "米" marks in the figure show how an error occurs and propagates. The figure shows that each LSI to which the error has propagated outputs an error signal to the error control unit EC of the board to which it belongs, and that the error control unit EC starts counting. The lines connecting the ECs of the boards indicate that an EC that has received an error signal from an LSI on its own board outputs a counter stop signal that stops the counters of its own LSIs and also causes the ECs of the other boards to output counter stop signals.

[0014] FIG. 4 shows the error detection mechanism of an LSI that is mounted on a board and serves as a replacement unit. The components of an LSI can be broadly classified into gates, latches, and RAM. A register is a collection of latches, and besides registers, many other elements can be built from latches, such as counters and shifters. Error checking normally requires a check timing, so gates cannot be checked directly. RAM, by its purpose, cannot check itself either. The outputs of gates and RAM are therefore first captured in a latch and synchronized before being checked. Consequently, the elements checked for errors in an LSI are registers and latches, which are denoted by R in FIG. 4.

[0015] A checker CHK that checks a register R for errors is provided for every one or several registers R. A region code RC is also provided: a latch that takes the logical OR of the checker CHK outputs and self-holds the error detection signal detected by the checkers CHK. The region code RC holds the error detection signal until a reset signal arrives. Region code RC-B, which is the logical OR of the individual region codes RC in LSI-B, starts counter B provided in LSI-B and also outputs an error signal to the error control unit EC on the board on which LSI-B is mounted. The register R also propagates the error to the connected LSI-C.
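The self-holding behavior of a region code can be illustrated with a small model. This is a sketch under assumptions, not the circuit of FIG. 4: the class name `RegionCode` and its methods `clock` and `reset` are hypothetical, and each call to `clock` stands for one clock cycle in which the checker outputs are sampled.

```python
class RegionCode:
    """Self-holding latch fed by the OR of several checker outputs."""

    def __init__(self):
        self.latched = False

    def clock(self, checker_outputs):
        # OR of all checker outputs; once set, the latch keeps its value.
        self.latched = self.latched or any(checker_outputs)
        return self.latched

    def reset(self):
        # Only an explicit reset clears the held error.
        self.latched = False


rc = RegionCode()
states = [rc.clock(c) for c in ([False, False], [False, True], [False, False])]
print(states)      # [False, True, True]
rc.reset()
print(rc.latched)  # False
```

The third cycle shows the point of the latch: the checker outputs have returned to normal, but the error remains visible until the reset.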

[0016] For a register R, one checker CHK is normally provided per byte of signal data. A region code RC may be provided one-to-one for each checker CHK, or one region code RC may be provided for n checkers CHK. The value of n is chosen based on physical constraints, which depend on the number of gates used for the logic of the individual LSI, or on the importance of the circuit. Because checkers CHK and region codes RC do not contribute directly to the logic, it is desirable to keep them to the necessary minimum.

[0017] FIG. 5 shows the details of the error control unit EC 5 of board B. The error reception circuit 51 receives the error signals Error B, Error C, and Error X from LSI-B, LSI-C, and LSI-X on board B and outputs the first of them to arrive as a counter stop signal to its own counter stop circuit 53 and to the other boards A, C, and D. The LSI receiving unit 52 outputs a count stop signal as the logical OR of the error signals received from the individual LSIs on its own board.

[0018] The counter stop circuit 53 receives a counter stop request from its own board B or from the other boards A, C, and D and instructs the counters of LSI-B, LSI-C, and LSI-X to stop counting. The board reception delay unit 54 compensates for the differences in delay between the counter stop signal from the error reception circuit 51 of its own board and the counter stop signals from boards A, C, and D, and generates the counter stop signal that is output to the LSI transmission delay unit 55. The LSI transmission delay unit 55 outputs the counter stop signal to LSI-B, LSI-C, and LSI-X with a different compensating delay for each LSI, so that the counters of all the LSIs stop in synchronization.

[0019] The counter stop signal to each LSI has a different delay depending on its transmission path. Therefore, each counter stop signal whose delay is smaller than the maximum delay has the difference added to it. In the example of FIG. 5, if the delay to LSI-B is smaller than the delay to LSI-X by β and the delay to LSI-C is smaller by α, the counter stop signals are synchronized by adding a delay of β for LSI-B and a delay of α for LSI-C.
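The compensation rule of [0019] amounts to padding every path up to the longest one. The following sketch shows the arithmetic; the function name `compensation` and the delay values (in clock cycles) are hypothetical and chosen only so that β = 2 and α = 1.

```python
def compensation(path_delays):
    """Return the extra delay to add to each path so all match the slowest."""
    longest = max(path_delays.values())
    return {name: longest - delay for name, delay in path_delays.items()}


# Hypothetical delays in clock cycles: LSI-X is the slowest path,
# LSI-B is 2 shorter (beta = 2) and LSI-C is 1 shorter (alpha = 1).
delays = {"LSI-B": 3, "LSI-C": 4, "LSI-X": 5}
added = compensation(delays)
print(added)  # {'LSI-B': 2, 'LSI-C': 1, 'LSI-X': 0}
```

After the padding, every path has the same total delay (5 cycles in this example), so the stop signal reaches all counters in the same cycle.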

[0020] FIG. 6 shows the connections among LSI-B and LSI-C, which are replacement units on the board, and the error control unit EC 5. Counter stop requests between boards are omitted here. An error signal is sent from the region code RC of each LSI to the error reception circuit 51, and based on this error signal the counter stop circuit 53 outputs a counter stop signal to LSI-B and LSI-C. Each counter then stops counting. The counter reset circuit 56 resets the region codes RC and the counters in response to an error reset request. The SCI shown in FIG. 2 issues this request to the main unit and the input/output devices after it has finished collecting the scan-out, in which the internal states of those devices are gathered as a series of data.

[0021] FIG. 7 is a timing chart showing the operation of this embodiment. It covers the case in which an error occurs in LSI-A on board A shown in FIG. 3 and propagates to LSI-B on board B. An error occurs in register R of LSI-A, the corresponding checker CHK outputs an error detection signal, and this signal is latched in region code RC-A. This is shown as "Region Code A". At the next clock, region code RC-A issues a count start signal to counter A. This is shown as "Counter-A Start". Simultaneously with the count start signal, board A outputs error signal A, which requests a counter stop, to the other error control units EC 5. This is shown as "Error A".

[0022] Region code RC-B of LSI-B on board B receives the error detection signal, which the CHK of LSI-B issues when it detects the error propagated from LSI-A, one clock later than region code A, and starts counter B. This is shown as "Region Code B" and "Counter-B Start". Like RC-A, region code RC-B outputs an error signal, error signal B, one clock later than error signal A. This is shown as "Error B". The delay time is the time the error control unit 5 needs, after recognizing the first error signal, to synchronize and stop all counters. It is the sum of the delay values compensated by the LSI transmission delay unit 55 and the board reception delay unit 54 described with reference to FIG. 5. "Counter Reset" shows that the counters are reset by the error reset request described with reference to FIG. 6.

[0023] Thus, when an error occurs in LSI-A, region code RC-A is set and counter A starts counting up. When the error that occurred in LSI-A propagates to LSI-B and region code RC-B is also set, the counter of LSI-B starts counting up later than that of LSI-A. Then, in response to the error signal reported from region code RC-A, the error reception circuit 51 of the error control circuit 5 sends a counter stop signal to the counter stop circuit 53, which stops counter A of LSI-A and counter B of LSI-B simultaneously and freezes their count values. If an error signal arrives earlier than the one reported from region code RC-A, the error control circuit 5 acts on the earliest error signal to arrive.

[0024] Next, the machine check process takes a scan-out, and the SVP compares the frozen counter values. The count values of counter A and counter B are n and m = n - 1, respectively, so the SVP points out LSI-A as the error occurrence part. The error reset signal issued by the machine check process then resets all counters.
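The comparison in [0024] can be traced with concrete clock numbers. The sketch below is an illustration only: the function names `frozen_counts` and `point_out` and the clock values are hypothetical, and the model simply counts clock ticks between each counter's start and the common stop.

```python
def frozen_counts(start_clock, stop_clock):
    """Count clock ticks from each counter's start until the common stop."""
    return {lsi: stop_clock - start for lsi, start in start_clock.items()}


def point_out(counts):
    """Return the LSI whose frozen counter value is the largest."""
    return max(counts, key=counts.get)


# Hypothetical clock numbers: counter A starts at clock 1, counter B one
# clock later, and both are stopped together at clock 6.
counts = frozen_counts({"LSI-A": 1, "LSI-B": 2}, stop_clock=6)
print(counts)             # {'LSI-A': 5, 'LSI-B': 4}
print(point_out(counts))  # LSI-A
```

With these values n = 5 and m = n - 1 = 4, so the comparison selects LSI-A, matching the case described for FIG. 7.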

[0025] The machine check process has many functions; those relevant to this embodiment are as follows. When the scalar unit SU or the extended storage SSU recognizes that a machine check has occurred, it issues an ERROR RECORD REQUEST to the SCI, and the SCI stores the latch states of all devices in the scan memory within the SCI. ERROR RECORD END is then returned to the scalar unit SU, and an ERROR LOGOUT REQUEST is issued to the SVP via the SCI. Triggered by this, the machine check process starts the program that implements this embodiment. The program carries out the analysis using the counter values held in each LSI, taken from the collected scan-out information, and finally creates a flag code that encodes the LSI that caused the error.

[0026]

[Effects of the Invention] As is clear from the above description, the present invention provides each replacement unit with a counter that starts when an error occurs, instructs all counters to stop counting on the first error signal to arrive, and takes the replacement unit with the largest count value as the error occurrence part. The link information used in the prior art is therefore unnecessary, the work of creating and verifying it is also unnecessary, and the associated mistakes are eliminated.

[Brief explanation of drawings]

FIG. 1 is a diagram showing the principle of the present invention.

FIG. 2 is a diagram showing an example of the configuration of a computer that implements an embodiment of the present invention.

FIG. 3 is a diagram showing an example of component devices arranged on boards.

FIG. 4 is a diagram showing the error detection function of LSI-B, which is a replacement unit.

FIG. 5 is a diagram illustrating the configuration of the error control unit.

FIG. 6 is a diagram showing the connection between the LSIs, which are replacement units, and the error control unit.

FIG. 7 is a timing chart of this embodiment.

FIG. 8 is an explanatory diagram of link information.

[Explanation of symbols]

5 Error control unit
51 Error reception circuit
52 LSI receiving unit
53 Counter stop circuit
54 Board reception delay unit
55 LSI transmission delay unit
56 Counter reset circuit

Claims

[Claims]

Claim 1: An error occurrence part pointing device, characterized in that the parts constituting a computer system are composed of replaceable replacement units (1); each replacement unit (1) is provided with an error detecting means (2) that detects an error in a component and outputs an error detection signal, an error signal holding means (3) that holds the error detection signal, and a counter (4) that starts counting on a common clock in response to the error signal output from the error signal holding means (3); and the device comprises an error control means (5) that simultaneously transmits a count stop signal to each of the counters (4) based on the error signal transmitted first, and a determining means (6) that determines that an error has occurred in the replacement unit (1) having the maximum count value among the count values of the counters (4).

Claim 2: The error occurrence part pointing device according to claim 1, wherein a plurality of the replacement units (1) are mounted on the same panel board, and the error control means (5) is provided for each panel board.