JP2814988B2

JP2814988B2 - Failure handling method

Info

Publication number: JP2814988B2
Application number: JP8115655A
Authority: JP
Inventors: 健一鈴木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-04-12
Filing date: 1996-04-12
Publication date: 1998-10-27
Anticipated expiration: 2016-04-12
Also published as: JPH09282191A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a failure handling method, and more particularly to a failure handling method for use when a failure occurs in a system control unit (hereinafter "SCU").

[0002]

[Prior art] Conventionally, in this type of failure handling method, when a failure occurs in a part of the SCU that is equipped with an error correction circuit, processing is continued, which improves the reliability of the information processing apparatus.

[0003] However, when the failure is a fixed (permanent) failure, it recurs frequently and causes performance degradation or an overflow of failure information. For this reason, in an information processing apparatus having a plurality of SCUs, when an intermittent failure that still allows continued operation occurs in an SCU, the diagnostic processing unit (hereinafter "DGP") counts the occurrences. When the count reaches a predetermined number within a fixed period of time (hereinafter "count over"), the DGP regards the failure as a fixed failure and disconnects the SCU, and an attempt is made to continue the processes that were running under that SCU by means of processor relief (rescue processing). In doing so, the DGP disconnected the SCU regardless of the operating state of the arithmetic processing units (hereinafter "EPUs") connected under it.

[0004]

[Problems to be solved by the invention] As described above, in the conventional failure handling method the faulty SCU is disconnected regardless of the operating state of the EPUs under it. Consequently, if the SCU is disconnected while an EPU under it is in a non-retryable section of a software instruction, the process running on that EPU cannot be retried, processor relief becomes impossible, and a process abort or a system crash results.

[0005] FIG. 5 shows the software instruction execution flow of an EPU under such conventional failure handling. As shown in FIG. 5, a software instruction generally has a retryable section and a non-retryable section: it is retryable when it starts and becomes non-retryable once it, for example, updates a shared resource.

[0006] When an EPU is disconnected in a retryable section, as indicated by "A" in FIG. 5, the "software instruction 2" being executed can be retried. Processor relief to a healthy EPU is therefore possible, and the process that was running on the disconnected EPU can continue.

[0007] However, when the EPU is disconnected in a non-retryable section, as indicated by "B" in FIG. 5, processor relief is impossible, so the process that was running on the EPU cannot continue and is aborted. If that process belongs to the core of the operating system (the kernel), a system crash results.
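The distinction drawn in FIG. 5 can be illustrated with a small model. The following Python sketch is not part of the patent: the step list, the names `retryable_at` and `disconnect_outcome`, and the rule that an instruction stays retryable until its first shared-resource update are assumptions made for illustration only.

```python
from typing import List, Tuple

# One software instruction modelled as (step name, updates a shared resource?) pairs.
# The concrete steps are hypothetical.
Step = Tuple[str, bool]

INSTRUCTION_2: List[Step] = [
    ("fetch operands", False),
    ("compute result", False),
    ("update shared resource", True),
    ("finish", False),
]


def retryable_at(steps: List[Step], position: int) -> bool:
    """True if no shared resource has been updated before steps[position]."""
    return not any(updates for _, updates in steps[:position])


def disconnect_outcome(retryable: bool, is_kernel: bool) -> str:
    """Outcome of a conventional disconnection at a given point."""
    if retryable:
        return "processor relief"
    return "system crash" if is_kernel else "process abort"


# Point "A" lies before the shared update, point "B" after it.
point_a = disconnect_outcome(retryable_at(INSTRUCTION_2, 2), is_kernel=False)
point_b = disconnect_outcome(retryable_at(INSTRUCTION_2, 3), is_kernel=False)
```

In this model a disconnection at "A" leads to processor relief, while the same disconnection at "B" leads to a process abort, or to a system crash if the process is part of the kernel.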

[0008] Therefore, if, in the case of an intermittent SCU failure, disconnection of the SCU is deferred until the software instruction in execution has finished, processor relief becomes possible for the EPU. The process is then handed over to a healthy EPU by processor relief, and operation continues without aborting the process.

[0009] The present invention has been made in view of the above circumstances. Its object is to provide a failure handling method that improves the reliability of an information processing apparatus by allowing operation to continue, without aborting the processes being executed by the EPUs, when an SCU is disconnected because of a count over of intermittent failures.

[0010]

[Means for solving the problems] To achieve the above object, the failure handling method of the present invention is characterized in that, when a faulty system control unit (hereinafter "SCU") is to be disconnected because intermittent failures have occurred in it a predetermined number of times, a software instruction interruption request is sent to all arithmetic processing units (hereinafter "EPUs") under the faulty SCU so as to wait until the software instruction being executed in each EPU finishes; the SCU is disconnected only after all EPUs under the faulty SCU have been placed in a retryable state; and processor relief processing is then performed for the processes that were being executed by the EPUs under the SCU.

[0011] An outline of the present invention follows. According to the present invention, when EPUs are disconnected as a consequence of an SCU being disconnected because of a count over of intermittent SCU failures, the processes that were running on all EPUs under that SCU reliably continue to run.

[0012] More specifically, when a count over of intermittent SCU failures occurs, the diagnostic processing unit (DGP) (reference numeral 7 in FIG. 1) does not disconnect the EPUs immediately. The invention comprises: a software instruction interruption request mechanism (reference numeral 12 in FIG. 1) for waiting until the software instructions being executed by the EPUs under the faulty SCU finish; a software instruction interruption interrupt mechanism (reference numeral 22 in FIG. 1) that holds the software instruction interruption request; and a software instruction interruption mechanism (reference numeral 26 in FIG. 1) that picks up the held request between software instructions, sends a software instruction interruption completion notification to the DGP (reference numeral 7 in FIG. 1), and then waits in a retryable state for the EPU to be disconnected.

[0013] The DGP does not disconnect the EPUs until all EPUs under the faulty SCU have become retryable.

[0014] That is, according to the present invention, when an SCU is disconnected because of a count over of intermittent failures, each EPU is disconnected only after the interruption of the software instruction it is executing has been awaited and the EPU has been placed in a retryable state. This allows operation to continue without aborting the process. As a result, serious damage such as a system down caused by an SCU hardware failure can be effectively prevented.

[0015]

[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the overall system configuration of an embodiment of the present invention, and FIG. 2 shows FIG. 1 in detail.

[0016] Referring to FIGS. 1 and 2, the arithmetic processing units (EPUs) 1, 2, 3, and 4 sequentially execute processes composed of software instructions. The system control processing units (SCUs) 5 and 6 accept memory requests and the like from the EPUs 1, 2, 3, and 4 and from input/output devices (also called "IOPs", not shown), and read from and write to a main storage device (also called "MMU", not shown). The EPUs 1 and 2 are connected to the SCU 5, and the EPUs 3 and 4 are connected to the SCU 6.

[0017] The diagnostic processing unit (DGP) 7 has failure detection mechanisms for the EPUs, IOPs, SCUs, MMU, and so on, and disconnects a device when it detects a failure in that device.

[0018] The failure indication mechanisms 15 and 16 in the SCUs 5 and 6 are flags that indicate the failure type and failure location when a failure occurs in their own SCU. The diagnostic path 30 is the path over which SCU failure information is transferred to the DGP 7.

[0019] The SCU failure detection mechanism 8 in the DGP 7 is provided for each SCU. It monitors the flags of the failure indication mechanism 15 or 16 of the corresponding SCU via the diagnostic path 30 and, when the failure indication mechanism 15 and/or 16 turns on, determines the failure type. If it recognizes an intermittent failure, it asserts its output to trigger the failure counter 9.

[0020] The failure counter 9 is provided for each SCU. Each time it is triggered, the counter value is incremented by 1, and this value is applied to one input of the comparator 10. The threshold 11 holds a preset value that determines how many intermittent SCU failures within a fixed period of time cause the SCU to be disconnected, and it is applied to the other input of the comparator 10.

[0021] The comparator 10 is also provided for each SCU. It compares the threshold 11 with the value of the failure counter 9 and, when the two are equal, asserts its output to start the software instruction interruption request mechanism 12.

[0022] When started by the comparator 10, the software instruction interruption request mechanism 12 sends a software instruction interruption request message over the communication path 21 to all EPUs under the faulty SCU. When it has received the response, a software instruction interruption completion message, from every EPU under the faulty SCU, it starts the processor disconnection mechanism 13.
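The wait-for-all behaviour of the request mechanism 12 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the patented implementation: the class name `InterruptionRequestMechanism`, the callbacks `send_request` and `disconnect`, and the use of a set to track outstanding EPUs are all hypothetical.

```python
from typing import Callable, Iterable, Set


class InterruptionRequestMechanism:
    """Toy model of mechanism 12: request, then wait for every EPU to answer."""

    def __init__(
        self,
        epus_under_scu: Iterable[str],
        send_request: Callable[[str], None],   # stands in for communication path 21
        disconnect: Callable[[], None],        # stands in for disconnection mechanism 13
    ) -> None:
        self.waiting_for: Set[str] = set(epus_under_scu)
        self._send_request = send_request
        self._disconnect = disconnect
        self.disconnected = False

    def start(self) -> None:
        """Send the interruption request to every EPU under the faulty SCU."""
        for epu in sorted(self.waiting_for):
            self._send_request(epu)

    def on_completion(self, epu: str) -> None:
        """Record one completion message; disconnect once none are outstanding."""
        self.waiting_for.discard(epu)
        if not self.waiting_for and not self.disconnected:
            self.disconnected = True
            self._disconnect()
```

The point of the sketch is that the disconnection callback fires exactly once, and only after the last outstanding EPU has reported completion.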

[0023] The communication path 21 is a general-purpose communication path connecting the EPUs 1, 2, 3, and 4 with the DGP 7.

[0024] Taking the EPU 1 as an example, the communication processing mechanism 17 determines the type of each message sent from the DGP 7 over the communication path 21 and, when it recognizes a software instruction interruption request message, asserts its output to start the software instruction interruption interrupt mechanism 22. It also outputs to the communication path 21 the software instruction interruption completion message produced by the software instruction interruption mechanism 26.

[0025] When started, the software instruction interruption interrupt mechanism 22 holds the software instruction interruption request from the DGP 7.

[0026] After the software instruction currently being executed has finished, if a software instruction interruption request is held in the software instruction interruption interrupt mechanism 22, the software instruction interruption mechanism 26 sends a software instruction interruption completion notification to the communication processing mechanism 17 and then enters an idle loop without executing the next software instruction.

[0027] The processor disconnection mechanism 13 disconnects the SCU and also disconnects the EPUs whose disconnection is entailed by that of the SCU.

[0028] The processor relief mechanism 14 hands over the information frozen in a disconnected EPU, such as its software-visible registers (registers accessible by software instructions), to another healthy EPU so that the process continues to run.
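As a rough illustration of processor relief, the following Python sketch hands a frozen register set to a healthy EPU. The text does not say how the receiving EPU is chosen or how it schedules the inherited process, so the `EpuState` type, the per-EPU `contexts` list, and the "fewest contexts" selection rule are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EpuState:
    name: str
    healthy: bool = True
    # Software-visible register sets of the processes assigned to this EPU.
    contexts: List[Dict[str, int]] = field(default_factory=list)


def processor_relief(frozen: Dict[str, int], epus: List[EpuState]) -> EpuState:
    """Copy the frozen registers to the healthy EPU with the fewest contexts."""
    healthy = [e for e in epus if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy EPU available for processor relief")
    target = min(healthy, key=lambda e: len(e.contexts))
    target.contexts.append(dict(frozen))   # copy so the frozen state stays untouched
    return target
```

For example, with EPU 1 marked unhealthy, EPU 3 already holding one context, and EPU 4 holding none, the frozen state of EPU 1 would be handed to EPU 4.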

[0029] Next, the operation of the embodiment of the failure handling method for the information processing apparatus shown in FIG. 1 is described.

[0030] When a failure occurs inside the SCU 5, the SCU failure detection mechanism 8 in the DGP 7 recognizes from the failure indication mechanism 15 turning on that a failure has occurred in the SCU 5. It analyzes the failure type and, if it judges the failure to be intermittent, increments the failure counter 9.

[0031] When the value of the failure counter 9 equals the threshold 11, that is, when the failure is judged to be a fixed failure, the comparator 10 asserts its output to start the software instruction interruption request mechanism 12.

[0032] The software instruction interruption request mechanism 12 sends a software instruction interruption request message over the communication path 21 to the EPU 1 and the EPU 2 under the faulty SCU 5. Taking the EPU 1 as an example, the message is received by the communication processing mechanism 17, which, on determining that it is a software instruction interruption request message, starts the software instruction interruption interrupt mechanism 22.

[0033] Once the software instruction interruption interrupt mechanism 22 has been started and holds the request, the software instruction interruption mechanism 26 waits for the software instruction currently being executed to finish, sends a software instruction interruption completion message to the communication processing mechanism 17, and then enters an idle loop without executing the next software instruction.

[0034] An EPU disconnected in this idle loop state is in a retryable state, as at "A" in FIG. 5, so processor relief always succeeds and the process is handed over to a healthy EPU. The software instruction interruption completion message is delivered to the DGP 7 through the communication processing mechanism 17 and the communication path 21.

[0035] The EPU 2 operates in the same way as the EPU 1.

[0036] When the DGP 7 has received software instruction interruption completion messages from all the EPUs 1 and 2 under the faulty SCU 5, it starts the processor disconnection mechanism 13 and disconnects the faulty SCU 5.

[0037] When the faulty SCU 5 is disconnected, the EPUs 1 and 2 under it are disconnected as well, and the processor relief mechanism 14 hands the processes that were running on the EPUs 1 and 2 over to the healthy EPU 3 or EPU 4 (processor relief, or rescue processing). Because the EPUs 1 and 2 are both disconnected while in the idle loop state, as described above, processor relief always succeeds.

[0038] FIG. 6 shows how, with the method of this embodiment, the disconnection at "B" in FIG. 5 is moved from a non-retryable section to a retryable section.

[0039] In this way, when an SCU is disconnected because of a count over of intermittent failures, waiting for the software instruction being executed in each EPU to finish and disconnecting the EPU in a retryable state allows operation to continue without aborting the process.

[0040]

[Example] To describe the above embodiment in more detail, an example of the present invention is described with reference to the drawings.

[0041] FIG. 3 is a block diagram showing the configuration of an information processing apparatus to which a failure handling method according to one example of the present invention is applied, and FIG. 4 shows FIG. 3 in detail.

[0042] Referring to FIGS. 3 and 4, when a failure occurs in the correctable sub-block (1) 40 inside the SCU 5, the EIF (error indication flip-flop) 15 is set.

[0043] The EIF 15 is fed via the diagnostic path 30 to the SCU failure detection mechanism 8 in the DGP 7. The detection mechanism 8 recognizes that an intermittent failure has occurred in the sub-block (1) 40 of the SCU 5 and increments the failure counter 9. The threshold 37, which is the number of intermittent SCU failures after which the failure is regarded as a fixed failure, is stored in the system setting information 35, for example on a disk of the service processor, and is loaded into the threshold storage register 11 when, for example, the information processing apparatus is started up.

[0044] For example, to have the failure regarded as a fixed failure when intermittent failures occur three times within one hour, the threshold 37 in the system setting information 35 is set to "3" and the timer value 36 to "1 hour". The timer 38 is a down-counting timer and is configured to clear the failure counter 9 to "0" when its value reaches "0".

[0045] With this setting, "3" is stored in the threshold storage register 11, and the failure counter 9 holds the number of failures, incremented each time an intermittent failure occurs. When its two inputs are equal, that is, when three intermittent failures have occurred in the same SCU within one hour, the comparator 10 asserts its output signal to start the software instruction interruption request mechanism 12.
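The counter, timer, and comparator described for the "three failures within one hour" setting can be modelled in a few lines. The Python sketch below is illustrative only: the class name `FailureCounter` and its methods are hypothetical, and the assumption that the down-counting timer reloads itself after reaching zero is not stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCounter:
    """Toy model of failure counter 9, timer 38 and comparator 10 for one SCU."""

    threshold: int = 3            # threshold 37 loaded into register 11
    window_seconds: int = 3600    # timer value 36 ("1 hour")
    count: int = 0                # failure counter 9
    remaining: int = field(init=False)   # timer 38, counting down

    def __post_init__(self) -> None:
        self.remaining = self.window_seconds

    def tick(self, seconds: int) -> None:
        """Advance the timer; on reaching zero, clear the counter (and reload: assumed)."""
        self.remaining -= seconds
        if self.remaining <= 0:
            self.count = 0
            self.remaining = self.window_seconds

    def record_intermittent_failure(self) -> bool:
        """Increment the counter; True when the comparator sees count == threshold."""
        self.count += 1
        return self.count == self.threshold
```

With the default values, the third call to `record_intermittent_failure` without an intervening timer expiry returns `True`, which corresponds to the comparator output that starts the request mechanism 12.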

[0046] The software instruction interruption request mechanism 12, which is implemented as the diagnostic control software 39, sends a software instruction interruption message to the EPUs 1 and 2 under the faulty SCU 5 and defers disconnection of the faulty SCU 5 until software instruction interruption completion messages have been returned from the EPUs 1 and 2.

[0047] The software instruction interruption request message sent by the DGP 7 is delivered to the EPU 1 and the EPU 2 via the communication path 21.

[0048] Taking the EPU 1 as an example, when the communication processing mechanism 17 determines that an arriving message is a software instruction interruption request message, it sets the software instruction interruption indication flag 31 of the software instruction interruption interrupt mechanism 22.

[0049] The software instruction interruption indication flag 31 is also one of the request handler interrupt flags 32 and is one of the causes for which the request handler (request processing routine) 34 of the control firmware 33 is invoked.

[0050] The request handler 34 is a part of the control firmware 33. When any of the request handler interrupt flags 32 is set, it identifies the cause between software instructions, performs the processing specified by the control firmware prepared for that cause, and then returns control to software execution.
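The dispatch performed by the request handler 34 can be pictured as a loop over a set of cause flags. The Python sketch below is a simplified illustration: the function name `run_request_handler`, the flag names, and the decision to clear each flag after handling are assumptions, since the text does not specify them.

```python
from typing import Callable, Dict, List


def run_request_handler(
    flags: Dict[str, bool],
    routines: Dict[str, Callable[[], None]],
) -> List[str]:
    """Run the routine for every set flag and return the causes handled.

    Intended to be called between software instructions. Clearing the flag
    after handling is an assumption of this sketch.
    """
    handled: List[str] = []
    for cause in list(flags):
        if flags[cause]:
            routines[cause]()
            flags[cause] = False
            handled.append(cause)
    return handled
```

A caller would register one routine per cause, for example a hypothetical `"instruction_interruption"` entry that sends the completion notification, and invoke `run_request_handler` at each instruction boundary.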

[0051] When the software instruction interruption indication flag 31, one of the request handler interrupt flags 32, is set, the request handler 34 is invoked between software instructions as described above, and control passes, in a retryable state, to the software instruction interruption completion notification processing 26, which is part of the control firmware 33.

[0052] To notify the DGP 7, which is waiting to disconnect the faulty SCU 5, that software instruction interruption has completed in an EPU under the faulty SCU, the software instruction interruption completion notification processing 26 issues a software instruction interruption completion message to the DGP 7 via the communication processing mechanism 17. It then enters an idle loop, which is a retryable state, and waits for the DGP 7 to disconnect the EPU.

[0053] As described above, when the software instruction interruption request mechanism 12 has received software instruction interruption completion messages from all of the EPUs 1 and 2 under the faulty SCU 5, it starts the processor disconnection mechanism 13, which disconnects the faulty SCU 5 and, as a consequence, freezes the software-visible registers and other state of the subordinate EPUs 1 and 2 and disconnects them.

[0054] The processor relief mechanism 14 hands the frozen contents of the disconnected EPUs 1 and 2 over to the healthy EPU 3 or EPU 4, and the processes continue to run.

[0055] In the above example, the information processing apparatus contains two SCUs and two EPUs are connected under each SCU, but it goes without saying that the present invention applies equally when there are two or more of each.

[0056]

[Effects of the invention] As described above, according to the present invention, operation can continue without aborting processes when an SCU is disconnected because of a count over of intermittent failures, so that serious damage such as a system down caused by an SCU hardware failure can be effectively prevented.

[0057] This is because, in the present invention, each EPU is disconnected only after the interruption of the software instruction it is executing has been awaited and the EPU has been placed in a retryable state.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a diagram showing FIG. 1 in detail.

FIG. 3 is a block diagram showing an example of the present invention.

FIG. 4 is a diagram showing FIG. 3 in detail.

FIG. 5 is a diagram showing the software instruction execution flow under the conventional failure handling method.

FIG. 6 is a diagram showing the software instruction execution flow under the failure handling method of the present invention.

[Explanation of symbols]

1, 2, 3, 4: Arithmetic processing unit (EPU)
5, 6: System control processing unit (SCU)
7: Diagnostic processing unit (DGP)
8: SCU failure detection mechanism
9: Failure counter
10: Comparator
11: Threshold (threshold storage register)
12: Software instruction interruption request mechanism
13: Processor disconnection mechanism
14: Processor relief mechanism
15, 16: Failure indication mechanism (EIF)
17, 18, 19, 20: Communication processing mechanism
21: Communication path
22, 23, 24, 25: Software instruction interruption interrupt mechanism
26, 27, 28, 29: Software instruction interruption mechanism
30: Diagnostic path
31: Software instruction interruption indication flag
32: Request handler interrupt flag
33: Control firmware
34: Request handler
35: System setting information
36: Timer value
37: Threshold
38: Timer
39: Control software
40: Sub-block (1)
41: Sub-block (2)
42: Sub-block (3)
43: Sub-block (4)
44: Sub-block (5)
45: Sub-block (6)

Continued from the front page: (58) Fields searched (Int.Cl.⁶, DB name): G06F 11/14 310, G06F 11/20 310

Claims

(57) [Claims]

1. A failure handling method, characterized in that, when a faulty system control unit (hereinafter "SCU") is to be disconnected because intermittent failures have occurred in the SCU a predetermined number of times, a software instruction interruption request is sent to all arithmetic processing units (hereinafter "EPUs") under the faulty SCU so as to wait until the software instruction being executed in each EPU is completed; the SCU is disconnected after all the EPUs under the faulty SCU have been placed in a retryable state; and processor relief processing is performed for the processes that were being executed by the EPUs under the SCU.

2. A failure handling method, characterized by comprising a diagnostic processing unit that monitors SCU failures and, when the count value of a counter that counts intermittent failures of an SCU exceeds a predetermined threshold, sends a software instruction interruption request to all EPUs under the faulty SCU in order to wait until the software instructions being executed in all EPUs under the faulty SCU are completed and the EPUs become retryable; wherein each EPU, on the basis of the software instruction interruption request, notifies the diagnostic processing unit of software instruction interruption completion after the software instruction being executed has finished, and prepares for disconnection in a retryable state by performing an idle loop without executing the subsequent software instruction; and wherein, when all the EPUs under the faulty SCU have become retryable, the diagnostic processing unit disconnects the faulty SCU and the EPUs under it and performs control so that the processes that were being executed by the EPUs under the faulty SCU continue to be executed by a healthy EPU.