JPH07120296B2

JPH07120296B2 - Error control method in hot standby system

Info

Publication number: JPH07120296B2
Application number: JP5158286A
Authority: JP
Inventors: 義則山本
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1993-06-29
Filing date: 1993-06-29
Publication date: 1995-12-20
Anticipated expiration: 2010-12-20
Also published as: JPH0713792A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an error control method in a hot standby system composed of a plurality of multiprocessor systems.

[0002]

[Prior Art] In a conventional error control method of this kind, when an instruction-retryable failure occurs, the processor relief function of the multiprocessor system avoids the error so that operation can continue without interruption, and the failed processor is disconnected from the system.

[0003]

[Problem to Be Solved by the Invention] In the conventional error control method for a hot standby system described above, suppose that a failure that triggers system switching occurs in the active system, switching is performed, and the standby system is now operating as the active system. Even then, if an instruction-retryable failure occurs, the processor relief function disconnects the logical unit in which the failure occurred. In terms of performance, this can leave the system in a state equivalent to a system down, which significantly lowers the reliability of the system.

[0004]

[Means for Solving the Problem] The error control method of the present invention is an error control method in a hot standby system in which a plurality of multiprocessor systems are connected to each other and each is operated as an active system or a standby system. Each processor constituting the multiprocessor system includes system mode holding means for holding information indicating whether the system is the active system or the standby system, and means for transferring the contents of the processor to another normal processor and taking over the processing when an instruction-retryable error occurs. When an instruction-retryable error occurs in one of the processors, the instruction is retried on the processor in which the error occurred if the content of the system mode holding means is "standby system mode", and the contents of that processor are transferred to another normal processor to continue the processing if the content is "active system mode".

[0005]

[Embodiments] Next, the present invention will be described with reference to the drawings.

[0006] Referring to FIG. 1, which shows a first embodiment of the present invention, this embodiment is a hot standby system consisting of two multiprocessor systems 1 and 2 connected by an inter-system coupling path 3, in which multiprocessor system 1 is operated as the active system and multiprocessor system 2 as the standby system.

[0007] Each of the multiprocessor systems 1 and 2 consists of two logical units (hereinafter referred to as CPUs): CPUs 10 and 11 in the active multiprocessor system 1, and CPUs 20 and 21 in the standby multiprocessor system 2. Reference numerals 12 and 22 denote the OSs that control the entire system, and 13 and 23 denote the inter-CPU coupling paths between CPUs 10 and 11 and between CPUs 20 and 21, respectively.

[0008] CPU 20 consists of an arithmetic unit 200 that controls all operations, an error detection unit 203 that detects errors, an error control unit 201 that determines the nature of an error and carries out the processor relief function, which is a well-known technique, and system mode holding means 202, which is set by the OS at start-up and holds information indicating "active system" or "standby system". The other CPUs 10, 11, and 21 have the same configuration. For convenience, therefore, the operation within a multiprocessor system in both the active and standby cases will be explained using CPUs 20 and 21.

[0009] The I/O devices that the system normally has are not shown in the figure, but they include file devices, line devices, and the like, some of which are shared within the hot standby system.

[0010] Error control in this embodiment is performed as follows.

[0011] When an error occurs in CPU 20, it is detected by the error detection unit 203, which then notifies the error control unit 201. The error control unit 201 determines whether the notified error is instruction-retryable and reads the system mode from the system mode holding means 202.

[0012] The system mode holding means 202 holds information, set at system start-up, indicating "active system mode" or "standby system mode". If the value read is "active system mode", the processor relief function reads the necessary information from CPU 20 into the other CPU 21 via the inter-CPU coupling path 23, so that CPU 21 takes over the processing of CPU 20 and operation continues uninterrupted. If the value is "standby system mode", however, the error control unit 201 does not use the processor relief function; instead it instructs the arithmetic unit 200 to retry the instruction so that operation continues.
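The choice made by the error control unit in this first embodiment can be sketched in a few lines of Python. This is only an illustration of the decision described above, not the patented hardware logic: the names `SystemMode`, `Action`, and `decide_action` are hypothetical, and the handling of non-retryable errors is left out because the text does not describe it.

```python
from enum import Enum


class SystemMode(Enum):
    ACTIVE = "active system mode"
    STANDBY = "standby system mode"


class Action(Enum):
    RETRY_ON_SAME_CPU = "retry the instruction on the failing CPU"
    PROCESSOR_RELIEF = "hand the CPU's processing over to a normal CPU"
    NOT_COVERED = "outside the scope of this sketch"


def decide_action(mode: SystemMode, retryable: bool) -> Action:
    """Mirror the decision attributed to error control unit 201 (hypothetical API)."""
    if not retryable:
        # Assumption: the description only covers instruction-retryable errors.
        return Action.NOT_COVERED
    if mode is SystemMode.ACTIVE:
        # "active system mode": use the processor relief function.
        return Action.PROCESSOR_RELIEF
    # "standby system mode": retry on the CPU where the error occurred.
    return Action.RETRY_ON_SAME_CPU
```

For example, `decide_action(SystemMode.STANDBY, True)` yields `Action.RETRY_ON_SAME_CPU`, while the same error in active mode yields `Action.PROCESSOR_RELIEF`.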

[0013] Next, when a failure that triggers system switching occurs in the active multiprocessor system 1, a down notification is sent to OS 22 of the standby multiprocessor system 2 via the inter-system coupling path 3. OS 22 takes in the resources shared with multiprocessor system 1, performs recovery processing, takes over the processing of multiprocessor system 1, and resumes system operation as the active system. At this time, OS 22 notifies the error control units 201 and 211 that the active system has gone down.

[0014] If an error occurs in CPU 20 while multiprocessor system 2 is operating in this state, error processing within the multiprocessor system is performed as described above. Because "standby system mode" is set in the system mode holding means 202 at this point, the error control unit 201 does not use the processor relief function and instructs the arithmetic unit 200 to retry the instruction.
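The behaviour after switchover can be illustrated with a small simulation. It assumes, as the text states, that the system mode value was set at start-up and reads "standby system mode" when the error occurs after the takeover. The `Cpu` class, the `peer_system_down` flag, and both function names are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    name: str
    mode_register: str              # set by the OS at start-up: "active" or "standby"
    peer_system_down: bool = False  # hypothetical flag for the OS notification


def take_over(cpus: list[Cpu]) -> None:
    """Model OS 22 resuming operation as the active system."""
    for cpu in cpus:
        # The OS tells each error control unit that the active system went down.
        cpu.peer_system_down = True
        # Assumption taken from the text: the mode register is not rewritten here.


def on_retryable_error(cpu: Cpu) -> str:
    """First-embodiment decision, driven only by the mode register."""
    if cpu.mode_register == "active":
        return "processor relief"
    return "instruction retry"


system2 = [Cpu("CPU 20", "standby"), Cpu("CPU 21", "standby")]
take_over(system2)
result = on_retryable_error(system2[0])  # CPU 20 retries instead of being cut off
```

The point of the design is that the former standby system, now carrying the whole load, keeps both CPUs in service when a retryable error occurs.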

[0015] Referring to FIG. 2, which shows a second embodiment of the present invention, this embodiment adds the following to the first embodiment: timer circuits 207 and 217, which keep time and output a signal at fixed intervals; error counting means 204 and 214, which count the number of errors within that fixed interval; threshold value holding means 205 and 215, which hold a threshold for the number of errors within the fixed interval; and monitoring means 206 and 216, which compare the contents of the error counting means 204, 214 with those of the threshold value holding means 205, 215.

[0016] This embodiment differs from the first embodiment in how errors are controlled after a failure that brings the system down has occurred in the active multiprocessor system 1 and the standby multiprocessor system 2 has come to be operated as the active system.

[0017] That is, when an error occurs in CPU 20 of the current active system (the former standby system) and is detected by the error detection unit 203, the error control unit 201 is notified. The error control unit 201 determines whether the error is instruction-retryable and then reads the content of the system mode holding means 202. Since that content is "standby system mode", error control proceeds as follows.

[0018] First, the error control unit 201 checks whether it has received a notification from the monitoring means 206. The monitoring means 206 reads the contents of the threshold value holding means 205 and the error counting means 204 and compares them; if the content of the error counting means 204 is the larger, it judges that errors are occurring frequently and notifies the error control unit 201.
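The comparison performed by the monitoring means can be sketched as follows. The class and method names are hypothetical, and one detail is an assumption not stated in the text: the timer signal is taken to clear the error count at the start of each fixed interval.

```python
class ErrorMonitor:
    """Sketch of monitoring means 206 together with the parts it reads."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # threshold value holding means 205
        self.error_count = 0        # error counting means 204

    def record_error(self) -> None:
        """Count one error inside the current fixed interval."""
        self.error_count += 1

    def timer_tick(self) -> None:
        """Signal from timer circuit 207.

        Assumption: the signal starts a new interval by clearing the count.
        """
        self.error_count = 0

    def errors_are_frequent(self) -> bool:
        """True when the count exceeds the threshold, as described above."""
        return self.error_count > self.threshold


monitor = ErrorMonitor(threshold=2)
monitor.record_error()
monitor.record_error()
below = monitor.errors_are_frequent()   # count equals threshold: no notification
monitor.record_error()
above = monitor.errors_are_frequent()   # count exceeds threshold: notify unit 201
monitor.timer_tick()
after_tick = monitor.errors_are_frequent()
```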

[0019] If there has been no notification from the monitoring means 206, the error control unit 201 judges that errors are not occurring frequently, instructs the arithmetic unit 200 to retry the instruction, and processing continues on CPU 20. If the monitoring means 206 has sent a notification, however, the processor relief function hands the processing of CPU 20 over to CPU 21, which continues operation.
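Putting the second embodiment together, the sequence of actions for repeated retryable errors inside one timer interval might look like the sketch below. The function name and the string labels are hypothetical. Two assumptions are made: the error count is incremented before the comparison, and the comparison is the strict "count larger than threshold" test from the description.

```python
def handle_errors(mode: str, errors_in_window: int, threshold: int) -> list[str]:
    """Return the action chosen for each successive retryable error.

    mode is "standby" or "active" (the value in the system mode holding means).
    """
    actions = []
    count = 0
    for _ in range(errors_in_window):
        count += 1                      # assumption: count first, then compare
        notified = count > threshold    # monitoring means reports frequent errors
        if mode == "standby" and not notified:
            actions.append("retry")     # retry on the CPU where the error occurred
        else:
            actions.append("relief")    # hand processing over to the other CPU
    return actions


standby_actions = handle_errors("standby", errors_in_window=3, threshold=2)
active_actions = handle_errors("active", errors_in_window=2, threshold=2)
```

With a threshold of 2, the first two errors in standby mode are retried in place and the third triggers processor relief, while in active mode every retryable error goes straight to processor relief.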

[0020] According to this embodiment, even when a retryable error occurs while the standby system is being operated as the active system, the error is recovered without immediately performing processor relief as error control, so a higher-performance and more reliable system can be realized in hot standby operation.

[0021]

[Effect of the Invention] As described above, in a hot standby system whose operational system load is heavy, when the active system has gone down, the standby system is being operated as the active system, and a retryable error then occurs, the present invention retries the instruction on the processor in which the error occurred, and thereby has the effect of realizing a high-performance, highly reliable system.

[Brief description of drawings]

FIG. 1 is a system configuration diagram of a first embodiment of the present invention.

FIG. 2 is a system configuration diagram of a second embodiment of the present invention.

[Explanation of symbols]

1, 2: Multiprocessor system
3: Inter-system coupling path
10, 11, 20, 21: CPU
13, 23: Inter-CPU coupling path
12, 22: Operating system
200, 210: Arithmetic unit
201, 211: Error control unit
202, 212: System mode holding means
203, 213: Error detection unit
204, 214: Error counting means
205, 215: Threshold value holding means
206, 216: Monitoring means
207, 217: Timer circuit

Claims

[Claims]

1. An error control method in a hot standby system in which a plurality of multiprocessor systems are connected to each other and each is operated as an active system or a standby system, wherein each processor constituting the multiprocessor system includes system mode holding means for holding information indicating whether the system is the active system or the standby system, and means for transferring the contents of the processor to another normal processor and taking over the processing when an instruction-retryable error occurs, and wherein, when an instruction-retryable error occurs in one of the processors, the instruction is retried on the processor in which the error occurred if the content of the system mode holding means is "standby system mode", and the contents of the processor in which the error occurred are transferred to another normal processor to continue the processing if the content is "active system mode".

2. The error control method in a hot standby system according to claim 1, wherein each processor of the multiprocessor system is further provided with a timer circuit that outputs a signal at fixed time intervals, error counting means for counting the number of errors occurring within the fixed time, threshold value holding means for holding an upper limit on the number of errors occurring within the fixed time, and monitoring means for comparing the contents of the error counting means and the threshold value holding means, and wherein, when an instruction-retryable error occurs in one of the processors, the instruction is retried on the processor in which the error occurred only if the content of the system mode holding means is "standby system mode" and the content of the error counting means is smaller than the content of the threshold value holding means, and otherwise the contents of the processor in which the error occurred are transferred to another normal processor to continue the processing.