JP2005084947A

JP2005084947A - Fault recovery system

Info

Publication number: JP2005084947A
Application number: JP2003316174A
Authority: JP
Inventors: Keiji Fukawa; 恵司普川; Sueo Seto; 末男瀬戸; Shinichi Mihashi; 伸一三橋
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-09-09
Filing date: 2003-09-09
Publication date: 2005-03-31

Abstract

PROBLEM TO BE SOLVED: To provide a fault recovery system which, if some of the redundant parts are disconnected from a computer system during maintenance work in a maintenance mode, notifies maintenance personnel of the fault before customer operation begins, so that the computer system can be delivered to the customer free of faults.

SOLUTION: A computer system in which instruction processors, input/output processors, power supply units and the like are made redundant has a means for recording faults in each of these redundant parts. During maintenance, the recorded fault information is reported at system initialization, and system initialization is terminated abnormally.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a failure processing method for a computer system whose CPU is internally redundant, and more particularly to a failure processing method for the disconnection of a redundant part that occurs during maintenance work.

In recent years, computer systems have been required to operate 24 hours a day, and redundant computer systems are becoming mainstream. In such a redundant system, the system does not stop when a redundant part is disconnected because of a failure. However, if a further failure occurs before the first one has been repaired, the system may stop.

The computer system has means for switching between maintenance work and customer operation: maintenance work is performed in maintenance mode, and customer operation is performed in customer mode. Because maintenance personnel work near the computer system during maintenance mode, a failure in one of the redundant parts is not reported to the maintenance center. Consequently, if the maintenance personnel overlook the failure, the computer system is handed over to the customer with the failure still present.

Suppose a redundant part is disconnected because of a failure during maintenance of the computer system. An error code reporting the failure (hereinafter referred to as RC) is shown on the console device, but if the maintenance personnel overlook the RC for some reason, the system is started up and put into customer operation with a failure present in one of the redundant parts.

Patent Document 1, Patent Document 2, and Patent Document 3 are prior art related to the present invention.

Patent Document 1: JP 2002-288000 A
Patent Document 2: JP 2001-5692 A
Patent Document 3: JP 2002-32351 A

As described above, conventional computer systems take no measures to prevent oversights by maintenance personnel during maintenance work.

An object of the present invention is to provide a failure processing method which, when one of the redundant parts is disconnected from the system during maintenance work in maintenance mode, notifies the maintenance personnel of the failure before customer operation wherever possible, so that the computer system can be handed over to the customer free of failures.

The system has means for recording failures that occur in the computer system, means for displaying the recorded contents on the console device so that the faulty part can be identified, and means for distinguishing maintenance mode from customer mode (switching between the two modes is done manually). When a failure disconnects one of the redundant parts from the computer system while the system is in maintenance mode, the disconnection is recorded in a failure registration file.
Thereafter, in maintenance mode, the failure registration file is consulted at system initial setting. If a failure is recorded, system initial setting is terminated abnormally and the maintenance personnel are notified that a failure has occurred, which prevents the system from being handed over to the customer with a failure present.
A failure of a redundant part that occurs after system initial setting has ended normally but before the switch to customer mode is handled as follows: after the switch to customer mode, the failure registration file is consulted periodically, and if a failure is recorded, the maintenance center is informed. The failure registration file is attached to the report, and the maintenance center analyzes it to identify the faulty part.

The failure processing method of the present invention can prevent the system from going down as a result of a failure in one of the redundant parts that occurred during maintenance work.

The concept of the failure processing method of the present invention will be described with reference to FIG. 1.

A failure of redundant part A that occurs at time t1 in maintenance mode is reported by the system initial setting process at time t2. A failure of redundant part B that occurs at time t3 in maintenance mode is reported by the periodic redundant part check process at time t4, after the switch to customer mode. A failure of redundant part C that occurs at time t5 in customer mode is reported as soon as it occurs. Even if system initial setting is performed at time t6 while the failure of redundant part C is still unrepaired, the system starts up normally. If the failure of redundant part C remains unrepaired for a long time, it is reported again by the periodic redundant part check process at time t7.
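The three cases in FIG. 1 reduce to one decision about when a failure is first reported. The following Python sketch illustrates that decision only; the function name, its arguments, and the returned strings are hypothetical and do not appear in the patent.

```python
def first_notification(mode_at_failure: str, before_initial_setting: bool) -> str:
    """Return the point at which a redundant-part failure is first reported.

    mode_at_failure        -- "maintenance" or "customer" when the failure occurs
    before_initial_setting -- True if system initial setting still lies ahead
                              within the same maintenance session
    """
    if mode_at_failure == "customer":
        # Part C in FIG. 1: reported together with the occurrence.
        return "reported when the failure occurs"
    if before_initial_setting:
        # Part A in FIG. 1: caught by the system initial setting process.
        return "system initial setting"
    # Part B in FIG. 1: caught after the switch to customer mode.
    return "periodic redundant part check"


for part, mode, before in [
    ("A", "maintenance", True),
    ("B", "maintenance", False),
    ("C", "customer", False),
]:
    print(part, "->", first_notification(mode, before))
```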

For a computer system in which instruction processors, input/output processors, power supply units and the like are redundant, an embodiment of the failure processing method for the case where a failure during maintenance work disconnects one of the redundant parts from the system will be described with reference to the drawings, divided into the processing during maintenance work and the processing during customer operation.

FIG. 2 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention. Broadly, the system consists of F-SVP (100), C-SVP (200), PU (300 to 302), HDD (400 to 401), PS (500 to 501), CHP (600 to 601), IOP (700), CH (800 to 803), SC (900), SC I/F (1000 to 1001), and a computer system mode switch (1100). Of these, PU (300 to 302), HDD (400 to 401), PS (500 to 501), CHP (600 to 601), and SC I/F (1000 to 1001) are redundant.

As shown in FIG. 2, the PU (300 to 302), HDD (400 to 401), PS (500 to 501), CHP (600 to 601), and SC I/F (1000 to 1001) devices are redundant in the computer system of this embodiment, so that if a failure occurs in an active device, operation can continue on the backup device.

The computer system of this embodiment has a switch for changing between maintenance work and customer operation (hereinafter referred to as the operation changeover switch), which is operated manually. The state in which the operation changeover switch is set to the customer operation side is called customer mode, and the state in which it is set to the maintenance work side is called maintenance mode. The state of the switch can be read by the software embedded in the service processor. Depending on that state, redundant parts are checked for failures at system initial setting during maintenance mode, and periodically during customer mode.

When a failure occurs in a redundant part of the computer system, the service processor recognizes the failure at the timing shown in Table 1 and registers it in the failure registration file of FIG. 3. The failure registration file resides on the HDD (400 to 401) shown in FIG. 2. For example, when a failure is detected in AP0 (301) of FIG. 2, the content of address 01 of the failure registration file shown in FIG. 3 is changed from 00 (no failure) to FF (failure present).
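The registration step can be pictured as a write into a small byte-addressed table. The Python sketch below is an illustration under assumptions: only the mapping of AP0 (301) to address 01 and the values 00 and FF come from the description, while the other addresses, the file size, and all function names are hypothetical. The actual layout is the one defined in FIG. 3.

```python
NO_FAILURE = 0x00  # value given in the text for "no failure"
FAILURE = 0xFF     # value given in the text for "failure present"

# Hypothetical address map. Only AP0 (301) -> address 0x01 is stated in the
# description; every other entry is an assumption made for illustration.
ADDRESS_OF = {
    "IP": 0x00,
    "AP0": 0x01,
    "AP1": 0x02,
    "PS0": 0x03,
    "PS1": 0x04,
}


def new_registration_file(size: int = 16) -> bytearray:
    """Return an in-memory stand-in for the failure registration file."""
    return bytearray([NO_FAILURE] * size)


def register_failure(reg: bytearray, part: str) -> None:
    """Mark one redundant part as failed."""
    reg[ADDRESS_OF[part]] = FAILURE


def failed_parts(reg: bytearray) -> list[str]:
    """List the parts whose byte is set to the failure value."""
    return [name for name, addr in ADDRESS_OF.items() if reg[addr] == FAILURE]


reg = new_registration_file()
register_failure(reg, "AP0")
print(failed_parts(reg))
```

Keeping one byte per part means the file can be scanned for any nonzero entry without knowing which part each address represents, which is all the later checks need.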

Failure information registered in the failure registration file is reset under the following conditions.
(1) During customer operation (customer mode), when the faulty part is replaced and incorporated into the system again. (2) During maintenance work (maintenance mode), when power-on processing of the entire system is performed. Case (2) corresponds to stopping the system and replacing the faulty parts, so all failures are regarded as repaired and all failure information is reset. If some failures remain unrepaired (that is, not every one of several faulty parts has been dealt with), the failure information for the parts still at fault is registered in the failure registration file again.
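The two reset conditions can be sketched in Python as follows. This is a minimal illustration, assuming the file is a byte array with 00 for "no failure" and FF for "failure present"; the function names and the `still_faulty` argument are hypothetical, and the text does not describe how unresolved failures are detected at power-on.

```python
NO_FAILURE, FAILURE = 0x00, 0xFF


def reset_on_part_replacement(reg: bytearray, addr: int) -> None:
    """Customer mode: the replaced part is back in the system, clear its entry."""
    reg[addr] = NO_FAILURE


def reset_on_system_power_on(reg: bytearray, still_faulty: set[int]) -> None:
    """Maintenance mode: clear everything, then re-register unresolved parts.

    `still_faulty` stands in for whatever the hardware reports during
    power-on; how that detection works is not described in the text.
    """
    for addr in range(len(reg)):
        reg[addr] = NO_FAILURE
    for addr in still_faulty:
        reg[addr] = FAILURE


# Two parts had failed; only the one at address 1 was repaired before power-on.
reg = bytearray([NO_FAILURE, FAILURE, FAILURE, NO_FAILURE])
reset_on_system_power_on(reg, still_faulty={2})
print(reg.hex())
```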

Of the methods for reporting a disconnected redundant part of the computer system, the method used during maintenance work will be described with reference to FIGS. 4 to 6 and Table 1.

FIG. 4 shows an outline flow of maintenance work. The maintenance personnel normally take over the computer system from the customer, switch the operation changeover switch to maintenance mode (3001), and then perform the maintenance inspection. They then perform system initial setting (3003) and, after confirming that it has ended normally, switch the operation changeover switch to customer mode (3004) and hand the computer system over to the customer. With the present invention, a failure of a redundant part that occurs during the maintenance inspection (3002) is reported at system initial setting (3003), and a failure of a redundant part that occurs between system initial setting and the switch to customer mode is reported by the periodic redundant part check process (3006) after the switch to customer mode.

FIG. 5 shows the details of the computer system initial setting process. Steps 4000 to 4003 of the system initial setting process start up the system, so they are executed regardless of the state of the operation changeover switch. Therefore, once the processing up to IOP initial setting (4003) has ended normally, the system can be operated even if one of the redundant parts has a failure. Accordingly, when the operation changeover switch is in customer mode, the system initial setting process is ended normally so that the system can be operated (4006). When the operation changeover switch is in maintenance mode, the failure registration file is searched for failure information. If none is registered, the system initial setting process is ended normally (4006). If failure information is registered, the system initial setting process is terminated abnormally (4007) even though the system could be operated. At this time, a message indicating that a failure has occurred in one of the redundant parts is displayed on the console device (100) of FIG. 2 (4008). The maintenance personnel then identify the faulty part on the redundant part status display screen and repair it.
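The branch structure of FIG. 5 can be sketched in Python as follows. This is an illustration under assumptions: the `Mode` enum, the function name, the `startup_ok` flag that collapses steps 4000 to 4003 into one value, and the message strings are all hypothetical, and the handling of a failed startup is not detailed in the text.

```python
from enum import Enum


class Mode(Enum):
    CUSTOMER = "customer"
    MAINTENANCE = "maintenance"


def system_initial_setting(mode, registration, startup_ok=True):
    """Return (result, console_message) following the FIG. 5 decisions.

    mode          -- position of the operation changeover switch
    registration  -- bytes of the failure registration file (0x00 = no failure)
    startup_ok    -- outcome of steps 4000 to 4003, simplified to a flag
    """
    if not startup_ok:
        # Startup itself failed; the text does not detail this path.
        return "abnormal", "system startup failed"
    if mode is Mode.CUSTOMER:
        # Customer mode: the system must come up even with a degraded part (4006).
        return "normal", None
    if any(b != 0x00 for b in registration):
        # Maintenance mode with a recorded failure: stop here (4007) and
        # tell the maintenance personnel on the console (4008).
        return "abnormal", "failure recorded in a redundant part"
    return "normal", None  # maintenance mode, nothing recorded (4006)


print(system_initial_setting(Mode.MAINTENANCE, bytes([0x00, 0xFF])))
print(system_initial_setting(Mode.CUSTOMER, bytes([0x00, 0xFF])))
```

The same registration file yields different results in the two modes, which is the point of the scheme: a degraded but operable system is blocked only while maintenance personnel are present to repair it.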

FIG. 6 shows the periodic notification process for failures of redundant parts during system operation. In customer mode, as described above, the system initial setting process cannot be terminated abnormally because of a failure in one of the redundant parts. Instead, the failure registration file is searched periodically (about once a day), and if failure information is registered, a failure code is generated and the maintenance center is notified automatically that one of the redundant parts has failed (5004). The failure registration file is attached to the report data so that the maintenance center can decode it and identify the faulty part. If no failure information is registered in the failure registration file, no automatic report is made (5003 to 5002). In maintenance mode, maintenance personnel are near the computer system, so the periodic notification process does not run (5001 to 5002).
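One pass of the periodic check in FIG. 6 can be sketched in Python as follows. The function name, the string mode values, the `send_report` callback, and the failure code value are hypothetical; the text does not specify the failure code format or the transport used to reach the maintenance center.

```python
def periodic_redundancy_check(mode, registration, send_report):
    """Run one pass of the periodic check; return True if a report was sent.

    mode         -- "customer" or "maintenance" (switch position)
    registration -- bytes of the failure registration file (0x00 = no failure)
    send_report  -- callable(code, attachment); stands in for the link to
                    the maintenance center, which the text does not specify
    """
    if mode != "customer":
        return False  # maintenance mode: personnel are on site (5001 to 5002)
    if all(b == 0x00 for b in registration):
        return False  # nothing recorded, no automatic report (5003 to 5002)
    failure_code = "REDUNDANT-PART-FAILURE"  # hypothetical code value
    send_report(failure_code, bytes(registration))  # attach the file (5004)
    return True


sent = []
periodic_redundancy_check(
    "customer",
    bytes([0x00, 0xFF]),
    lambda code, data: sent.append((code, data)),
)
print(sent)
```

In a real system this function would be driven by a scheduler at roughly daily intervals, as the description suggests; the scheduling itself is left out of the sketch.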

FIG. 1 is a diagram showing the concept of the redundant part disconnection notification function.
FIG. 2 is a block diagram of a computer system with a redundant configuration.
FIG. 3 shows the file format in which failure information is stored for each device when a failure occurs in the computer system of FIG. 2.
FIG. 4 is an outline flowchart of computer system maintenance.
FIG. 5 is a flowchart of the check processing at computer system initial setting (SYSIML).
FIG. 6 is a flowchart of the periodic check for disconnected redundant parts during computer system operation.

Explanation of symbols

100 Console device (F-SVP)
200 Computer system control and monitoring device (C-SVP)
300 Instruction processor (IP)
301 to 302 Alternate processor (AP)
400 to 401 Magnetic disk device
500 to 501 Power supply unit
600 to 601 Channel control device (CHP)
700 Input/output device (IOP)
800 to 803 Channel device (CH)
900 SC
1000 to 1001 SC I/F
1100 Computer system mode switch

Claims

A failure processing method for a computer system having redundant parts such as an instruction processor, an input/output processor, and a power supply unit, and a service processor for controlling them, the computer system comprising an external storage device for storing failures in the redundant parts and means for switching the computer operation mode, characterized in that, when system initial setting is executed while the computer operation mode switching means is in maintenance mode, the service processor searches the failure registration file recorded in the external storage device and, if failure information is registered, terminates the system initial setting process abnormally.