JP2005084947A - Fault recovery system - Google Patents
- Publication number
- JP2005084947A (application number JP2003316174A)
- Authority
- JP
- Japan
- Prior art keywords
- failure
- computer system
- maintenance
- redundant
- customer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Test And Diagnosis Of Digital Computers (AREA)
- Stored Programmes (AREA)
- Hardware Redundancy (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
The present invention relates to a failure processing method for a computer system whose CPU is internally redundant, and in particular to a failure processing method for the case where a redundant part is disconnected during maintenance work.
In recent years, computer systems have been required to operate 24 hours a day, and redundant computer systems are becoming mainstream. In such a redundant system, the disconnection of one redundant part because of a failure does not stop the system. However, if a new failure then occurs before the first failure has been corrected, the system may stop.
The computer system has means for switching between maintenance work and customer operation: maintenance work is performed in maintenance mode, and customer operation runs in customer mode. During maintenance mode, maintenance personnel work close to the computer system, so a failure in one of the redundant parts is not reported to the maintenance center. Consequently, if the maintenance personnel overlook the failure, the computer system is handed over to the customer with the failure still present.
Suppose a redundant part is disconnected because of a failure during maintenance work. An error code reporting the failure (hereinafter, RC) is displayed on the console device, but if the maintenance personnel overlook the RC for some reason, the system is started up and put into customer operation with a failure present in one of the redundant parts.
Prior art related to the present invention includes Patent Document 1, Patent Document 2, and Patent Document 3.
As described above, conventional computer systems take no measures to prevent maintenance personnel from overlooking a failure during maintenance work.
An object of the present invention is to provide a failure processing method that, when a redundant part is disconnected from the system during maintenance work in maintenance mode, notifies the maintenance personnel of the failure before customer operation begins wherever possible, so that the computer system can be handed over to the customer free of failures.
The system has means for recording failures that occur in the computer system, means for displaying the recorded contents on the console device so that the failed part can be identified, and means for distinguishing maintenance mode from customer mode (switching between the two is done manually). When a failure disconnects one of the redundant parts from the system while in maintenance mode, the disconnection is recorded in a failure registration file.
Thereafter, in maintenance mode, the failure registration file is consulted during system initial setting. If a failure is recorded, the system initial setting is terminated abnormally and the maintenance personnel are notified that a failure has occurred, which prevents the system from being handed over to the customer with a failure present.
A redundant-part failure that occurs after system initial setting has ended normally but before the switch to customer mode is handled after the switch: the failure registration file is consulted periodically, and if a failure is recorded, the maintenance center is notified. The failure registration file is attached to the notification, and the maintenance center analyzes it to identify the failed part.
The failure processing method of the present invention can prevent a system-down caused by a failure of a redundant part that occurred during maintenance work.
The concept of the failure processing method of the present invention is described with reference to FIG. 1.
A failure of redundant part A that occurs at time t1 in maintenance mode is reported by the system initial setting process at time t2. A failure of redundant part B that occurs at time t3 in maintenance mode is reported, after the switch to customer mode, by the periodic redundant part check at time t4. A failure of redundant part C that occurs at time t5 in customer mode is reported as soon as it occurs. However, even if system initial setting is performed at time t6 before the failure of redundant part C has been corrected, the system starts up normally. If the failure of redundant part C remains uncorrected for a long period, it is reported again by the periodic redundant part check at time t7.
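The three reporting paths in FIG. 1 can be summarized in a short sketch. This is only an illustration of the timeline described above; the function name, its two boolean parameters, and the wording of the returned strings are assumptions, not part of the patent.

```python
def reporting_path(in_maintenance_mode: bool, after_initial_setting: bool) -> str:
    """Return how a redundant-part failure gets reported, per the FIG. 1 timeline.

    Both parameters are hypothetical inputs describing when the failure occurred.
    """
    if not in_maintenance_mode:
        # Redundant part C: customer mode, reported when the failure occurs.
        return "reported immediately on occurrence"
    if after_initial_setting:
        # Redundant part B: maintenance mode, after system initial setting.
        return "reported by the periodic check after switching to customer mode"
    # Redundant part A: maintenance mode, before system initial setting.
    return "reported by abnormal end of the next system initial setting"


cases = [("A (t1)", True, False), ("B (t3)", True, True), ("C (t5)", False, True)]
for label, maintenance, after_init in cases:
    print(label, "->", reporting_path(maintenance, after_init))
```

The sketch deliberately ignores the repeat notification at t7, which is simply the periodic check firing again while the failure is still registered.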
For a computer system in which the instruction processors, input/output processors, power supply units, and so on are redundant, an embodiment of the failure processing method for the case where a redundant part is disconnected from the system because of a failure during maintenance work is described with reference to the drawings, divided into processing during maintenance work and processing during customer operation.
FIG. 2 is a block diagram showing the configuration of a computer system according to an embodiment of the present invention. Broadly, it consists of an F-SVP (100), a C-SVP (200), PUs (300 to 302), HDDs (400 to 401), PSs (500 to 501), CHPs (600 to 601), an IOP (700), CHs (800 to 803), an SC (900), SC I/Fs (1000 to 1001), and a computer system mode changeover switch (1100). The PUs (300 to 302), HDDs (400 to 401), PSs (500 to 501), CHPs (600 to 601), and SC I/Fs (1000 to 1001) are redundant.
As shown in FIG. 2, because these units are redundant, operation can continue on the backup unit if an active unit fails.
The computer system of this embodiment has a switch for changing between maintenance work and customer operation (hereinafter, the operation changeover switch), which is operated manually. The state in which the switch is set to the customer operation side is called customer mode, and the state in which it is set to the maintenance work side is called maintenance mode. The software embedded in the service processor can read the state of this switch. In maintenance mode, redundant parts are checked for failures during system initial setting; in customer mode, they are checked periodically.
When a failure occurs in a redundant part of the computer system, the service processor recognizes the failure at the triggers shown in Table 1 and registers its presence in the failure registration file of FIG. 3. The failure registration file resides on the HDDs (400 to 401) shown in FIG. 2. For example, when a failure is detected in AP0 (301) of FIG. 2, the content of address 01 of the failure registration file shown in FIG. 3 is changed from 00 (no failure) to FF (failure present).
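The registration step can be sketched as follows. This is a minimal illustration, assuming the failure registration file is a flat array of one-byte entries indexed by address. Only the values 00 (no failure) and FF (failure present) and the mapping of AP0 (301) to address 01 come from the description; the file size, the function names, and the in-memory `bytearray` standing in for the file on the HDDs (400 to 401) are assumptions.

```python
NO_FAILURE = 0x00  # value given in the description for "no failure"
FAILURE = 0xFF     # value given in the description for "failure present"

# Hypothetical layout: one byte per redundant part. Only the mapping of
# AP0 (301) to address 01 comes from the description; the size is assumed.
FILE_SIZE = 16
AP0_ADDRESS = 0x01


def new_registration_file(size: int = FILE_SIZE) -> bytearray:
    """Return an in-memory stand-in for the failure registration file."""
    return bytearray([NO_FAILURE] * size)


def register_failure(reg_file: bytearray, address: int) -> None:
    """Mark the redundant part at `address` as failed."""
    reg_file[address] = FAILURE


def failed_addresses(reg_file: bytearray) -> list[int]:
    """List every address whose entry records a failure."""
    return [addr for addr, value in enumerate(reg_file) if value == FAILURE]


reg_file = new_registration_file()
register_failure(reg_file, AP0_ADDRESS)
print(failed_addresses(reg_file))
```

A fixed one-byte-per-part layout keeps the later search trivial: checking whether any failure is registered amounts to looking for a single FF byte.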
Two cases are distinguished. In the first, during customer operation (customer mode), the failed part is replaced and incorporated into the system again. In the second, during maintenance work (maintenance mode), power-on processing of the entire system is performed. This latter case corresponds to stopping the system and replacing the failed part: all failures are regarded as corrected and all failure information is reset. If not all failures have been corrected (that is, some of several failed parts remain uncorrected), the failure information for the failed parts is registered in the failure registration file again.
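The reset on whole-system power-on in maintenance mode might look like the sketch below. It assumes the same one-byte-per-part layout (00 for no failure, FF for failure present); the function name and the `still_failed` argument are hypothetical, since the description does not say how the remaining failures are detected again.

```python
NO_FAILURE = 0x00
FAILURE = 0xFF


def power_on_reset(reg_file: bytearray, still_failed: set[int]) -> None:
    """Clear every entry, then re-register parts that are still failed.

    `still_failed` is a hypothetical result of re-checking each part
    during power-on; the description does not say how it is obtained.
    """
    for addr in range(len(reg_file)):
        reg_file[addr] = NO_FAILURE
    for addr in still_failed:
        reg_file[addr] = FAILURE


# Two parts were registered as failed; only the part at address 1 was replaced.
reg_file = bytearray([0x00, 0xFF, 0x00, 0xFF])
power_on_reset(reg_file, still_failed={3})
print(reg_file.hex())
```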
The method of reporting a disconnected redundant part of the computer system during maintenance work is described with reference to FIGS. 4 to 6 and Table 1.
FIG. 4 shows an outline of the maintenance work flow. Normally, the maintenance personnel take over the computer system from the customer, set the "operation changeover switch" to maintenance mode (3001), and perform maintenance inspection. They then run system initial setting (3003), confirm that it ended normally, set the "operation changeover switch" to customer mode (3004), and hand the computer system over to the customer. With the present invention, a redundant-part failure that occurs during maintenance inspection (3002) is reported by system initial setting (3003), and a redundant-part failure that occurs between system initial setting and the switch to customer mode is reported by the periodic redundant part check (3006) after the switch to customer mode.
FIG. 5 shows the details of system initial setting. Steps 4000 to 4003 start up the system and are therefore executed regardless of the state of the "operation changeover switch". Once processing up to IOP initial setting (4003) has ended normally, the system can be operated even if one of the redundant parts has failed. Therefore, when the switch is in customer mode, system initial setting is ended normally (4006) so that the system can be operated. When the switch is in maintenance mode, the failure registration file is searched for failure information. If none is registered, system initial setting is ended normally (4006); if any is registered, system initial setting is ended abnormally (4007) even though the system could be operated. At this point a message indicating that a redundant part has failed is displayed on the console device (100) of FIG. 2 (4008). The maintenance personnel then identify the failed part on the redundant part status display screen and correct it.
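The decision at the end of system initial setting can be sketched as below. This is an illustrative reading of FIG. 5, not the service processor's actual software: the `Mode` enum, the function name, the return shape, and the message text are assumptions, and the failure registration file is modeled as bytes in which FF marks a failed part.

```python
from enum import Enum


class Mode(Enum):
    CUSTOMER = "customer"
    MAINTENANCE = "maintenance"


FAILURE = 0xFF  # value the description gives for "failure present"


def finish_initial_setting(mode: Mode, reg_file: bytes) -> tuple[bool, list[str]]:
    """Decide how initial setting ends once start-up (4000 to 4003) succeeded.

    Returns (normal_end, console_messages). The message text is hypothetical.
    """
    if mode is Mode.CUSTOMER:
        return True, []  # 4006: always end normally in customer mode
    if FAILURE not in reg_file:
        return True, []  # 4006: no failure information registered
    # 4007: end abnormally; 4008: show a message on the console device
    failed = [addr for addr, value in enumerate(reg_file) if value == FAILURE]
    messages = [f"redundant part failure registered at address {a:02X}" for a in failed]
    return False, messages


ok, messages = finish_initial_setting(Mode.MAINTENANCE, bytes([0x00, 0xFF]))
print(ok, messages)
```

Ending abnormally only in maintenance mode is the key design choice: the same registered failure never blocks a start-up in customer mode, where availability takes priority.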
FIG. 6 shows the periodic redundant part failure notification process that runs during system operation. In customer mode, as described above, system initial setting cannot be ended abnormally because of a redundant-part failure. Instead, the failure registration file is searched periodically (about once a day). If failure information is registered, a failure code is generated and the maintenance center is automatically notified that a redundant part has failed (5004). Attaching the failure registration file to the report data lets the maintenance center format the file and identify the failed part. If no failure information is registered, no automatic report is made (5003 to 5002). In maintenance mode, maintenance personnel are near the computer system, so the periodic notification process does nothing (5001 to 5002).
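One pass of the periodic check in FIG. 6 can be sketched as follows. The function name, the `send_report` callback, the report dictionary, and the failure code string are all assumptions made for illustration; the description only states that a failure code is generated and that the failure registration file is attached.

```python
FAILURE = 0xFF  # value the description gives for "failure present"


def periodic_check(customer_mode: bool, reg_file: bytes, send_report) -> bool:
    """Run one periodic redundant-part check; return True if a report was sent."""
    if not customer_mode:
        return False  # 5001 to 5002: maintenance mode, the check does nothing
    if FAILURE not in reg_file:
        return False  # 5003 to 5002: no failure information registered
    # 5004: notify the maintenance center, attaching the whole file
    report = {"code": "REDUNDANT_PART_FAILURE", "attachment": bytes(reg_file)}
    send_report(report)
    return True


sent = []  # stands in for the link to the maintenance center
periodic_check(True, bytes([0x00, 0xFF]), sent.append)
periodic_check(False, bytes([0x00, 0xFF]), sent.append)
periodic_check(True, bytes([0x00, 0x00]), sent.append)
print(len(sent))
```

Because the check reads the file rather than reacting to the failure event, a failure registered while the system was still in maintenance mode is picked up on the first pass after the switch to customer mode.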
100 Console device (F-SVP)
200 Computer system control and monitoring device (C-SVP)
300 Instruction processor (IP)
301 to 302 Alternate processors (AP)
400 to 401 Magnetic disk devices
500 to 501 Power supply units
600 to 601 Channel control devices (CHP)
700 Input/output device (IOP)
800 to 803 Channel devices (CH)
900 SC
1000 to 1001 SC I/F
1100 Computer system mode changeover switch
Claims (1)
A failure processing method for a computer system in which an instruction processor, an input/output processor, a power supply unit, and the like are redundant, the computer system having a service processor that controls them, an external storage device that stores failures of redundant parts, and means for switching the computer operation mode, characterized in that, when system initial setting is executed while the computer operation mode switching means is in maintenance mode, the service processor searches the failure registration file recorded in the external storage device and, if failure information is registered, abnormally terminates the system initial setting process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003316174A JP2005084947A (en) | 2003-09-09 | 2003-09-09 | Fault recovery system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003316174A JP2005084947A (en) | 2003-09-09 | 2003-09-09 | Fault recovery system |
Publications (1)
Publication Number | Publication Date |
---|---|
JP2005084947A (en) | 2005-03-31 |
Family
ID=34416156
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2003316174A Pending JP2005084947A (en) | 2003-09-09 | 2003-09-09 | Fault recovery system |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP2005084947A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008152552A (en) * | 2006-12-18 | 2008-07-03 | Hitachi Ltd | Computer system and failure information management method |
JP2013254423A (en) * | 2012-06-08 | 2013-12-19 | Canon Inc | Information processing apparatus and control method, and program |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI337304B (en) | Method for fast system recovery via degraded reboot | |
US7716520B2 (en) | Multi-CPU computer and method of restarting system | |
CN100517250C (en) | Apparatus and method for controlling RAID array rebuild | |
US7751310B2 (en) | Fault tolerant duplex computer system and its control method | |
WO2018095107A1 (en) | Bios program abnormal processing method and apparatus | |
US8713236B2 (en) | Maintenance guidance display device, maintenance guidance display method, and maintenance guidance display program | |
CN102541682A (en) | Method for restoring abnormal programs in embedded system quickly and automatically | |
KR20080028751A (en) | Information processing apparatus, control apparatus therefor, control method therefor and control program | |
JP2016200981A (en) | Operation management program, operation management method and operation management device | |
JP2010067115A (en) | Data storage system and data storage method | |
JP5104479B2 (en) | Information processing device | |
JP2005084947A (en) | Fault recovery system | |
JP2014191491A (en) | Information processor and information processing system | |
JP3551079B2 (en) | Recovery method and device after replacement of modified load module | |
JP2006065440A (en) | Process management system | |
JP2007233667A (en) | Method of detecting fault | |
JP2007207169A (en) | Operation monitoring work support apparatus | |
JP5734107B2 (en) | Process failure determination and recovery device, process failure determination and recovery method, process failure determination and recovery program, and recording medium | |
US8181162B2 (en) | Manager component for checkpoint procedures | |
JP2011159234A (en) | Fault handling system and fault handling method | |
JP2005157462A (en) | System switching method and information processing system | |
US7962781B2 (en) | Control method for information storage apparatus, information storage apparatus and computer readable information recording medium | |
JP2006146685A (en) | Multi-node system and failure restoration method | |
JP2006024066A (en) | Client server system | |
JP2019168928A (en) | Urgency determination device, urgency determination method, and urgency determination program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2006-03-09 | A621 | Written request for application examination | JAPANESE INTERMEDIATE CODE: A621 |
2006-04-21 | RD01 | Notification of change of attorney | JAPANESE INTERMEDIATE CODE: A7421 |
2008-04-01 | A131 | Notification of reasons for refusal | JAPANESE INTERMEDIATE CODE: A131 |
2008-06-02 | A521 | Written amendment | JAPANESE INTERMEDIATE CODE: A523 |
2008-06-24 | A02 | Decision of refusal | JAPANESE INTERMEDIATE CODE: A02 |