JP2010282326A

JP2010282326A - Information processing system, failure countermeasure mechanism for the same, and failure countermeasure method for the same

Info

Publication number: JP2010282326A
Application number: JP2009133837A
Authority: JP
Inventors: Yukio Saito; 幸男齋藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-06-03
Filing date: 2009-06-03
Publication date: 2010-12-16
Anticipated expiration: 2029-06-03
Also published as: JP5532687B2

Abstract

PROBLEM TO BE SOLVED: In a failure countermeasure based on hot standby failover in an information processing system configured of multiplexed computer devices, when the computer device in operation becomes unable to respond while it still has a logical disk mounted and a spare computer device attempts to mount that logical disk, the attempt is recognized as a double mount request and the mount is prevented, so that the spare computer device cannot take over processing from the failed computer device.

SOLUTION: A failure countermeasure mechanism based on hot standby failover, in an information processing system configured of multiplexed computer devices connected to a common external storage device, includes: a monitoring means for individually monitoring the occurrence of failures in the computer devices; an access interruption means for, when the occurrence of a failure is detected, interrupting input/output access to the external storage device from the computer device in which the failure has occurred; and a notification means for notifying the other computer devices of the failure occurrence information.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an information processing system configured by connecting computer devices and a disk array device, and more particularly to a failure handling mechanism and a failure handling method for the computer devices.

Today, the information processing systems that make up computer networks are growing in social importance, and high availability and reliability are demanded of them. In a server system, for example, when a failure occurs in a server, responses such as notifying the administrator, stopping the failed server to prevent malfunction, and taking recovery measures must be carried out promptly.

In recent years, introducing a failover function has become common as a countermeasure against server system failures. Failover is a function by which an alternative server automatically takes over processing and data when a failure occurs in a server. Specifically, for example, a plurality of servers including a spare are connected to a common disk array device, and the system is configured so that each server can mount a logical disk in the disk array device and execute processing. When a failure occurs in the server in operation, the spare server can mount that logical disk and take over the processing of the operation server.

Note that failover comes in two methods, called cold standby and hot standby.

In cold standby, a server equivalent to the normally operated operation server is prepared and kept waiting as a spare server without being run. If a failure occurs in the operation server, the spare server automatically starts up and begins processing in place of the operation server. With this method, switching servers takes some time, and the system is stopped during the switch. Patent Document 1 discloses an example of a technique for shortening the system downtime until failover is executed in the cold standby method. In the cold standby failover of Patent Document 1, a management server is prepared separately from the operation server and the spare server, and this management server performs the information management and control for system switching.

Hot standby, by contrast, is a method in which the spare server is kept running at all times alongside the operation server, and when a failure occurs in the operation server, the spare server is immediately switched in as the operation server. Hot standby costs more than cold standby but provides higher availability and reliability. Patent Document 2 discloses an example of hot standby failover. The technology disclosed in Patent Document 2 is a system in which the operation server and the spare server constantly monitor each other's operating state and share processing status information, so that even if a failure occurs in the operation server, the spare server takes over processing without interruption.

Patent Document 1: JP 2008-293245 A
Patent Document 2: JP-A-11-259326

However, in the hot standby failover technology described above, the servers themselves take the lead in monitoring for failures and in responses such as system switching when a failure occurs. In actual operation, depending on the nature of the failure and the circumstances in which it occurs, this could give rise to the following problems.

The first problem is that when a failure occurs in the operation server and it becomes unable to respond while a logical disk in the disk array device is still mounted, system switching to the spare server may not succeed.

The reason is that when the spare server, in order to take over processing, tries to mount the logical disk that the operation server had mounted, the disk array device sees this as a double mount request and may prevent the mount. Consequently, although the spare server has detected the failure of the operation server, it cannot mount the logical disk in the disk array device and the service cannot be continued.

The second problem is that even when system switching is performed normally, the data on the logical disk may be destroyed if the original operation server, which was switched out as failed, later recovers suddenly, for example through a reboot.

The reason is that the original operation server stopped at the time of the failure without being able to release its mount of the logical disk in the disk array device, and when it recovers it starts accessing that logical disk again. That is, when the recovered original operation server again accesses the logical disk that the current operation server has mounted, a double mount may occur and data may be destroyed.

An object of the present invention is to solve the above problems and to provide, in an information processing system configured of multiplexed computer devices, a failure handling mechanism and a failure handling method for computer devices in which prevention of double mounting of logical disks is managed.

The failure handling mechanism of an information processing system of the present invention is a failure handling mechanism in an information processing system that is configured by connecting at least two computer devices, including a spare computer device, to a common external storage device, and that constitutes multiplexed computer devices in which the spare computer device always stands by while maintaining the same operating state as the computer device in operation. The mechanism has: monitoring means for individually monitoring the occurrence of failures in the computer devices; access blocking means for, when the monitoring means detects that a failure has occurred in any of the computer devices, blocking, in response to that detection, input/output access to the external storage device from the computer device in which the failure has occurred; and notification means for, when the monitoring means detects that a failure has occurred in any of the computer devices, notifying, in response to that detection, the failure occurrence information to those computer devices in which no failure has occurred.

The failure handling method for an information processing system of the present invention is a failure handling method in an information processing system that is configured by connecting at least two computer devices, including a spare computer device, to a common external storage device, and that constitutes multiplexed computer devices in which the spare computer device always stands by while maintaining the same operating state as the computer device in operation, the method having:
a step of individually monitoring the occurrence of failures in the computer devices;
a step of blocking, when the monitoring means detects that a failure has occurred in any of the computer devices and in response to that detection, input/output access to the external storage device from the computer device in which the failure has occurred; and
a step of notifying, when the monitoring means detects that a failure has occurred in any of the computer devices and in response to that detection, the failure occurrence information to those computer devices in which no failure has occurred.

According to the present invention, in an information processing system constituting multiplexed computer devices, even when the operating computer device becomes unable to respond while it still has a logical disk mounted, the spare computer device is no longer prevented from mounting the logical disk. In addition, even when the failed operating computer device suddenly recovers, it no longer causes data destruction through a double mount.

FIG. 1A is a block diagram showing the basic configuration of the embodiment of the present invention.
FIG. 1B is a flowchart showing the basic operation of the embodiment of the present invention.
FIG. 2 is a system configuration diagram of the duplex server system according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the operation procedure of the duplex server system according to the embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

FIGS. 1A and 1B are, respectively, a block diagram showing the basic system configuration and a flowchart showing the basic operation of the failure handling mechanism and failure handling method of the information processing system according to the embodiment of the present invention.

Referring to FIG. 1A, the failure monitoring mechanism of this system is equipped with failure monitoring means, access blocking means, and failure notification means. The failure monitoring means individually monitors the occurrence of failures in each computer device connected to the external storage device. When the failure monitoring means detects that a failure has occurred in any of the computer devices in operation, the access blocking means blocks input/output access to the external storage device from the failed computer device. The failure notification means, for its part, notifies each normally operating computer device of the failure occurrence information.

FIG. 1B shows the basic operation in an information processing system configured of multiplexed computer devices, in which a plurality of computer devices, including computer devices in operation and a spare computer device, are connected to a common external storage device. Each computer device in operation mounts its own predetermined area of the external storage device, accesses the operation data, and executes processing. The spare computer device stands by in preparation for a failure of a computer device in operation, always maintaining the same operating state as the computer device in operation.

The failure monitoring means individually monitors the occurrence of failures in each computer device connected to the external storage device. When it detects that a failure has occurred in one of the computer devices in operation, the access blocking means blocks input/output access to the external storage device from the failed computer device. Furthermore, the failure notification means notifies each normally operating computer device of the failure occurrence information.

Through the above steps, the following effects are obtained when the spare computer device that is to take over the processing of the failed computer device receives the failure occurrence notification, switches the system, and takes over the processing. To take over the processing, the spare computer device needs to mount the predetermined area of the external storage device that the failed computer device was using and access the operation data. Even if the failed computer device stopped with the external storage device still mounted, input/output access from that computer device to the external storage device has already been blocked as of the time it stopped. Therefore, even when the spare computer device mounts the same area of the external storage device in order to take over the processing, no problem due to double mounting occurs. For this reason, the external storage device has no need to prevent the mount by the spare computer device, and system switching can be performed smoothly.

Moreover, even if the computer device that had stopped due to the failure unexpectedly resumes operation after the spare computer device has taken over the processing, its access to the external storage device has already been blocked, so it cannot process the operation data again. In this case as well, therefore, the problem of data being destroyed by a double mount does not arise.
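The basic operation of FIG. 1B can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the class name `FailureHandlingMechanism`, its fields, and the device names are hypothetical, and the sketch assumes that blocking is recorded before any notification is sent.

```python
from dataclasses import dataclass, field


@dataclass
class FailureHandlingMechanism:
    """Hypothetical model of the three means shown in FIG. 1A."""
    devices: set[str]                                   # all connected computer devices
    blocked: set[str] = field(default_factory=set)      # devices whose I/O is cut off
    notifications: list[tuple[str, str]] = field(default_factory=list)

    def on_failure_detected(self, failed: str) -> None:
        # Access blocking means: cut off I/O from the failed device first.
        self.blocked.add(failed)
        # Failure notification means: inform every device still operating normally.
        for device in sorted(self.devices - self.blocked):
            self.notifications.append((device, failed))

    def io_allowed(self, device: str) -> bool:
        # A blocked device stays blocked even if it later resumes operation.
        return device in self.devices and device not in self.blocked


mechanism = FailureHandlingMechanism(devices={"operating", "spare"})
mechanism.on_failure_detected("operating")
```

Because the block is held by the mechanism rather than by the failed device, a device that resumes operation unexpectedly is still refused by `io_allowed`.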

FIG. 2 is a diagram showing the system configuration of a duplex server system, as an example of a specific system configuration of the embodiment of the present invention.

The operation server 10 and the spare server 20, serving as the computer devices, both have general server functions and are connected via interfaces to a common disk array device 30 serving as the external storage device. The operation server 10 mounts a logical disk of the disk array device and executes processing. The spare server 20 stands by in a running state so that it can take over processing immediately when the operation server fails. The operation server 10 and the spare server 20 have an interface through which they can monitor each other's state, and they keep their data synchronized with each other. The operation server 10, the spare server 20, and the disk array device 30 together constitute a duplex system having a hot standby failover function.

The disk array device 30 includes a logical disk 31 composed of a plurality of hard disks and a monitoring disk 32 to which each server periodically writes. On the monitoring disk 32, a dedicated target area is allocated to each of the operation server 10 and the spare server 20, so that each can write to its own area. Each server periodically accesses the monitoring disk 32 and writes an access record. The failure handling device 40 installed in the disk array device 30 has, as the failure monitoring means, a function of checking that writes to the monitoring disk 32 have taken place and thereby confirming the operating state of each server. As the access blocking means, the failure handling device 40 has a function of blocking access to the logical disk 31 from the affected server when it detects an abnormality in a server. Furthermore, as the failure notification means, the failure handling device 40 has a function of notifying the normal server and the system management facility 50 that an abnormality has occurred.

Next, with reference to the flowchart shown in FIG. 3, the specific failure handling operation is described for the operation server 10, the spare server 20, the logical disk 31, the monitoring disk 32, and the failure handling device 40 that make up this system.

The operation server 10 and the spare server 20 each periodically access the monitoring disk 32 in the disk array device 30 and write failure monitoring data, for example their own heartbeat signal code (Step 1).

The failure handling device 40 confirms that each server is healthy by monitoring that these writes are performed periodically (Step 2).

When a failure occurs in the operation server and its writes cease, the failure handling device 40 determines that a failure has occurred in the operation server (Step 3).
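Steps 1 to 3 can be sketched in Python as follows. This is an illustration under stated assumptions only: the source does not give a write interval or a detection timeout, so `timeout` is a hypothetical parameter, and `MonitoringDisk` and `find_failed_servers` are hypothetical names. Time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringDisk:
    """Hypothetical stand-in for monitoring disk 32: one dedicated area per server."""
    areas: dict[str, tuple[float, str]] = field(default_factory=dict)

    def write(self, server: str, now: float, record: str) -> None:
        # Step 1: a server writes failure monitoring data into its own area.
        self.areas[server] = (now, record)


def find_failed_servers(disk: MonitoringDisk, now: float, timeout: float) -> list[str]:
    """Steps 2-3: a server whose last write is older than `timeout` is judged failed."""
    return sorted(
        server
        for server, (written_at, _record) in disk.areas.items()
        if now - written_at > timeout
    )


disk = MonitoringDisk()
disk.write("operation_server_10", now=0.0, record="heartbeat")
disk.write("spare_server_20", now=0.0, record="heartbeat")
disk.write("spare_server_20", now=5.0, record="heartbeat")  # server 10 has gone silent
failed = find_failed_servers(disk, now=12.0, timeout=10.0)
```

Here only the operation server has missed its writes for longer than the assumed timeout, so it alone is judged to have failed. The same check applies unchanged when more than two servers write to the monitoring disk.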

The failure handling device 40 disconnects the operation server 10, which it has judged to have failed, by deleting that server's access right to the logical disk 31 (Step 4). This allows the spare server 20 to mount the logical disk 31 in order to take over the processing of the operation server 10. In addition, even if the stopped operation server 10 later restarts unexpectedly, for example through a reboot, it can no longer access the logical disk 31.

The failure handling device 40 notifies both the spare server 20 and the system management facility 50 that a failure has occurred in the operation server 10 (Step 5).

The spare server 20, notified by the failure handling device 40 of the failure of the operation server 10, immediately carries out system switching processing so that it becomes the operation server itself (Step 6). The spare server 20 then mounts the logical disk 31 that the failed operation server 10 had mounted and takes over the processing of the operation server 10 (Step 7).

Note that the blocking of the operation server 10's input/output access to the logical disk 31 in Step 4 is a process carried out by the failure handling device 40; even if the operation server 10 restarts, its access is not restored automatically. After the system management facility has been notified of the failure in Step 5, the system administrator arranges an operator for recovery, and the recovery work is always performed with a maintenance worker involved (Step 8).
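The access-right handling of Step 4 and the maintenance-only recovery of Step 8 can be sketched in Python as follows. The class `DiskArray` and its methods are hypothetical and are not taken from the source. In particular, the sketch assumes that deleting an access right also releases that server's stale mount, and it models the maintenance worker's involvement as a simple boolean flag.

```python
class DiskArray:
    """Hypothetical model of access rights to logical disk 31."""

    def __init__(self, servers):
        self.access_rights = set(servers)  # servers permitted to access the logical disk
        self.mounted_by = None

    def revoke(self, server):
        # Step 4: the failure handling device deletes the failed server's access right.
        self.access_rights.discard(server)
        if self.mounted_by == server:
            self.mounted_by = None         # assumed: the stale mount no longer blocks takeover

    def mount(self, server):
        if server not in self.access_rights:
            raise PermissionError(f"{server} has no access right")
        if self.mounted_by not in (None, server):
            raise RuntimeError("double mount request refused")
        self.mounted_by = server

    def restore(self, server, confirmed_by_maintenance: bool):
        # Step 8: access returns only through maintenance work, never on reboot.
        if not confirmed_by_maintenance:
            raise PermissionError("restoration requires a maintenance worker")
        self.access_rights.add(server)


array = DiskArray(["server_10", "server_20"])
array.mount("server_10")
array.revoke("server_10")   # server 10 hangs with the disk still mounted
array.mount("server_20")    # Step 7: the takeover mount succeeds
```

A rebooted `server_10` that calls `mount` again is refused with `PermissionError` until `restore` is called with the maintenance confirmation.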

Thus, unlike the case in which failover is performed simply by the servers monitoring each other's state to detect a failure, in this embodiment it is the disk array device side that takes the lead in detecting a failure in each server. As soon as the disk array device detects a failure, it blocks input/output access from the failed server, so a double mount can no longer occur. The operation in which the disk array device prevents the spare server from mounting the logical disk at system switching can therefore be eliminated as unnecessary, and system switching always proceeds smoothly. In addition, even when the failed operation server suddenly recovers, the block on its input/output access to the disk array device is not lifted automatically, so it no longer causes data destruction or similar damage.

The embodiment of the present invention described above used the heartbeat signal code of each server as the failure monitoring data that each server writes to the monitoring disk 32 in Step 1 of FIG. 3. As another embodiment of the present invention, the internal time information managed by each server can be used as this failure monitoring data. This embodiment has the advantage that, when a failure occurs in a server and its writes of time information to the monitoring disk 32 cease, the time at which the failure occurred can be identified to some extent from the last time data written, which is useful for failure analysis.
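How the last written time narrows down the failure time can be illustrated with a short Python sketch. The function name and the write interval are hypothetical; the source does not state an interval. The sketch assumes the server writes at a fixed interval and that its internal clock is trustworthy up to the last write.

```python
from datetime import datetime, timedelta


def estimate_failure_window(last_written: datetime, write_interval: timedelta):
    """Return the (earliest, latest) bounds on when the server stopped writing.

    The server was alive at `last_written` and missed the next scheduled write,
    so under these assumptions it failed within one write interval.
    """
    return last_written, last_written + write_interval


earliest, latest = estimate_failure_window(
    datetime(2009, 6, 3, 12, 0, 0), timedelta(seconds=5)
)
```

A shorter write interval gives a tighter window, at the cost of more frequent writes to the monitoring disk.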

A duplex server system has been described above as an embodiment of the present invention, but the present invention can of course be used in the same way in a general server system in which three or more servers are connected and multiplexed. That is, even if many operation servers 10 and spare servers 20 were present in FIG. 3, the failure handling device 40 need only carry out, based on the flow of FIG. 1B, the procedure of responding individually to a failure in each server.

10 Operation server
20 Spare server
30 Disk array device
31 Logical disk
32 Monitoring disk
40 Failure handling device
50 System management facility

Claims

1. A failure handling mechanism in an information processing system that is configured by connecting at least two computer devices, including a spare computer device, to a common external storage device, and that constitutes multiplexed computer devices in which the spare computer device always stands by while maintaining the same operating state as the computer device in operation, the failure handling mechanism having:
monitoring means for individually monitoring the occurrence of failures in the computer devices;
access blocking means for, when the monitoring means detects that a failure has occurred in any of the computer devices, blocking, in response to that detection, input/output access to the external storage device from the computer device in which the failure has occurred; and
notification means for, when the monitoring means detects that a failure has occurred in any of the computer devices, notifying, in response to that detection, the failure occurrence information to those computer devices in which no failure has occurred.

2. The failure handling mechanism of an information processing system according to claim 1, wherein the computer device is a server, and the information processing system constitutes a multiplexed server system.

3. The failure handling mechanism of an information processing system according to claim 1, wherein the external storage device is a disk array device having a monitoring storage area into which each computer device individually writes failure monitoring data.

4. The failure handling mechanism of an information processing system according to claim 3, wherein the monitoring means monitors the status of the writing of the failure monitoring data to the monitoring storage area that is periodically performed by the computer devices.

5. The failure handling mechanism of an information processing system according to claim 4, wherein the failure monitoring data is internal time information individually managed by the computer device.

6. The failure handling mechanism of an information processing system according to any one of claims 1 to 5, further having means for notifying a pre-registered notification destination of the failure occurrence information when the monitoring means detects that a failure has occurred in any of the computer devices.

7. An information processing system comprising: the failure handling mechanism according to any one of claims 1 to 6, the at least two computer devices, and the external storage device.

8. A failure handling method in an information processing system that is configured by connecting at least two computer devices, including a spare computer device, to a common external storage device, and that constitutes multiplexed computer devices in which the spare computer device always stands by while maintaining the same operating state as the computer device in operation, the failure handling method having:
a step of individually monitoring the occurrence of failures in the computer devices;
a step of blocking, when the monitoring means detects that a failure has occurred in any of the computer devices and in response to that detection, input/output access to the external storage device from the computer device in which the failure has occurred; and
a step of notifying, when the monitoring means detects that a failure has occurred in any of the computer devices and in response to that detection, the failure occurrence information to those computer devices in which no failure has occurred.

9. The information processing system failure handling method according to claim 8, wherein the computer device is a server, and the information processing system constitutes a multiplexed server system.

10. The failure handling method for an information processing system according to claim 8 or 9, wherein the external storage device is a disk array device, and the method has a step in which each computer device individually writes failure monitoring data to a monitoring storage area of the disk array device.

11. The failure handling method for an information processing system according to claim 10, wherein the step of individually monitoring the occurrence of failures in the computer devices is performed by the external storage device monitoring the status of the writing of the failure monitoring data to the monitoring storage area that is periodically performed by the computer devices.

12. The failure handling method for an information processing system according to claim 11, wherein the failure monitoring data is internal time information individually managed by the computer device.

13. The failure handling method for an information processing system according to any one of claims 8 to 12, having a step in which, when the external storage device detects that a failure has occurred in any of the computer devices, the external storage device notifies a pre-registered notification destination of the failure occurrence information.