JP2022185768A - Information processing device and recovery method

Info

Publication number: JP2022185768A
Application number: JP2021093594A
Authority: JP
Inventors: 悦男須藤; Etsuo Sudo
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2022-12-15

Abstract

To continue operating an information processing device while a management unit that manages the device is recovered from a failure. SOLUTION: A management unit, which manages an information processing device, includes an image processing part that performs image processing. A monitoring unit monitors the management unit and outputs an abnormality detection signal when it detects an abnormality in the management unit. An information processing unit performs information processing and, when the monitoring unit outputs the abnormality detection signal, separates the image processing part on the basis of that signal and reboots the management unit. SELECTED DRAWING: Figure 2

Description

本発明は、復旧技術に関する。 The present invention relates to recovery technology.

サーバシステムの運用中に、ファームウェア又はハードウェアの問題に起因して、サーバ内のＢＭＣ（Baseboard Management Controller）がハングアップする故障が発生することがある。ＢＭＣの故障の発生頻度は、例えば、１年に数件程度である。 During operation of a server system, a firmware or hardware problem may cause the BMC (Baseboard Management Controller) in a server to hang. Such BMC failures occur, for example, a few times per year.

ＢＭＣは、リモート端末からサーバを管理するための制御用ＩＣ（Integrated Circuit）チップである。リモート端末は、ＢＭＣを介して、サーバ内のハードウェアの情報を取得したり、ハードウェアに対するリモート操作を行ったりすることができる。ＢＭＣがハングアップしてもサーバの運用に支障はないが、ハードウェアの情報の取得及びリモート操作に支障が生じるため、ＢＭＣの復旧作業が行われる。 The BMC is a control IC (Integrated Circuit) chip for managing the server from a remote terminal. Via the BMC, a remote terminal can acquire information about the hardware in the server and operate that hardware remotely. A hung BMC does not interfere with the operation of the server itself, but it does interfere with acquiring hardware information and with remote operation, so BMC recovery work is performed.

ＢＭＣの復旧作業に関連して、ＢＭＣ、又は、サーバシステムの別のコンポーネントをリセットするためのシステムが知られている（例えば、特許文献１を参照）。ＢＭＣ等のコントローラを二重化することなく、ＢＭＣ等のコントローラにストールが発生した場合でも、コンピュータシステムを継続して動作させるストール監視装置も知られている（例えば、特許文献２を参照）。 A system for resetting the BMC or another component of the server system is known in connection with the BMC recovery operation (see, for example, Patent Document 1). There is also known a stall monitoring device that allows a computer system to continue operating even when a stall occurs in a controller such as BMC without duplicating a controller such as BMC (see, for example, Patent Document 2).

特許文献１：特開２０１９－１２５３３９号公報 Patent Document 1: JP 2019-125339 A
特許文献２：特開２０１１－２０４０４６号公報 Patent Document 2: JP 2011-204046 A

ＢＭＣの復旧作業では、ＢＭＣのリセットが行われる。ＢＭＣには、オペレーティングシステム（Operating System，ＯＳ）が使用するＶＧＡ（Video Graphics Array）チップも内蔵されているため、ＢＭＣのリセットはＯＳにも影響して、サーバシステムの動作に異常が発生する。このため、ＯＳを含むサーバシステム全体を停止させてから、ＢＭＣのリセットが行われる。 In the BMC recovery work, the BMC is reset. Since the BMC also incorporates a VGA (Video Graphics Array) chip used by the operating system (OS), resetting the BMC affects the OS as well, causing anomalies in server system operations. Therefore, the BMC is reset after stopping the entire server system including the OS.

このように、ＢＭＣの復旧作業は大掛かりな作業である。顧客のサーバシステムにおいて、ＢＭＣがハングアップした場合、作業者がサーバシステムの設置場所へ出向いて、ＢＭＣの復旧作業を実施する。このため、ＢＭＣがハングアップする度に作業者が客先へ出向く手間が発生し、かつ、ＢＭＣが復旧するまでに長い時間がかかる。復旧作業によりサーバシステムの運用が長時間停止すると、顧客の業務に与える影響が大きくなる。 BMC recovery is thus a major undertaking. When the BMC hangs in a customer's server system, a worker travels to the site where the server system is installed and performs the recovery work. Every BMC hang therefore costs a site visit, and the BMC takes a long time to recover. If the recovery work stops the server system for a long time, the impact on the customer's business is large.

なお、かかる問題は、ＢＭＣの故障に限らず、情報処理装置（コンピュータ）を管理する様々な管理部の故障が発生したときに生ずるものである。 This problem is not limited to failure of the BMC, but occurs when failure occurs in various management units that manage the information processing apparatus (computer).

１つの側面において、本発明は、情報処理装置を管理する管理部を故障から復旧させる際に、情報処理装置の運用を継続することを目的とする。 An object of the present invention is to continue the operation of an information processing device when restoring a management unit that manages the information processing device from a failure.

１つの案では、情報処理装置は、管理部、監視部、及び情報処理部を含む。管理部は、画像処理を行う画像処理部を含み、情報処理装置を管理する。監視部は、管理部を監視し、管理部の異常を検出した場合、異常検出信号を出力する。情報処理部は、情報処理を行い、監視部から異常検出信号が出力された場合、異常検出信号に基づいて画像処理部を切り離し、管理部を再起動する。 In one proposal, the information processing device includes a management unit, a monitoring unit, and an information processing unit. The management unit includes an image processing unit that performs image processing, and manages the information processing device. The monitoring unit monitors the management unit and outputs an abnormality detection signal when it detects an abnormality in the management unit. The information processing unit performs information processing and, when the monitoring unit outputs the abnormality detection signal, disconnects the image processing unit on the basis of that signal and restarts the management unit.

１つの側面によれば、情報処理装置を管理する管理部を故障から復旧させる際に、情報処理装置の運用を継続することができる。 According to one aspect, the operation of the information processing device can be continued when the management unit that manages the information processing device recovers from the failure.

比較例のサーバシステムにおけるサーバのハードウェア構成図である。 FIG. 1 is a hardware configuration diagram of a server in a server system of a comparative example.
実施形態の情報処理装置の機能的構成図である。 FIG. 2 is a functional configuration diagram of an information processing device according to an embodiment.
復旧処理のフローチャートである。 FIG. 3 is a flowchart of recovery processing.
実施形態のサーバシステムのハードウェア構成図である。 FIG. 4 is a hardware configuration diagram of a server system according to the embodiment.
実施形態のサーバシステムにおけるサーバのハードウェア構成図である。 FIG. 5 is a hardware configuration diagram of a server in the server system of the embodiment.
ＢＭＣの復旧処理を示す図である。 FIG. 6 is a diagram showing BMC recovery processing.
第１の復旧処理のフローチャート（その１）である。 FIG. 7A is a flowchart (part 1) of a first recovery process.
第１の復旧処理のフローチャート（その２）である。 FIG. 7B is a flowchart (part 2) of the first recovery process.
第１の復旧処理のフローチャート（その３）である。 FIG. 7C is a flowchart (part 3) of the first recovery process.
第１の復旧処理のフローチャート（その４）である。 FIG. 7D is a flowchart (part 4) of the first recovery process.
第２の復旧処理のフローチャート（その１）である。 FIG. 8A is a flowchart (part 1) of a second recovery process.
第２の復旧処理のフローチャート（その２）である。 FIG. 8B is a flowchart (part 2) of the second recovery process.
第３の復旧処理のフローチャート（その１）である。 FIG. 9A is a flowchart (part 1) of a third recovery process.
第３の復旧処理のフローチャート（その２）である。 FIG. 9B is a flowchart (part 2) of the third recovery process.
第３の復旧処理のフローチャート（その３）である。 FIG. 9C is a flowchart (part 3) of the third recovery process.
第３の復旧処理のフローチャート（その４）である。 FIG. 9D is a flowchart (part 4) of the third recovery process.
第４の復旧処理のフローチャート（その１）である。 FIG. 10A is a flowchart (part 1) of a fourth recovery process.
第４の復旧処理のフローチャート（その２）である。 FIG. 10B is a flowchart (part 2) of the fourth recovery process.
第４の復旧処理のフローチャート（その３）である。 FIG. 10C is a flowchart (part 3) of the fourth recovery process.
第４の復旧処理のフローチャート（その４）である。 FIG. 10D is a flowchart (part 4) of the fourth recovery process.

以下、図面を参照しながら、実施形態を詳細に説明する。 Hereinafter, embodiments will be described in detail with reference to the drawings.

図１は、比較例のサーバシステムにおけるサーバのハードウェア構成例を示している。図１のサーバ１０１は、ＢＭＣ１１１、チップセット１１２、及びＣＰＵ（Central Processing Unit）１１３を含み、ＢＭＣ１１１は、ＶＧＡ１２１を含む。これらの構成要素は、ハードウェアである。ＶＧＡ１２１とチップセット１１２は、ＰＣＩ－Ｅ（Peripheral Component Interconnect-Express）バス１２２により接続されている。 FIG. 1 shows an example hardware configuration of a server in a server system of a comparative example. The server 101 of FIG. 1 includes a BMC 111 , a chipset 112 and a CPU (Central Processing Unit) 113 , and the BMC 111 includes a VGA 121 . These components are hardware. The VGA 121 and chipset 112 are connected by a PCI-E (Peripheral Component Interconnect-Express) bus 122 .

サーバシステムは、サーバ１０１に加えて、ＲＡＩＤ（Redundant Arrays of Inexpensive Disks）装置、テープ装置等の他の装置を含んでいてもよい。 In addition to the server 101, the server system may include other devices such as RAID (Redundant Arrays of Inexpensive Disks) devices and tape devices.

ＣＰＵ１１３は、基本入出力システム（Basic Input/Output System，ＢＩＯＳ）１３１及びＯＳ１３２を実行する。ＯＳ１３２は、ＶＧＡドライバ１４１を含む。ＯＳ１３２は、ＶＧＡドライバ１４１を用いてＶＧＡ１２１にアクセスすることで、ＶＧＡ１２１にビデオリダイレクション等の画像処理を行わせる。 The CPU 113 executes a basic input/output system (BIOS) 131 and an OS 132 . OS 132 includes VGA driver 141 . The OS 132 accesses the VGA 121 using the VGA driver 141 to cause the VGA 121 to perform image processing such as video redirection.

ＢＭＣ１１１の復旧作業において、ＢＭＣ１１１がリセットされると、ＶＧＡ１２１もリセットされる。ＰＣＩ－Ｅバス１２２及びチップセット１１２を介してＶＧＡ１２１に接続されているＣＰＵ１１３は、ＶＧＡ１２１がリセットされると、ＶＧＡ１２１に対するアクセスのタイムアウトを検出し、ＯＳ１３２の実行を停止する。このため、リモート端末の表示装置に、ＯＳ１３２の画面が表示されなくなる。 In the recovery work of the BMC 111, when the BMC 111 is reset, the VGA 121 is also reset. When the VGA 121 is reset, the CPU 113 connected to the VGA 121 via the PCI-E bus 122 and the chipset 112 detects timeout of access to the VGA 121 and stops the execution of the OS 132 . Therefore, the screen of the OS 132 is no longer displayed on the display device of the remote terminal.

PCI Hotplugの機能を用いてＯＳ１３２からＶＧＡ１２１を切り離すことは可能であるが、ＯＳ１３２は、ＢＭＣ１１１がハングアップしたことを検知しないため、ＶＧＡ１２１を切り離すことはない。また、ＢＭＣ１１１は、ＯＳ１３２とは独立にサーバ１０１を監視する役割を有するため、ＯＳ１３２からＢＭＣ１１１を切り離す仕組みが存在しない。 The VGA 121 could be disconnected from the OS 132 using the PCI Hotplug function, but the OS 132 does not detect that the BMC 111 has hung, so it never disconnects the VGA 121. In addition, because the role of the BMC 111 is to monitor the server 101 independently of the OS 132, there is no mechanism for disconnecting the BMC 111 from the OS 132.

このため、図１のサーバ１０１では、以下のような手順でＢＭＣ１１１の復旧作業が実施される。 Therefore, in the server 101 of FIG. 1, the recovery work of the BMC 111 is performed in the following procedure.

（Ｐ１）リモート端末のＷｅｂＵＩ（User Interface）に不具合が発生し、作業者が異常の発生に気付く。 (P1) A problem occurs in the Web UI (User Interface) of the remote terminal, and the operator notices the occurrence of an abnormality.

（Ｐ２）作業者は、サーバシステムの設置場所（サーバルーム等）へ出向いて、サーバ１０１のＬＥＤ（Light Emitting Diode）を確認する。そして、ＬＥＤが点灯から消灯に変化していることから、ＢＭＣ１１１の異常であることが判明する。 (P2) The worker goes to the place where the server system is installed (a server room or the like) and checks the LED (Light Emitting Diode) of the server 101. Because the LED has changed from lit to unlit, the worker determines that the BMC 111 is abnormal.

（Ｐ３）作業者は、サーバ１０１で実行中の処理をすべて終了させてから、ＯＳ１３２を停止する。 (P3) The operator stops the OS 132 after finishing all processes being executed by the server 101 .

（Ｐ４）作業者は、サーバ１０１以外の各装置の電源をオフにすることで、サーバシステムを停止する。 (P4) The operator shuts down the server system by turning off the power of each device other than the server 101 .

（Ｐ５）作業者は、ＢＭＣ１１１のリセットを行う。 (P5) The operator resets the BMC 111 .

（Ｐ６）作業者は、監視画面上でリセットの結果を確認し、結果が正常であれば、（Ｐ７）の作業に進み、結果が異常であれば、保守作業を依頼する。 (P6) The operator confirms the result of resetting on the monitoring screen. If the result is normal, proceed to the work of (P7), and if the result is abnormal, request maintenance work.

（Ｐ７）作業者は、各装置の電源をオンにすることで、サーバシステムを起動する。 (P7) The operator activates the server system by turning on the power of each device.

このように、ＯＳ１３２を含むサーバシステム全体を停止させてから、ＢＭＣ１１１のリセットが行われるため、作業者がサーバルーム等へ出向く手間が発生し、かつ、ＢＭＣ１１１が復旧するまでに長い時間がかかる。 Since the BMC 111 is reset after stopping the entire server system including the OS 132 in this manner, the operator has to go to a server room or the like, and it takes a long time until the BMC 111 is restored.

図２は、実施形態の情報処理装置の機能的構成例を示している。図２の情報処理装置２０１は、管理部２１１、監視部２１２、及び情報処理部２１３を含む。管理部２１１は、画像処理を行う画像処理部２２１を含み、情報処理装置２０１を管理する。 FIG. 2 shows a functional configuration example of the information processing apparatus according to the embodiment. The information processing device 201 in FIG. 2 includes a management unit 211 , a monitoring unit 212 and an information processing unit 213 . The management unit 211 includes an image processing unit 221 that performs image processing, and manages the information processing device 201 .

図３は、図２の情報処理装置２０１が行う復旧処理の例を示すフローチャートである。情報処理部２１３は、情報処理を行う（ステップ３０１）。そして、監視部２１２は、管理部２１１を監視し、管理部２１１の異常を検出した場合、異常検出信号を出力する。 FIG. 3 is a flowchart showing an example of recovery processing performed by the information processing device 201 of FIG. 2. The information processing unit 213 performs information processing (step 301). The monitoring unit 212 monitors the management unit 211 and outputs an abnormality detection signal when it detects an abnormality in the management unit 211.

監視部２１２から異常検出信号が出力された場合、情報処理部２１３は、異常検出信号に基づいて画像処理部２２１を切り離し（ステップ３０２）、管理部２１１を再起動する（ステップ３０３）。 When the abnormality detection signal is output from the monitoring section 212, the information processing section 213 disconnects the image processing section 221 based on the abnormality detection signal (step 302), and restarts the management section 211 (step 303).

図２の情報処理装置２０１によれば、情報処理装置２０１を管理する管理部２１１を故障から復旧させる際に、情報処理装置２０１の運用を継続することができる。 According to the information processing device 201 of FIG. 2, the operation of the information processing device 201 can be continued when the management unit 211 that manages the information processing device 201 is restored from a failure.

図４は、実施形態のサーバシステムのハードウェア構成例を示している。図４のサーバシステムは、サーバ４０１及びＲＡＩＤ装置４０２を含む。サーバ４０１は、図２の情報処理装置２０１に対応する。サーバ４０１とＲＡＩＤ装置４０２は、通信ネットワークにより接続されている。 FIG. 4 shows a hardware configuration example of the server system of the embodiment. The server system of FIG. 4 includes a server 401 and a RAID device 402. The server 401 corresponds to the information processing device 201 of FIG. 2. The server 401 and the RAID device 402 are connected by a communication network.

ＲＡＩＤ装置４０２は、サーバ４０１が使用するデータを記憶する。ＲＡＩＤ装置４０２は、補助記憶装置の一例である。サーバ４０１は、ＲＡＩＤ装置４０２が記憶するデータを用いて情報処理を行い、処理結果をＲＡＩＤ装置４０２に格納する。 RAID device 402 stores data used by server 401 . The RAID device 402 is an example of an auxiliary storage device. The server 401 performs information processing using data stored in the RAID device 402 and stores the processing result in the RAID device 402 .

図５は、図４のサーバ４０１のハードウェア構成例を示している。図５のサーバ４０１は、ＢＭＣ５１１、チップセット５１２、ＣＰＵ５１３、監視回路５１４、メモリ５１５、及び通信回路５１６を含む。ＢＭＣ５１１は、ＶＧＡ５２１を含む。これらの構成要素は、ハードウェアである。ＣＰＵ５１３は、プロセッサと呼ばれることもある。 FIG. 5 shows a hardware configuration example of the server 401 of FIG. 4. The server 401 of FIG. 5 includes a BMC 511, a chipset 512, a CPU 513, a monitoring circuit 514, a memory 515, and a communication circuit 516. The BMC 511 includes a VGA 521. These components are hardware. The CPU 513 may also be called a processor.

ＢＭＣ５１１、ＣＰＵ５１３、監視回路５１４、及びＶＧＡ５２１は、図２の管理部２１１、情報処理部２１３、監視部２１２、及び画像処理部２２１にそれぞれ対応する。監視回路５１４は、例えば、ＣＰＬＤ（Complex Programmable Logic Device）である。 The BMC 511, CPU 513, monitoring circuit 514, and VGA 521 correspond to the management unit 211, information processing unit 213, monitoring unit 212, and image processing unit 221 in FIG. 2, respectively. The monitoring circuit 514 is, for example, a CPLD (Complex Programmable Logic Device).

ＣＰＵ５１３、メモリ５１５、及び通信回路５１６は、チップセット５１２に接続されており、監視回路５１４は、ＢＭＣ５１１及びチップセット５１２に接続されている。ＶＧＡ５２１とチップセット５１２は、ＰＣＩ－Ｅバス５２２により接続されている。 The CPU 513 , memory 515 and communication circuit 516 are connected to the chipset 512 , and the monitoring circuit 514 is connected to the BMC 511 and chipset 512 . VGA 521 and chipset 512 are connected by PCI-E bus 522 .

メモリ５１５は、例えば、ＲＡＭ（Random Access Memory）等の半導体メモリであり、情報処理に用いられるプログラム及びデータを記憶する。通信回路５１６は、通信ネットワークに接続され、ＲＡＩＤ装置４０２と通信する。 The memory 515 is, for example, a semiconductor memory such as a RAM (Random Access Memory), and stores programs and data used for information processing. Communication circuitry 516 is connected to a communication network and communicates with RAID device 402 .

ＣＰＵ５１３は、通信回路５１６を介して、ＲＡＩＤ装置４０２からデータを取得し、取得されたデータをメモリ５１５に格納する。そして、ＣＰＵ５１３は、メモリ５１５に格納されたデータを用いてプログラムを実行することにより情報処理を行い、通信回路５１６を介して、処理結果をＲＡＩＤ装置４０２へ送信する。 The CPU 513 acquires data from the RAID device 402 via the communication circuit 516 and stores the acquired data in the memory 515 . The CPU 513 then performs information processing by executing a program using the data stored in the memory 515 and transmits the processing result to the RAID device 402 via the communication circuit 516 .

ＢＭＣ５１１は、サーバ４０１を管理する。不図示のリモート端末は、ＢＭＣ５１１を介して、サーバ４０１内のハードウェアの情報を取得したり、ハードウェアに対するリモート操作を行ったりすることができる。ＶＧＡ５２１は、ビデオリダイレクション等の画像処理を行う。 BMC 511 manages server 401 . A remote terminal (not shown) can acquire information about the hardware in the server 401 and remotely operate the hardware via the BMC 511 . The VGA 521 performs image processing such as video redirection.

ＣＰＵ５１３は、情報処理を行う際、ＢＩＯＳ５３１及びＯＳ５３２を実行する。ＢＩＯＳ５３１は、第１プログラムの一例であり、ＯＳ５３２は、第２プログラムの一例である。 The CPU 513 executes the BIOS 531 and the OS 532 when performing information processing. The BIOS 531 is an example of the first program, and the OS 532 is an example of the second program.

ＢＩＯＳ５３１は、割り込み処理ルーチン５４１を含む。割り込み処理ルーチン５４１は、ＯＳ５３２に対する割り込み要求を生成する処理と、ＣＰＵ５１３とＶＧＡ５２１との間の通信状態をチェックする処理とを含む。割り込み処理ルーチン５４１は、割り込みサービスルーチンと呼ばれることもある。 BIOS 531 includes an interrupt handling routine 541 . The interrupt processing routine 541 includes processing for generating an interrupt request to the OS 532 and processing for checking the state of communication between the CPU 513 and the VGA 521 . The interrupt handling routine 541 is sometimes called an interrupt service routine.

ＯＳ５３２は、ＶＧＡドライバ５４２を含む。ＯＳ５３２は、ＶＧＡドライバ５４２を用いてＶＧＡ５２１にアクセスすることで、ＶＧＡ５２１に画像処理を行わせる。また、ＯＳ５３２は、ＶＧＡドライバ５４２が有するPCI Hotplugの機能を用いて、ＯＳ５３２からＶＧＡ５２１を切り離すことができる。 OS 532 includes VGA driver 542 . The OS 532 accesses the VGA 521 using the VGA driver 542 to cause the VGA 521 to perform image processing. Also, the OS 532 can disconnect the VGA 521 from the OS 532 using the PCI Hotplug function of the VGA driver 542 .
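As an illustration of what an OS-side hot-plug disconnect and reconnect can look like, the following minimal sketch uses the Linux sysfs PCI interface. This is an assumption: the description does not name an operating system, and it attributes the PCI Hotplug function to the VGA driver 542 rather than to sysfs. The function names, the `sysfs_root` parameter, and the device address are hypothetical.

```python
from pathlib import Path


def detach_pci_device(address: str, sysfs_root: Path = Path("/sys")) -> None:
    """Ask the Linux kernel to hot-remove one PCI function.

    `address` is a full PCI address such as "0000:03:00.0" (hypothetical here).
    Writing "1" to the device's `remove` attribute detaches it from the bus.
    """
    remove_attr = sysfs_root / "bus" / "pci" / "devices" / address / "remove"
    remove_attr.write_text("1")


def rescan_pci_bus(sysfs_root: Path = Path("/sys")) -> None:
    """Ask the kernel to rescan all PCI buses so removed devices reappear."""
    (sysfs_root / "bus" / "pci" / "rescan").write_text("1")
```

On a real system both calls need root privileges. Passing a different `sysfs_root` lets the logic be exercised against a scratch directory instead of live hardware.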

ＢＩＯＳ５３１は、ＯＳ５３２と通信することができるため、ＢＭＣ５１１がハングアップした場合、ＢＩＯＳ５３１からＯＳ５３２へＢＭＣ５１１の異常を通知することが可能である。しかし、ＢＩＯＳ５３１は、ＢＭＣ５１１を監視しているわけではなく、ＢＭＣ５１１の状態を認識していない。そこで、ＢＭＣ５１１の異常を検出するために、監視回路５１４が設けられる。 Since the BIOS 531 can communicate with the OS 532, when the BMC 511 hangs up, the BIOS 531 can notify the OS 532 of the BMC 511 abnormality. However, the BIOS 531 does not monitor the BMC 511 and does not recognize the state of the BMC 511 . Therefore, a monitoring circuit 514 is provided to detect an abnormality in the BMC 511 .

監視回路５１４は、ウォッチドッグ機能によりＢＭＣ５１１を監視する。監視回路５１４は、問合せ信号をＢＭＣ５１１へ送信し、所定期間内にＢＭＣ５１１から応答信号を受信したか否かをチェックする。所定期間内に応答信号を受信しない場合、監視回路５１４は、ＢＭＣ５１１に異常が発生したと判定する。これにより、ＢＭＣ５１１の異常を検出することができる。 A monitoring circuit 514 monitors the BMC 511 with a watchdog function. The monitoring circuit 514 transmits an inquiry signal to the BMC 511 and checks whether or not a response signal is received from the BMC 511 within a predetermined period. When the response signal is not received within the predetermined period, the monitoring circuit 514 determines that an abnormality has occurred in the BMC 511 . Thereby, an abnormality of the BMC 511 can be detected.
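The watchdog check can be pictured as a small tick-driven model. This is a sketch only: the monitoring circuit 514 is hardware (for example a CPLD), and the class name, method names, and the use of clock ticks to represent the predetermined period are assumptions made for illustration.

```python
class BmcWatchdog:
    """Tick-driven model of the watchdog in the monitoring circuit (illustrative)."""

    def __init__(self, timeout_ticks: int) -> None:
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks  # the "predetermined period"
        self.waiting = False                # an inquiry is outstanding
        self.elapsed = 0                    # ticks since the inquiry was sent

    def send_inquiry(self) -> None:
        """Send an inquiry signal to the BMC and start timing the response."""
        self.waiting = True
        self.elapsed = 0

    def on_response(self) -> None:
        """Record that the BMC answered the outstanding inquiry."""
        self.waiting = False

    def tick(self) -> bool:
        """Advance one clock tick; return True once the period has expired
        without a response, i.e. a BMC abnormality is detected."""
        if not self.waiting:
            return False
        self.elapsed += 1
        return self.elapsed >= self.timeout_ticks
```

A caller would invoke `send_inquiry()` once per monitoring cycle and treat a `True` result from `tick()` as the trigger for the interrupt request to the BIOS.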

ＢＭＣ５１１の異常が検出された場合、監視回路５１４は、ＢＩＯＳ５３１に対する割り込み要求を、チップセット５１２を介してＣＰＵ５１３へ出力する。ＢＩＯＳ５３１に対する割り込み要求は、第１プログラムに対する第１割り込み要求の一例であり、異常検出信号に対応する。 When an abnormality of the BMC 511 is detected, the monitoring circuit 514 outputs an interrupt request to the BIOS 531 to the CPU 513 via the chipset 512 . The interrupt request to the BIOS 531 is an example of the first interrupt request to the first program and corresponds to the abnormality detection signal.

ＣＰＵ５１３は、監視回路５１４からの割り込み要求に基づいて、ＢＩＯＳ５３１の割り込み処理ルーチン５４１を実行することで、ＯＳ５３２に対する割り込み要求を生成する。ＯＳ５３２に対する割り込み要求は、第２プログラムに対する第２割り込み要求の一例である。 The CPU 513 generates an interrupt request to the OS 532 by executing the interrupt processing routine 541 of the BIOS 531 based on the interrupt request from the monitor circuit 514 . The interrupt request to OS 532 is an example of a second interrupt request to the second program.

ＣＰＵ５１３は、割り込み処理ルーチン５４１からの割り込み要求に基づき、ＯＳ５３２のＶＧＡドライバ５４２を用いて、ＰＣＩ－Ｅバス５２２からＶＧＡ５２１を切り離すことで、ＣＰＵ５１３からＶＧＡ５２１を切り離す。 Based on the interrupt request from the interrupt processing routine 541, the CPU 513 uses the VGA driver 542 of the OS 532 to disconnect the VGA 521 from the PCI-E bus 522, thereby disconnecting the VGA 521 from the CPU 513.

ＣＰＵ５１３からＶＧＡ５２１を切り離す処理は、ＣＰＵ５１３とＶＧＡ５２１との間の通信状態を、通信可能な状態から通信不可の状態に変更する処理を表す。通信可能な状態は、リンクアップに対応し、通信不可の状態は、リンクダウンに対応する。 Disconnecting the VGA 521 from the CPU 513 means changing the communication state between the CPU 513 and the VGA 521 from a communicable state to a non-communicable state. The communicable state corresponds to link-up, and the non-communicable state corresponds to link-down.

監視回路５１４がＢＩＯＳ５３１に対する割り込み要求を出力することで、ＢＭＣ５１１の異常をＢＩＯＳ５３１に通知することができる。また、ＢＩＯＳ５３１の割り込み処理ルーチン５４１がＯＳ５３２に対する割り込み要求を生成することで、ＢＭＣ５１１の異常をＯＳ５３２に通知することができる。 Because the monitoring circuit 514 outputs an interrupt request to the BIOS 531, the BIOS 531 can be notified of an abnormality in the BMC 511. Likewise, because the interrupt processing routine 541 of the BIOS 531 generates an interrupt request to the OS 532, the OS 532 can be notified of the abnormality in the BMC 511.

監視回路５１４は、ＢＩＯＳ５３１に対する割り込み要求を出力した後、ＢＩＯＳ５３１に対する次の割り込み要求を、チップセット５１２を介してＣＰＵ５１３へ出力する。ＢＩＯＳ５３１に対する次の割り込み要求は、第１プログラムに対する第３割り込み要求の一例である。 After outputting the interrupt request to the BIOS 531 , the monitor circuit 514 outputs the next interrupt request to the BIOS 531 to the CPU 513 via the chipset 512 . The next interrupt request to BIOS 531 is an example of the third interrupt request to the first program.

ＣＰＵ５１３は、監視回路５１４からの次の割り込み要求に基づいて、割り込み処理ルーチン５４１を実行することで、ＣＰＵ５１３とＶＧＡ５２１との間の通信状態をチェックする。そして、通信状態が通信不可である場合、ＣＰＵ５１３は、ＢＭＣ５１１をリセットすることで、ＢＭＣ５１１を再起動する。 The CPU 513 checks the communication state between the CPU 513 and the VGA 521 by executing the interrupt processing routine 541 based on the next interrupt request from the monitoring circuit 514 . Then, when the communication state is communication disabled, the CPU 513 restarts the BMC 511 by resetting the BMC 511 .

監視回路５１４がＢＩＯＳ５３１に対する次の割り込み要求を出力することで、ＶＧＡ５２１が切り離されたか否かを割り込み処理ルーチン５４１にチェックさせることができる。そして、ＶＧＡ５２１が切り離されている場合にＢＭＣ５１１をリセットすることで、ＯＳ５３２に影響を与えることなく、ＢＭＣ５１１を再起動することが可能になる。 When the monitor circuit 514 outputs the next interrupt request to the BIOS 531, the interrupt processing routine 541 can check whether the VGA 521 has been disconnected. By resetting the BMC 511 when the VGA 521 is disconnected, the BMC 511 can be restarted without affecting the OS 532 .

図６は、図５のサーバ４０１におけるＢＭＣ５１１の復旧処理の例を示している。復旧処理は、以下のような手順で行われる。 FIG. 6 shows an example of recovery processing of the BMC 511 in the server 401 of FIG. 5. The recovery processing is performed in the following procedure.

（Ｐ１１）監視回路５１４は、定期的にウォッチドッグ機能によりＢＭＣ５１１を監視する。 (P11) The monitoring circuit 514 periodically monitors the BMC 511 by the watchdog function.

（Ｐ１２）ＢＭＣ５１１に異常が発生し、ＢＭＣ５１１がハングアップする。 (P12) An abnormality occurs in the BMC 511 and the BMC 511 hangs up.

（Ｐ１３）所定期間内にＢＭＣ５１１から応答信号を受信しないため、監視回路５１４は、ＢＭＣ５１１に異常が発生したと判定し、ＢＩＯＳ５３１に対する最初の割り込み要求をＣＰＵ５１３へ出力する。その後、監視回路５１４は、定期的に、ＢＩＯＳ５３１に対する割り込み要求をＣＰＵ５１３へ出力する。 (P13) Since no response signal is received from the BMC 511 within the predetermined period, the monitoring circuit 514 determines that an abnormality has occurred in the BMC 511 and outputs the first interrupt request for the BIOS 531 to the CPU 513 . After that, the monitor circuit 514 periodically outputs an interrupt request for the BIOS 531 to the CPU 513 .

（Ｐ１４）ＣＰＵ５１３は、監視回路５１４からの最初の割り込み要求に基づいて割り込み処理ルーチン５４１を実行することで、ＯＳ５３２に対する割り込み要求を生成する。これにより、ＣＰＵ５１３は、ＶＧＡ５２１の切り離しをＯＳ５３２に要求する。 (P14) The CPU 513 generates an interrupt request to the OS 532 by executing the interrupt processing routine 541 based on the first interrupt request from the monitoring circuit 514 . As a result, the CPU 513 requests the OS 532 to disconnect the VGA 521 .

（Ｐ１５）ＣＰＵ５１３は、割り込み処理ルーチン５４１からの割り込み要求に基づき、ＯＳ５３２のＶＧＡドライバ５４２を用いて、ＣＰＵ５１３からＶＧＡ５２１を切り離す。 (P15) The CPU 513 disconnects the VGA 521 from the CPU 513 using the VGA driver 542 of the OS 532 based on the interrupt request from the interrupt processing routine 541 .

（Ｐ１６）ＣＰＵ５１３は、監視回路５１４から定期的に出力される割り込み要求に基づいて、割り込み処理ルーチン５４１を実行することで、ＣＰＵ５１３とＶＧＡ５２１との間の通信がリンクダウンしているか否かをチェックする。 (P16) Based on the interrupt requests periodically output from the monitoring circuit 514, the CPU 513 executes the interrupt processing routine 541 and thereby checks whether the link between the CPU 513 and the VGA 521 is down.

（Ｐ１７）ＣＰＵ５１３とＶＧＡ５２１との間の通信がリンクダウンしている場合、ＣＰＵ５１３は、ＶＧＡ５２１が切り離されたと判定し、ＢＭＣ５１１をリセットすることで、ＢＭＣ５１１を再起動する。そして、ＢＭＣ５１１は、監視回路５１４をリセットすることで、割り込み要求の出力を停止させる。 (P17) When the link between the CPU 513 and the VGA 521 is down, the CPU 513 determines that the VGA 521 has been disconnected and restarts the BMC 511 by resetting it. The BMC 511 then resets the monitoring circuit 514, which stops the output of interrupt requests.

（Ｐ１８）ＢＭＣ５１１は、再起動された後、ＢＩＯＳ５３１に対する割り込み要求をＣＰＵ５１３へ出力する。これにより、ＢＭＣ５１１は、ＶＧＡ５２１の接続をＢＩＯＳ５３１に要求する。 (P18) After being restarted, the BMC 511 outputs an interrupt request for the BIOS 531 to the CPU 513 . As a result, the BMC 511 requests the BIOS 531 to connect the VGA 521 .

（Ｐ１９）ＣＰＵ５１３は、ＢＭＣ５１１からの割り込み要求に基づいて、割り込み処理ルーチン５４１を実行することで、ＯＳ５３２に対する割り込み要求を生成する。これにより、ＣＰＵ５１３は、ＶＧＡ５２１の接続をＯＳ５３２に要求する。 (P19) The CPU 513 generates an interrupt request to the OS 532 by executing the interrupt processing routine 541 based on the interrupt request from the BMC 511 . As a result, the CPU 513 requests the OS 532 to connect the VGA 521 .

（Ｐ２０）ＣＰＵ５１３は、割り込み処理ルーチン５４１からの割り込み要求に基づき、ＯＳ５３２のＶＧＡドライバ５４２を用いて、ＶＧＡ５２１をＰＣＩ－Ｅバス５２２に組み込むことで、ＶＧＡ５２１をＣＰＵ５１３に接続する。 (P20) Based on the interrupt request from the interrupt processing routine 541, the CPU 513 connects the VGA 521 to the CPU 513 by incorporating the VGA 521 into the PCI-E bus 522 using the VGA driver 542 of the OS 532.
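The ordering of steps (P11) to (P20) can be traced with a toy software model. This is a sketch under simplifying assumptions: the real components are hardware, BIOS, and OS code communicating by interrupts, whereas here every handoff is a direct method call, and the OS is assumed to finish detaching the VGA before the next monitor interrupt arrives. All class, attribute, and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerModel:
    """Toy model of the server in FIG. 5; every name here is illustrative."""
    bmc_hung: bool = False
    vga_attached: bool = True       # link between CPU and VGA is up
    monitor_firing: bool = False    # monitoring circuit is raising interrupts
    detach_requested: bool = False  # BIOS has already asked the OS to detach
    log: List[str] = field(default_factory=list)

    def monitor_tick(self) -> None:
        """P11-P13: one watchdog period of the monitoring circuit."""
        if self.bmc_hung:
            self.monitor_firing = True
        if self.monitor_firing:
            self._bios_isr_from_monitor()

    def _bios_isr_from_monitor(self) -> None:
        """BIOS interrupt routine, entered on each monitor interrupt."""
        if not self.detach_requested:
            # P14: first interrupt, so ask the OS to detach the VGA.
            self.detach_requested = True
            self.log.append("P14 BIOS asks OS to detach VGA")
            self._os_detach_vga()
        elif not self.vga_attached:
            # P16-P17: the link is down, so the BMC can be reset.
            self.log.append("P17 BIOS resets BMC")
            self._reset_bmc()
        # Otherwise the link is still up; wait for the next interrupt.

    def _os_detach_vga(self) -> None:
        """P15: the OS removes the VGA from the bus."""
        self.vga_attached = False
        self.log.append("P15 OS detaches VGA")

    def _reset_bmc(self) -> None:
        """P17-P20: restart the BMC, silence the monitor, reattach the VGA."""
        self.bmc_hung = False
        self.monitor_firing = False   # BMC resets the monitoring circuit
        self.detach_requested = False
        self.log.append("P18 BMC asks BIOS to attach VGA")
        self.log.append("P19 BIOS asks OS to attach VGA")
        self.vga_attached = True
        self.log.append("P20 OS attaches VGA")
```

Setting `bmc_hung = True` and calling `monitor_tick()` twice walks the model through detach, reset, and reattach, leaving the ordered trace in `log`.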

（Ｐ１６）において、ＣＰＵ５１３は、ＣＰＵ５１３とＶＧＡ５２１との間の通信がリンクダウンしているか否かを、以下のような手順でチェックすることができる。 At (P16), the CPU 513 can check whether or not the communication link between the CPU 513 and the VGA 521 is down by the following procedure.

（Ｐ２１）ＣＰＵ５１３は、lspci、PCI device info等のコマンド、又はライブラリ関数を用いて、ＰＣＩデバイスの一覧を取得する。ＰＣＩデバイスは、ＰＣＩ－Ｅバス５２２に組み込まれているデバイスを表す。 (P21) The CPU 513 acquires a list of PCI devices using commands such as lspci and PCI device info, or library functions. A PCI device represents a device that is incorporated into the PCI-E bus 522 .

（Ｐ２２）ＣＰＵ５１３は、ＰＣＩデバイスの一覧からＶＧＡ５２１を検索する。 (P22) The CPU 513 searches for the VGA 521 from the list of PCI devices.

（Ｐ２３）ＰＣＩデバイスの一覧にＶＧＡ５２１が含まれている場合、ＣＰＵ５１３は、通信がリンクダウンしていないと判定し、（Ｐ２１）以降の処理を繰り返す。 (P23) If the VGA 521 is included in the list of PCI devices, the CPU 513 determines that the link is not down and repeats the processing from (P21).

（Ｐ２４）ＰＣＩデバイスの一覧にＶＧＡ５２１が含まれていない場合、ＣＰＵ５１３は、通信がリンクダウンしていると判定する。 (P24) If the VGA 521 is not included in the list of PCI devices, the CPU 513 determines that the communication link is down.
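Steps (P21) to (P24) amount to a membership test on the device list. The sketch below assumes a Linux environment where the `lspci` command from pciutils is available and prints one device per line beginning with its bus address; the function names and the example address are hypothetical, and how the real interrupt processing routine 541 obtains the list is not specified beyond "commands or library functions".

```python
import subprocess
from typing import List


def list_pci_devices() -> List[str]:
    """P21: return one text line per PCI device, as printed by `lspci`."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def vga_link_is_down(device_lines: List[str], vga_address: str) -> bool:
    """P22-P24: treat the link as down when no listed device has the VGA's
    bus address (for example "03:00.0") as the first field of its line."""
    return not any(line.split(" ", 1)[0] == vga_address for line in device_lines)
```

For example, `vga_link_is_down(list_pci_devices(), "03:00.0")` would return `True` once the device at that address no longer appears in the list.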

図５のサーバ４０１によれば、監視回路５１４及び割り込み処理ルーチン５４１を設けることで、ＢＭＣ５１１の異常を検出してＯＳ５３２に通知することができ、自動的にＢＭＣ５１１を故障から復旧させることが可能になる。 According to the server 401 of FIG. 5, providing the monitoring circuit 514 and the interrupt processing routine 541 makes it possible to detect an abnormality in the BMC 511 and notify the OS 532 of it, so the BMC 511 can be recovered from the failure automatically.

復旧処理を行う際、作業者がサーバルーム等へ出向いて復旧作業を行う必要がないため、復旧作業開始までの待ち時間が発生しない。また、サーバシステムを停止する必要がなく、ＢＭＣ５１１が活性でリセットされるため、復旧処理の間もサーバシステムの運用を継続することができる。 Because no worker has to go to a server room or the like to perform the recovery work, there is no waiting time before recovery starts. In addition, the server system does not need to be stopped: the BMC 511 is reset while the system stays live (a hot reset), so the server system can continue operating during the recovery processing.

新たなハードウェアとしては監視回路５１４を追加するだけで済むため、既存のサーバシステムにも容易に図６の復旧処理を適用することができる。また、ＢＩＯＳ５３１及びＯＳ５３２の割り込み処理を利用することで、容易に復旧処理を実装することができる。 Because the monitoring circuit 514 is the only new hardware required, the recovery processing of FIG. 6 can easily be applied to existing server systems. In addition, the recovery processing is easy to implement because it uses the interrupt handling of the BIOS 531 and the OS 532.

次に、図７Ａから図１０Ｄまでを参照しながら、図５のサーバ４０１におけるＢＭＣ５１１の復旧処理について、より詳細に説明する。以下では、ＣＰＵ５１３がＯＳ５３２を実行することで行われる処理を、ＯＳ５３２が行う処理として記述することがある。また、ＣＰＵ５１３がＢＩＯＳ５３１の割り込み処理ルーチン５４１を実行することで行われる処理を、割り込み処理ルーチン５４１が行う処理として記述することがある。 Next, restoration processing of the BMC 511 in the server 401 of FIG. 5 will be described in more detail with reference to FIGS. 7A to 10D. Hereinafter, processing performed by the CPU 513 executing the OS 532 may be described as processing performed by the OS 532 . Also, the processing performed by the CPU 513 executing the interrupt processing routine 541 of the BIOS 531 may be described as processing performed by the interrupt processing routine 541 .

図７Ａ～図７Ｄは、ＯＳ５３２が動作している場合の第１の復旧処理の例を示すフローチャートである。まず、ユーザは、サーバ４０１の電源をオンにする（ステップ７０１）。これにより、ＢＭＣ５１１が起動し（ステップ７０２）、サーバ４０１は、ＢＭＣ５１１の起動に失敗したか否かをチェックする（ステップ７０３）。 7A to 7D are flowcharts showing an example of the first restoration process when the OS 532 is running. First, the user turns on the server 401 (step 701). This activates the BMC 511 (step 702), and the server 401 checks whether or not activation of the BMC 511 has failed (step 703).

ＢＭＣ５１１の起動に失敗した場合（ステップ７０３，ＹＥＳ）、サーバ４０１は、アラームを出力して（ステップ７２４）、処理を終了する。一方、ＢＭＣ５１１の起動に成功した場合（ステップ７０３，ＮＯ）、ＢＭＣ５１１は、サーバ４０１の管理を開始する。そして、ＢＭＣ５１１は、監視回路５１４を起動し（ステップ７０４）、監視回路５１４の起動に失敗したか否かをチェックする（ステップ７０５）。 If the activation of the BMC 511 fails (step 703, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the BMC 511 is successfully activated (step 703 , NO), the BMC 511 starts managing the server 401 . The BMC 511 then activates the monitoring circuit 514 (step 704) and checks whether activation of the monitoring circuit 514 has failed (step 705).

監視回路５１４の起動に失敗した場合（ステップ７０５，ＹＥＳ）、サーバ４０１は、アラームを出力して（ステップ７２４）、処理を終了する。一方、監視回路５１４の起動に成功した場合（ステップ７０５，ＮＯ）、監視回路５１４は、ウォッチドッグ機能を開始し（ステップ７０６）、ウォッチドッグ機能の開始に失敗したか否かをチェックする（ステップ７０７）。 If the activation of the monitoring circuit 514 fails (step 705, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the monitoring circuit 514 is successfully activated (step 705, NO), the monitoring circuit 514 starts the watchdog function (step 706) and checks whether starting the watchdog function has failed (step 707).

ウォッチドッグ機能の開始に失敗した場合（ステップ７０７，ＹＥＳ）、サーバ４０１は、アラームを出力して（ステップ７２４）、処理を終了する。一方、ウォッチドッグ機能の開始に成功した場合（ステップ７０７，ＮＯ）、ＣＰＵ５１３は、ＯＳ５３２を起動する（ステップ７０８）。これにより、ＯＳ５３２が起動し（ステップ７３１）、ＣＰＵ５１３は、ＯＳ５３２の起動に失敗したか否かをチェックする（ステップ７３２）。 If the watchdog function fails to start (step 707, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the watchdog function has successfully started (step 707, NO), the CPU 513 activates the OS 532 (step 708). As a result, the OS 532 is activated (step 731), and the CPU 513 checks whether or not activation of the OS 532 has failed (step 732).

ＯＳ５３２の起動に失敗した場合（ステップ７３２，ＹＥＳ）、ＣＰＵ５１３は、アラームを出力して（ステップ７３６）、処理を終了する。一方、ＯＳ５３２の起動に成功した場合（ステップ７３２，ＮＯ）、ＣＰＵ５１３は、ＯＳ５３２を実行する。 If the OS 532 fails to boot (step 732, YES), the CPU 513 outputs an alarm (step 736) and terminates the process. On the other hand, if the OS 532 has successfully started (step 732 , NO), the CPU 513 executes the OS 532 .
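The start-up portion of the flowchart (steps 701 to 708 and 731 to 732) follows one pattern: start a stage, check for failure, and either raise an alarm and stop or move on to the next stage. A minimal sketch of that pattern follows; the function name, the `Stage` type, and the representation of each stage as a callable returning success are assumptions for illustration, not part of the described firmware.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a display name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_startup(stages: List[Stage], alarm: Callable[[str], None]) -> Optional[str]:
    """Start each stage in order. On the first failure, raise the alarm for
    that stage and stop. Return the failed stage's name, or None if all
    stages started successfully."""
    for name, start in stages:
        if not start():
            alarm(name)
            return name
    return None
```

In the flow described above the stages would be, in order, the BMC, the monitoring circuit, the watchdog function, and the OS.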

次に、監視回路５１４は、問合せ信号をＢＭＣ５１１へ送信し（ステップ７０９）、所定期間内にＢＭＣ５１１から応答信号を受信したか否かをチェックする（ステップ７１０）。所定期間内に応答信号を受信した場合（ステップ７１０，ＹＥＳ）、監視回路５１４は、ステップ７０９以降の処理を繰り返す。 Next, monitor circuit 514 transmits an inquiry signal to BMC 511 (step 709), and checks whether or not a response signal is received from BMC 511 within a predetermined period (step 710). If the response signal is received within the predetermined period (step 710, YES), the monitoring circuit 514 repeats the processing from step 709 onwards.

一方、所定期間内に応答信号を受信しない場合（ステップ７１０，ＮＯ）、監視回路５１４は、ＢＩＯＳ５３１に対する割り込み要求をＣＰＵ５１３へ出力する（ステップ７１１）。ＢＩＯＳ５３１に対する割り込み要求により起動された割り込み処理ルーチン５４１は、ＯＳ５３２に対する割り込み要求を出力する（ステップ７１２）。 On the other hand, if the response signal is not received within the predetermined period (step 710, NO), the monitor circuit 514 outputs an interrupt request for the BIOS 531 to the CPU 513 (step 711). The interrupt processing routine 541 activated by the interrupt request to the BIOS 531 outputs an interrupt request to the OS 532 (step 712).

The OS 532 accepts the interrupt request for the OS 532 (step 733) and uses the VGA driver 542 to disconnect the VGA 521 from the PCI-E bus 522 (step 734).
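The chain of steps 709 to 712 and 733 to 734 can be illustrated with a minimal sketch. It is not the embodiment's implementation: the class names `Os` and `Bios`, the function `watchdog`, the device labels, and the `max_polls` bound are all hypothetical, and a callback that returns `False` stands in for "no response signal within the predetermined period".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Os:
    """Hypothetical stand-in for the OS 532; tracks devices on the PCI-E bus."""
    bus_devices: List[str] = field(default_factory=lambda: ["VGA521", "NIC"])

    def on_interrupt(self) -> None:
        # Steps 733-734: accept the request and detach the VGA via its driver.
        if "VGA521" in self.bus_devices:
            self.bus_devices.remove("VGA521")


@dataclass
class Bios:
    """Hypothetical stand-in for the BIOS 531 and interrupt processing routine 541."""
    os: Os

    def on_interrupt(self) -> None:
        # Step 712: forward the event as an interrupt request for the OS.
        self.os.on_interrupt()


def watchdog(bmc_responds: Callable[[], bool], bios: Bios, max_polls: int) -> int:
    """Poll the BMC (steps 709-710); on a missed response, interrupt the BIOS (step 711).

    Returns the 1-based poll on which the failure was detected, or 0 if the
    BMC answered every poll. max_polls only bounds this illustration.
    """
    for poll in range(1, max_polls + 1):
        if not bmc_responds():
            bios.on_interrupt()
            return poll
    return 0


if __name__ == "__main__":
    responses = iter([True, True, False])  # the BMC hangs on the third inquiry
    os_model = Os()
    detected_at = watchdog(lambda: next(responses), Bios(os_model), max_polls=10)
    print(detected_at, os_model.bus_devices)
```

In this sketch the watchdog never touches the OS directly; it only signals the BIOS-level routine, which mirrors the two-stage interrupt described above.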

Next, the interrupt processing routine 541 is invoked by interrupt requests that the monitoring circuit 514 outputs periodically. Once invoked, the interrupt processing routine 541 starts a timer and, for a fixed period, repeatedly determines whether the communication link between the CPU 513 and the VGA 521 is down.

In the link-down determination, the interrupt processing routine 541 obtains a list of PCI devices using a command or a library function (step 713) and checks whether a timeout has occurred (step 714).

If no timeout has occurred (step 714, NO), the interrupt processing routine 541 searches the list of PCI devices for the VGA 521 (step 715). If the VGA 521 is included in the list of PCI devices (step 715, YES), the interrupt processing routine 541 repeats the processing from step 713 onward.

On the other hand, if the VGA 521 is not included in the list of PCI devices (step 715, NO), the interrupt processing routine 541 determines that the communication link between the CPU 513 and the VGA 521 is down and resets the BMC 511 (step 716). The server 401 then checks whether the reset of the BMC 511 has failed (step 717).

If the reset of the BMC 511 fails (step 717, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the reset of the BMC 511 succeeds (step 717, NO), the BMC 511 resets the monitoring circuit 514 (step 718) and checks whether the reset of the monitoring circuit 514 has failed (step 719).

If the reset of the monitoring circuit 514 fails (step 719, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the reset of the monitoring circuit 514 succeeds (step 719, NO), the monitoring circuit 514 starts the watchdog function (step 720) and checks whether the start of the watchdog function has failed (step 721).

If the watchdog function fails to start (step 721, YES), the server 401 outputs an alarm (step 724) and terminates the process. On the other hand, if the watchdog function starts successfully (step 721, NO), the BMC 511 outputs an interrupt request for the BIOS 531 to the CPU 513 (step 722). The interrupt processing routine 541, invoked by the interrupt request for the BIOS 531, outputs an interrupt request for the OS 532 (step 723).

Based on the interrupt request for the OS 532, the OS 532 uses the VGA driver 542 to incorporate the VGA 521 into the PCI-E bus 522 (step 735).
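The reset sequence of steps 716 to 723 follows one pattern: perform a step, check whether it failed, and either raise the alarm of step 724 or continue. The sketch below illustrates only that pattern. The function `run_recovery`, the step labels, and the string return values are hypothetical and are not taken from the embodiment.

```python
from typing import Callable, List, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_recovery(steps: List[Step], reattach_vga: Callable[[], None]) -> str:
    """Run each checked step in order; stop with an alarm on the first failure."""
    for name, action in steps:
        if not action():
            # Corresponds to the alarm output of step 724.
            return f"ALARM: {name} failed"
    # Corresponds to steps 722-723 and 735: interrupt chain, then VGA re-incorporation.
    reattach_vga()
    return "recovered"


if __name__ == "__main__":
    steps = [
        ("reset BMC", lambda: True),                 # steps 716-717
        ("reset monitoring circuit", lambda: True),  # steps 718-719
        ("start watchdog", lambda: True),            # steps 720-721
    ]
    print(run_recovery(steps, lambda: print("VGA re-incorporated")))
```

Keeping the VGA re-incorporation outside the checked list reflects the ordering in the text: the image processing part is only brought back after the BMC, the monitoring circuit, and the watchdog function have all restarted.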

If a timeout occurs in the link-down determination (step 714, YES), the server 401 forcibly resets the BMC 511 by performing the processing from step 716 onward.
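The link-down determination of steps 713 to 715, including the timeout branch, can be sketched as a polling loop. This is an illustration under stated assumptions: `wait_for_link_down`, its parameters, the injectable `clock`, and the device label `"VGA521"` are hypothetical, and the `list_pci_devices` callback stands in for whatever command or library function the routine actually uses.

```python
import time
from typing import Callable, List


def wait_for_link_down(
    list_pci_devices: Callable[[], List[str]],
    target: str,
    timeout_s: float,
    poll_interval_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll the PCI device list until the target disappears or the timer expires.

    Returns "link_down" when the target is no longer listed (step 715, NO) and
    "timeout" when the timer expires first (step 714, YES). In the described
    flow, both outcomes lead to the BMC reset of step 716.
    """
    deadline = clock() + timeout_s
    while True:
        devices = list_pci_devices()   # step 713: obtain the PCI device list
        if clock() >= deadline:        # step 714: timeout check
            return "timeout"
        if target not in devices:      # step 715: search for the VGA
            return "link_down"
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # The VGA disappears from the list on the third poll.
    states = iter([["VGA521", "NIC"], ["VGA521", "NIC"], ["NIC"]])
    print(wait_for_link_down(lambda: next(states), "VGA521", timeout_s=60))
```

The order of the checks inside the loop follows the flowchart (list, then timeout, then search), so a timeout takes precedence over a link-down observed on the same iteration.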

FIGS. 8A and 8B are flowcharts showing an example of the second recovery process, which applies when the OS 532 is stopped. The processing of steps 801 to 823 is the same as the processing of steps 701 to 707 and steps 709 to 715 in FIG. 7A and steps 716 to 724 in FIG. 7B.

Unlike the first recovery process, when the watchdog function starts successfully (step 807, NO), the OS 532 is not booted and remains stopped. Therefore, when the interrupt processing routine 541 outputs an interrupt request for the OS 532 in step 811, the interrupt request is held pending. When the OS 532 is later booted, the pending interrupt request is cleared.

Similarly, when the interrupt processing routine 541 outputs an interrupt request for the OS 532 in step 822, the interrupt request is held pending. When the OS 532 is later booted, the pending interrupt request is cleared.

Because the OS 532 is stopped, the VGA 521 is not disconnected, and a timeout occurs in the link-down determination (step 813, YES). The processing from step 815 onward is therefore performed, and the BMC 511 is forcibly reset.
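The handling of interrupt requests while the OS is stopped can be modeled in a few lines. The sketch is a simplification under stated assumptions: the class `OsModel`, its fields, and the request label `"detach_vga"` are hypothetical, and real interrupt delivery is considerably more involved than a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OsModel:
    """Hypothetical model of the OS 532 with respect to interrupt requests."""
    running: bool = False
    pending: List[str] = field(default_factory=list)
    handled: List[str] = field(default_factory=list)

    def interrupt(self, request: str) -> None:
        if self.running:
            self.handled.append(request)
        else:
            # The OS is stopped, so the request is held pending.
            self.pending.append(request)

    def boot(self) -> None:
        # Pending requests are cleared at boot; they are not replayed.
        self.pending.clear()
        self.running = True


if __name__ == "__main__":
    os_model = OsModel()
    os_model.interrupt("detach_vga")   # as in step 811: the OS is stopped
    print(os_model.pending, os_model.handled)
    os_model.boot()
    print(os_model.pending, os_model.handled)
```

Because the detach request is never handled in this model, the VGA stays on the bus, which is why the link-down determination can only end through its timeout.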

FIGS. 9A to 9D are flowcharts showing an example of the third recovery process, which applies when there is no response from the OS 532. The processing of steps 901 to 924 is the same as the processing of steps 701 to 715 in FIG. 7A and steps 716 to 724 in FIG. 7B. The processing of steps 931, 932, and 934 is the same as the processing of steps 731 and 732 in FIG. 7C and step 736 in FIG. 7D.

Unlike the first recovery process, after the OS 532 boots successfully (step 932, NO), the booted OS 532 hangs up (step 933) and becomes unresponsive. Therefore, when the interrupt processing routine 541 outputs an interrupt request for the OS 532 in step 912, the interrupt request is held pending. The next time the OS 532 is booted, the pending interrupt request is cleared.

Similarly, when the interrupt processing routine 541 outputs an interrupt request for the OS 532 in step 923, the interrupt request is held pending. The next time the OS 532 is booted, the pending interrupt request is cleared.

Because the OS 532 has hung up, the VGA 521 is not disconnected, and a timeout occurs in the link-down determination (step 914, YES). The processing from step 916 onward is therefore performed, and the BMC 511 is forcibly reset.

FIGS. 10A to 10D are flowcharts showing an example of the fourth recovery process, which applies when disconnection of the VGA 521 fails. The processing of steps 1001 to 1024 is the same as the processing of steps 701 to 715 in FIG. 7A and steps 716 to 724 in FIG. 7B. The processing of steps 1031 to 1036 is the same as the processing of steps 731 to 734 in FIG. 7C and steps 735 and 736 in FIG. 7D.

Unlike the first recovery process, after accepting the interrupt request for the OS 532 (step 1033), the OS 532 attempts to disconnect the VGA 521 from the PCI-E bus 522 using the VGA driver 542, but the disconnection fails (step 1034). As a result, a timeout occurs in the link-down determination (step 1014, YES), the processing from step 1016 onward is performed, and the BMC 511 is forcibly reset.

After that, when the interrupt processing routine 541 outputs an interrupt request for the OS 532 (step 1023), the OS 532 attempts to incorporate the VGA 521 into the PCI-E bus 522 using the VGA driver 542. However, because the VGA 521 was never disconnected, the OS 532 fails to incorporate the VGA 521 (step 1035).

The configuration of the information processing device 201 in FIG. 2 is merely an example, and some components may be omitted or changed according to the use or conditions of the information processing device 201. The configuration of the server system in FIG. 4 is merely an example, and some components may be omitted or changed according to the use or conditions of the server system. For example, the server system may include a plurality of servers and may include other devices such as tape devices.

The configurations of the server 101 in FIG. 1 and the server 401 in FIG. 5 are merely examples, and some components may be omitted or changed according to the configuration or conditions of the server system. For example, if the server 401 in FIG. 5 does not communicate with the RAID device 402, the communication circuit 516 can be omitted.

The flowcharts of FIG. 3 and FIGS. 7A to 10D are merely examples, and some of the processing may be omitted or changed according to the configuration or conditions of the server system. The recovery process for the BMC 511 shown in FIG. 6 is merely an example, and some of the processing may be omitted or changed according to the configuration or conditions of the server system.

Although the disclosed embodiments and their advantages have been described in detail, those skilled in the art could make various changes, additions, and omissions without departing from the scope of the invention as clearly set forth in the claims.

The following appendices are further disclosed with respect to the embodiments described with reference to FIGS. 1 to 10D.
(Appendix 1)
An information processing device that performs information processing, comprising:
a management unit that includes an image processing unit that performs image processing and that manages the information processing device;
a monitoring unit that monitors the management unit and outputs an abnormality detection signal when an abnormality in the management unit is detected; and
an information processing unit that performs the information processing and, when the abnormality detection signal is output from the monitoring unit, disconnects the image processing unit based on the abnormality detection signal and restarts the management unit.
(Appendix 2)
The information processing device according to Appendix 1, wherein
the information processing unit executes a first program and a second program,
the monitoring unit outputs a first interrupt request for the first program to the information processing unit as the abnormality detection signal, and
the information processing unit generates a second interrupt request for the second program based on the first interrupt request and disconnects the image processing unit based on the second interrupt request.
(Appendix 3)
The information processing device according to Appendix 2, wherein
the monitoring unit, after outputting the first interrupt request, outputs a third interrupt request for the first program to the information processing unit, and
the information processing unit checks a communication state between the information processing unit and the image processing unit based on the third interrupt request and restarts the management unit when the communication state indicates that communication is impossible.
(Appendix 4)
The information processing device according to Appendix 2 or 3, wherein the first program is a basic input/output system and the second program is an operating system.
(Appendix 5)
The information processing device according to any one of Appendices 1 to 4, wherein the monitoring unit transmits an inquiry signal to the management unit and, when no response signal is received from the management unit within a predetermined period, determines that an abnormality has occurred in the management unit and outputs the abnormality detection signal.
(Appendix 6)
A recovery method in which a processor executes a process comprising:
performing information processing;
when a management unit, which includes an image processing unit that performs image processing and which manages an information processing device, is monitored by a monitoring unit, and the monitoring unit detects an abnormality in the management unit and outputs an abnormality detection signal, disconnecting the image processing unit based on the abnormality detection signal; and
restarting the management unit.
(Appendix 7)
The recovery method according to Appendix 6, wherein
the performing of the information processing includes executing a first program and a second program,
the monitoring unit outputs a first interrupt request for the first program as the abnormality detection signal, and
the disconnecting of the image processing unit includes:
generating a second interrupt request for the second program based on the first interrupt request; and
disconnecting the image processing unit based on the second interrupt request.
(Appendix 8)
The recovery method according to Appendix 7, wherein
the monitoring unit, after outputting the first interrupt request, outputs a third interrupt request for the first program,
the processor further executes a process of checking a communication state between the processor and the image processing unit based on the third interrupt request, and
the restarting of the management unit includes restarting the management unit when the communication state indicates that communication is impossible.
(Appendix 9)
The recovery method according to Appendix 7 or 8, wherein the first program is a basic input/output system and the second program is an operating system.
(Appendix 10)
The recovery method according to any one of Appendices 6 to 9, wherein the monitoring unit transmits an inquiry signal to the management unit and, when no response signal is received from the management unit within a predetermined period, determines that an abnormality has occurred in the management unit and outputs the abnormality detection signal.

101, 401 Server
111, 511 BMC
112, 512 Chipset
113, 513 CPU
121, 521 VGA
122, 522 PCI-E bus
131, 531 BIOS
132, 532 OS
141, 542 VGA driver
201 Information processing device
211 Management unit
212 Monitoring unit
213 Information processing unit
221 Image processing unit
402 RAID device
514 Monitoring circuit
515 Memory
516 Communication circuit
541 Interrupt processing routine

Claims

1. An information processing device that performs information processing, comprising:
a management unit that includes an image processing unit that performs image processing and that manages the information processing device;
a monitoring unit that monitors the management unit and outputs an abnormality detection signal when an abnormality in the management unit is detected; and
an information processing unit that performs the information processing and, when the abnormality detection signal is output from the monitoring unit, disconnects the image processing unit based on the abnormality detection signal and restarts the management unit.

2. The information processing device according to claim 1, wherein
the information processing unit executes a first program and a second program,
the monitoring unit outputs a first interrupt request for the first program to the information processing unit as the abnormality detection signal, and
the information processing unit generates a second interrupt request for the second program based on the first interrupt request and disconnects the image processing unit based on the second interrupt request.

3. The information processing device according to claim 2, wherein
the monitoring unit, after outputting the first interrupt request, outputs a third interrupt request for the first program to the information processing unit, and
the information processing unit checks a communication state between the information processing unit and the image processing unit based on the third interrupt request and restarts the management unit when the communication state indicates that communication is impossible.

4. The information processing device according to claim 2, wherein the first program is a basic input/output system and the second program is an operating system.

5. The information processing device according to any one of claims 1 to 4, wherein the monitoring unit transmits an inquiry signal to the management unit and, when no response signal is received from the management unit within a predetermined period, determines that an abnormality has occurred in the management unit and outputs the abnormality detection signal.

6. A recovery method in which a processor executes a process comprising:
performing information processing;
when a management unit, which includes an image processing unit that performs image processing and which manages an information processing device, is monitored by a monitoring unit, and the monitoring unit detects an abnormality in the management unit and outputs an abnormality detection signal, disconnecting the image processing unit based on the abnormality detection signal; and
restarting the management unit.