JP5689783B2

JP5689783B2 - Computer, computer system, and failure information management method

Info

Publication number: JP5689783B2
Application number: JP2011256512A
Authority: JP
Inventors: 和哉長澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2011-11-24
Filing date: 2011-11-24
Publication date: 2015-03-25
Anticipated expiration: 2031-11-24
Also published as: JP2013109722A

Description

Embodiments of the present invention relate to a computer, a computer system, and a failure information management method for reliably recording hardware failures that occur in a computer.

A system management controller for managing failures that occur in a computer is commonly provided on the motherboard. When a failure occurs, the system management controller stores failure information describing the failure in a nonvolatile memory. An administrator can later analyze this failure information to identify the cause of the failure.

JP 2011-48534 A
JP H10-247911 A
JP 2009-252006 A

If the management controller or the nonvolatile memory fails, failure information cannot be recorded in the nonvolatile memory and therefore cannot be managed. It is thus desirable to be able to manage failure information even in that situation, so that failures can be analyzed later.

An object of the present invention is to provide a computer, a computer system, and a failure information management method capable of managing failure information even when the system management controller cannot manage it.

According to an embodiment, a computer system includes a first computer connected to a network and a management computer. The first computer comprises: a first storage unit; first issuing means for issuing a first interrupt notification when a hardware failure occurs in the first computer; generating means for collecting the content of the hardware failure in response to the issuance of the first interrupt notification and generating first failure information based on the collected content; second issuing means for issuing a second interrupt instruction when the first failure information is generated; a system management controller that, when the second interrupt instruction is issued, acquires the first failure information and records it in the first storage unit; and notification means for notifying the management computer of the first failure information when the system management controller could not record the first failure information in the first storage unit. The management computer comprises a second storage unit and system management means for writing the first failure information to the second storage unit when the first failure information is notified.

FIG. 1 is a block diagram showing an example of the configuration of the computer system of the embodiment.
FIG. 2 is a block diagram showing an example of the configuration of the server computer of the embodiment.
FIG. 3 is a block diagram showing an example of the configuration of the SMI handler shown in FIG. 2.
FIG. 4 is a block diagram showing an example of the configuration of the server computer of the embodiment.
FIG. 5 is a block diagram showing an example of the configuration of the management server computer of the embodiment.
FIG. 6 is a flowchart showing the procedure of the processing executed by the SMI handler.
FIG. 7 is a flowchart showing the procedure of the processing executed by the BMC proxy program.

Embodiments will be described below with reference to the drawings.

FIG. 1 is a block diagram illustrating the configuration of a computer system according to an embodiment.
As shown in FIG. 1, the computer system includes a management server computer 10, a first server computer 20A, a second server computer 20B, and the like, all connected to a LAN (Local Area Network).

The configuration of the server computer 20 (the first server computer 20A and the second server computer 20B) will be described with reference to FIG. 2.
The server computer 20 includes a first NIC (Network Interface Card) 21, a second NIC 22, a network controller 23, a BMC (Baseboard Management Controller) 24 serving as the system management controller, a nonvolatile memory (NVRAM) 25, a flash ROM 26, and the like.

The network controller 23 provides functions corresponding to the data link layer of the OSI reference model. The first NIC 21 and the second NIC 22 are, for example, chips corresponding to the physical layer of the OSI reference model. The second NIC 22 is provided in the BMC 24, described later. The first NIC 21 is used by application programs and the like executed by the server computer 20.

The BMC 24 constantly monitors the hardware using sensors provided in the server computer 20. When a hardware failure occurs, it writes the content of the failure to the SEL (System Event Log) 251 in the NVRAM 25. The BMC 24 also notifies a preconfigured administrator terminal 30 of the content of the failure, for example by sending an e-mail to the administrator's address or by sending a message containing the content of the failure via SNMP (Simple Network Management Protocol).

The BMC 24 is a special-purpose microcontroller placed on the motherboard of the computer (server) and based on the IPMI (Intelligent Platform Management Interface) architecture. The BMC 24 operates as long as power is supplied, even if the CPU (OS) is not running. Sensors of various types built into the computer (not shown) report parameters such as temperature, cooling fan speed, power supply state, and OS state to the BMC 24. The BMC 24 monitors the sensors, and when any parameter falls outside its allowable range, it notifies the administrator terminal 30 via the network of a possible system malfunction.
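The range check described above can be illustrated with a minimal Python sketch. The `SensorLimit` structure, the sensor names, and the numeric limits are hypothetical and chosen only for illustration; the embodiment does not specify how the BMC 24 stores or compares its limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorLimit:
    """Allowable range for one sensor (hypothetical structure)."""
    name: str
    low: float
    high: float


def out_of_range(limits, readings):
    """Return the names of sensors whose reading lies outside its allowable range."""
    by_name = {limit.name: limit for limit in limits}
    return [
        name
        for name, value in readings.items()
        if name in by_name and not (by_name[name].low <= value <= by_name[name].high)
    ]


# Hypothetical limits and one polling result.
limits = [SensorLimit("cpu_temp_c", 5.0, 85.0), SensorLimit("fan_rpm", 1000.0, 9000.0)]
alerts = out_of_range(limits, {"cpu_temp_c": 91.5, "fan_rpm": 4200.0})
```

Here `alerts` lists the sensors that would prompt a notification to the administrator terminal 30; readings for sensors without a configured limit are ignored.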

The NVRAM 25 stores an SEL (System Event Log) 251, SDR (Sensor Data Records) 252, and PEF (Platform Event Filtering) 253. The NVRAM 25 is a serial-bus-connected EEPROM (Electrically Erasable and Programmable Read Only Memory) or flash memory. The content of a failure is recorded in the SEL 251 when, for example, the BMC 24 detects an abnormality in the information processing apparatus or a sensor detects an error exceeding a threshold. The SDR 252 holds, recorded at the time of manufacture, the types of sensors managed by the BMC 24 (temperature, voltage, and so on) and the thresholds used to identify abnormalities. The PEF 253 holds the settings that specify which types of failure are to be reported to the administrator terminal 30 when they occur.

A BIOS (Basic Input Output System) 261 is stored in the flash ROM 26. The BIOS 261 is a system program for hardware control executed by the CPU. The BIOS 261 includes an SMI handler 262 that the CPU executes when an SMI (System Management Interrupt) event is issued.

As shown in FIG. 3, the SMI handler 262 includes programs such as a failure information generation module 2621, a failure information storage instruction module 2622, and a failure information transmission module 2623.

The failure information generation module 2621 generates failure information based on the content of a hardware failure when one occurs in the server computer. When the failure information has been generated, the failure information storage instruction module 2622 sends the BMC 24 an instruction signal directing it to store the failure information in the NVRAM 25. When storing the failure information in the NVRAM 25 fails, the failure information transmission module 2623 sends the failure information to the management server computer 10 so that the management server computer 10 can manage it.

A more detailed system configuration of the server computer 20 will be described with reference to FIG. 4.
As shown in FIG. 4, the computer 20 includes a CPU 101, a north bridge 102, a main memory 103, a south bridge 104, a graphics processing unit (GPU) 105, a video memory (VRAM) 105A, a sound controller 106, the flash ROM 26, the network controller 23, the BMC 24, the NVRAM 25, a hard disk drive (HDD) 111, PCI devices 115, and the like.

The CPU 101 is a processor that controls the operation of the computer 20. The CPU 101 executes the operating system and various application programs loaded from the hard disk drive (HDD) 111 into the main memory 103. The CPU 101 also executes the BIOS (Basic Input Output System) 261 stored in the flash ROM 26. The BIOS 261 is a program for hardware control.

The north bridge 102 is a bridge device that connects the local bus of the CPU 101 to the south bridge 104. The north bridge 102 also contains a memory controller that controls access to the main memory 103, and it can communicate with the GPU 105 via, for example, a serial bus conforming to the PCI EXPRESS standard.

The GPU 105 is a display controller that controls the display monitor of the computer 20. The GPU 105 uses the VRAM 105A as working memory. The video signal generated by the GPU 105 is sent to the display monitor.

The south bridge 104 controls the devices on the LPC (Low Pin Count) bus and the devices 115A and 115B on the PCI (Peripheral Component Interconnect) bus. The south bridge 104 also contains an IDE (Integrated Drive Electronics) controller for controlling the hard disk drive (HDD) 111 and a DVD drive 112, and it can communicate with the sound controller 106.

In addition, the south bridge 104, serving as the first issuing means, contains a circuit that issues an SMI (System Management Interrupt) event to the CPU when it detects a PERR (parity error) signal or an SERR (system error) signal on the PCI bus 114.

The sound controller 106 is a sound source device that outputs the audio data to be played back to speakers 18A and 18B.

The sensor 241 monitors, among other things, the issuance of system error (SERR) and parity error (PERR) signals from the south bridge 104 and the temperature of the CPU. The BMC 24 polls the sensor 241 at predetermined intervals.

Next, the configuration of the software executed by the CPU of the management server computer 10 will be described with reference to FIG. 5. The management server computer 10 runs a BMC proxy program 501. Within the BMC proxy program 501, a BMC manager 502, a first virtual BMC 503A, and a second virtual BMC 503B run. The management server computer 10 also has a storage device 510.

The first virtual BMC 503A performs the functions of the BMC in the first server computer 20A. The second virtual BMC 503B performs the functions of the BMC in the second server computer 20B. The BMC manager 502 forwards failure information to the virtual BMC corresponding to the server that sent it.

During normal operation, the BMC manager 502 polls the BMC 24 of each server computer, via that server's second NIC 22, to obtain the SDR 252, the PEF 253, and the configuration information of the BMC-dedicated LAN port (such as the trap destination).

The BMC manager 502 records the acquired SDR and PEF in the folder (511A or 511B) in the storage device 510 that is associated with the virtual BMC corresponding to the server computer from which they were acquired. The BMC manager 502 also sets the configuration information of the BMC-dedicated LAN port in the corresponding virtual BMC.
The BMC manager 502 registers the management server computer 10 in the SMI handler 262 in advance as the destination for SMI events when the BMC has failed.
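The bookkeeping performed by the BMC manager 502 during normal operation might look like the following Python sketch. The `fetch` callable, the dictionary keys, and the sample values are hypothetical stand-ins; how the BMC is actually queried over its dedicated LAN port is not modeled here.

```python
def mirror_bmc_settings(server_ids, fetch):
    """Copy each server's SDR, PEF, and LAN settings into per-server stores.

    fetch(server_id) is a hypothetical callable returning a dict with the
    keys "sdr", "pef", and "lan". It stands in for polling the BMC.
    """
    folders = {}          # per-server folder contents (SDR and PEF)
    virtual_bmc_lan = {}  # LAN port settings applied to each virtual BMC
    for server_id in server_ids:
        settings = fetch(server_id)
        folders[server_id] = {"sdr": settings["sdr"], "pef": settings["pef"]}
        virtual_bmc_lan[server_id] = settings["lan"]
    return folders, virtual_bmc_lan


# Hypothetical polling results for the two server computers.
sample = {
    "20A": {"sdr": {"cpu_temp_c": 85}, "pef": ["SERR"], "lan": {"trap_dest": "terminal-30"}},
    "20B": {"sdr": {"cpu_temp_c": 80}, "pef": [], "lan": {"trap_dest": "terminal-30"}},
}
folders, lan = mirror_bmc_settings(sample, sample.__getitem__)
```

Keeping a copy of the PEF and the trap destination on the management side is what later allows a virtual BMC to decide on notifications without contacting the failed BMC.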

The following describes the operation when a hardware failure occurs in the server computer and the failure information is written to the NVRAM normally.

The sensor 241 detects an SERR or PERR signal issued by a PCI device on the PCI bus 114. On detecting such a signal, the sensor 241 instructs the south bridge 104 to output an SMI signal to the CPU 101. The south bridge 104 outputs the SMI signal to the CPU 101 as instructed, and the SMI handler 262 starts in response to the SMI signal.

The failure information generation module 2621 in the SMI handler 262, started in response to the SMI signal, detects information indicating which device on the PCI bus 114 output the SERR or PERR signal.

Based on the detected information, the failure information generation module 2621 generates first failure information that includes the type of error (SERR or PERR signal) and the bus number, function number, and device number of the device that issued or detected the error.
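As a minimal sketch, the fields of the first failure information can be pictured as the following Python record. The class and field names and the `address` helper are assumptions for illustration; the embodiment names the fields but does not define a layout or encoding.

```python
from dataclasses import dataclass
from enum import Enum


class PciError(Enum):
    SERR = "SERR"  # system error signal
    PERR = "PERR"  # parity error signal


@dataclass(frozen=True)
class FirstFailureInfo:
    """Error type plus the location of the device involved (hypothetical layout)."""
    error: PciError
    bus: int
    device: int
    function: int

    def address(self) -> str:
        """Format the location in the common bus:device.function notation."""
        return f"{self.bus:02x}:{self.device:02x}.{self.function:x}"


info = FirstFailureInfo(PciError.PERR, bus=3, device=0x1F, function=2)
```

A record like this carries only what the SMI handler can determine on its own; the serial number, sensor type, and time are added later by whichever component writes the log entry.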

When the first failure information has been generated, the failure information storage instruction module 2622 sends the BMC 24 an instruction signal directing it to store the first failure information in the NVRAM 25.

On receiving the instruction signal, the BMC 24 acquires the first failure information from the SMI handler 262. It then records in the SEL 251, as an error event, second failure information obtained by adding supplementary information such as the event serial number, the sensor type, and the time to the first failure information.

If the failure that occurred is registered in the PEF 253, the BMC 24 sends a trap from the second NIC 22 to the preconfigured administrator terminal 30.
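The recording and notification steps performed by the BMC 24 can be sketched in Python as follows. The `Bmc` and `SelEntry` classes, the use of plain dictionaries for the first failure information, and the `traps` list standing in for trap transmission are all illustrative assumptions, not the actual firmware interface.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SelEntry:
    """Second failure information: first failure information plus additions."""
    serial: int
    sensor_type: str
    timestamp: float
    first_info: dict


@dataclass
class Bmc:
    pef: set                                   # failure kinds that should raise a trap
    sel: list = field(default_factory=list)    # stands in for the SEL 251
    traps: list = field(default_factory=list)  # stands in for traps sent via the second NIC

    def record(self, first_info, sensor_type, now=None):
        entry = SelEntry(
            serial=len(self.sel) + 1,          # event serial number
            sensor_type=sensor_type,
            timestamp=time.time() if now is None else now,
            first_info=first_info,
        )
        self.sel.append(entry)
        if first_info["error"] in self.pef:    # notify only if the PEF lists this failure
            self.traps.append(entry)
        return entry


bmc = Bmc(pef={"SERR"})
first = bmc.record({"error": "PERR", "bus": 3}, "pci", now=100.0)
second = bmc.record({"error": "SERR", "bus": 3}, "pci", now=101.0)
```

Both events are logged, but only the SERR event produces a trap because only that failure kind appears in the hypothetical PEF setting.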

This completes the description of the operation when a failure occurs in the hardware of the server computer while neither the BMC nor the NVRAM has failed.

The operation of the SMI handler when an SMI event is issued while the BMC or the NVRAM has failed will be described with reference to the flowchart of FIG. 6.

The sensor 241 detects an SERR or PERR signal issued by a PCI device on the PCI bus 114. On detecting such a signal, the sensor 241 instructs the south bridge 104 to output an SMI signal to the CPU 101. The south bridge 104 outputs the SMI signal to the CPU 101 as instructed. In response to the SMI signal (step B601), the SMI handler 262 starts.

The failure information generation module 2621 in the SMI handler 262, started in response to the SMI signal, detects which device on the PCI bus 114 output the SERR or PERR signal (step B602).

Based on the collected information, the failure information generation module 2621 generates first failure information that includes the type of error (SERR or PERR signal) and the bus number, function number, and device number of the device that issued or detected the error (step B603).

When the first failure information has been generated, the failure information storage instruction module 2622 issues an SMI event instructing the BMC 24 to record the first failure information (step B604).

If the NVRAM 25 has failed, the BMC issues to the SMI handler an error notification indicating that the first failure information could not be recorded in the SEL 251. If the BMC itself has failed, no response arrives: when the BMC has not responded within a fixed time after the SMI handler instructed it to record the first failure information, the SMI handler determines that the failure information was not written to the SEL (step B605).

If the failure information was not written to the SEL 251, the failure information transmission module 2623 of the SMI handler 262 sends a message containing the first failure information to the application program on the management server computer 10 via the normal LAN port (the first NIC 21) (step B606). The message also contains source information identifying the server that sent it.
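The decision made in steps B604 to B606 can be summarized in a short Python sketch. The `ask_bmc` and `send_to_manager` callables, the `BmcReply` values, and the message layout are hypothetical; in particular, a return value of `None` is used here to represent the BMC not answering within the time limit.

```python
from enum import Enum


class BmcReply(Enum):
    OK = "ok"
    WRITE_ERROR = "write_error"  # BMC reports that the SEL write failed (NVRAM fault)


def handle_failure(first_info, ask_bmc, send_to_manager, server_id, timeout_s=1.0):
    """Try the BMC first; on an error reply or silence, forward to the management server.

    ask_bmc(first_info, timeout_s) returns a BmcReply, or None if the BMC did
    not answer within timeout_s. Returns "sel" or "manager" to show the route taken.
    """
    reply = ask_bmc(first_info, timeout_s)           # step B604
    if reply is BmcReply.OK:                         # step B605
        return "sel"
    message = {"source": server_id, "failure": first_info}
    send_to_manager(message)                         # step B606
    return "manager"


sent = []
dead_bmc = lambda info, timeout_s: None              # simulates a failed BMC
route = handle_failure({"error": "SERR"}, dead_bmc, sent.append, "20A")
```

Treating an explicit error and a timeout the same way keeps the fallback path simple: either condition means the SEL entry cannot be trusted to exist.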

Next, the operation of the BMC proxy program 501 on receiving the message will be described with reference to the flowchart of FIG. 7.

The BMC manager 502 receives the message containing the first failure information sent from the first server computer 20A (step B701).

The BMC manager extracts the first failure information and the source information from the message and, based on the source information, passes the first failure information to the first virtual BMC (step B702).

The first virtual BMC writes third failure information, which includes the first failure information, to the SEL 251A in the folder 511A associated with the first virtual BMC 503A corresponding to the first server computer (step B703).

The first virtual BMC 503A determines whether the failure that occurred in the first server computer 20A is registered in the PEF (step B704). If the failure is registered (Yes in step B704), the first virtual BMC 503A sends the administrator terminal 30 a trap carrying fourth failure information, which includes at least part of the information contained in the first failure information (step B705). If the failure is not registered (No in step B704), the processing ends. The processing also ends after the trap has been sent (step B705).
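Steps B701 to B705 can be sketched in Python as a manager that routes each message to a per-server virtual BMC. The class names, the message keys `source` and `failure`, and the in-memory lists standing in for the SEL file and for trap transmission are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualBmc:
    pef: set                                   # failure kinds that trigger a trap
    sel: list = field(default_factory=list)    # stands in for the SEL in the server's folder
    traps: list = field(default_factory=list)  # stands in for traps sent to the terminal

    def handle(self, failure):
        self.sel.append(failure)                             # step B703
        if failure["error"] in self.pef:                     # step B704
            self.traps.append({"error": failure["error"]})   # step B705


class BmcManager:
    def __init__(self, virtual_bmcs):
        self.virtual_bmcs = virtual_bmcs       # maps source server to its virtual BMC

    def receive(self, message):
        # Steps B701 and B702: choose the virtual BMC from the source information.
        self.virtual_bmcs[message["source"]].handle(message["failure"])


manager = BmcManager({"20A": VirtualBmc(pef={"SERR"}), "20B": VirtualBmc(pef=set())})
manager.receive({"source": "20A", "failure": {"error": "SERR", "bus": 3}})
manager.receive({"source": "20B", "failure": {"error": "SERR", "bus": 1}})
```

In this sketch both failures are logged, but only server 20A's hypothetical PEF setting lists SERR, so only its virtual BMC queues a trap.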

According to this embodiment, even when the BMC or the NVRAM has failed and the content of a failure occurring in the server computer cannot be recorded in the NVRAM, the content of the failure can be written to the storage device in the management server computer 10. This makes the record available for analysis after a failure occurs. It also makes it possible to send a trap to the administrator terminal 30.

In the embodiment above, the sensor instructs the south bridge to issue the SMI event. Alternatively, the BMC may instruct the south bridge to issue the SMI event when a value detected by the sensor exceeds a threshold. In this case, the failure information generation module 2621 acquires the content of the hardware failure from the BMC 24.

Although several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

Description of symbols
10: management server computer
20A: first server computer
20B: second server computer
21: first NIC
22: second NIC
23: network controller
24: BMC (system management controller)
25: nonvolatile memory (first storage unit)
101: CPU
104: south bridge
115A, 115B: PCI devices
116: keyboard controller IC
262: SMI handler
501: BMC proxy program
502: BMC manager
503A: first virtual BMC
503B: second virtual BMC
510: storage device (second storage unit)

Claims

1. A computer connected to a management computer via a network, the computer comprising:
a storage unit;
generating means for generating failure information indicating the content of a hardware failure when the hardware failure occurs in the computer;
issuing means for issuing a first instruction signal when the failure information is generated;
a system management controller that stores the failure information in the storage unit in response to receiving the first instruction signal from the issuing means; and
transmission means for transmitting the failure information to the management computer when storage of the failure information in the storage unit fails.

2. The computer according to claim 1, wherein the system management controller transmits at least a part of the failure information to an administrator terminal based on transmission destination information when the failure information includes information indicating a failure designated by a notification setting.

3. A computer system comprising a first computer connected to a network and a management computer connected to the network, wherein
the first computer comprises:
a first storage unit;
generating means for generating failure information indicating the content of a hardware failure when the hardware failure occurs in the first computer;
issuing means for issuing a first instruction signal when the failure information is generated;
a system management controller that stores the failure information in the first storage unit in response to receiving the first instruction signal from the issuing means; and
transmission means for transmitting the failure information to the management computer when storage of the failure information in the first storage unit fails, and
the management computer comprises:
a second storage unit; and
system management means for storing the failure information in the second storage unit.

4. The computer system according to claim 3, wherein the system management controller transmits at least a part of the failure information to an administrator terminal based on transmission destination information when the failure information includes information indicating a failure specified by a notification setting.

5. The computer system according to claim 4, wherein
the management computer further comprises acquisition means for acquiring the notification setting and the transmission destination information, and
when the failure information includes information indicating a failure specified by the acquired notification setting, the system management means transmits at least a part of the failure information to the administrator terminal based on the acquired transmission destination information.

6. The computer system according to claim 3, further comprising a second computer connected to the network, wherein
the second computer comprises:
a third storage unit;
generating means for generating second failure information indicating the content of a hardware failure when the hardware failure occurs in the second computer;
first issuing means for issuing a second instruction signal when the second failure information is generated;
a second system management controller that stores the second failure information in the third storage unit in response to receiving the second instruction signal from the first issuing means; and
transmission means for transmitting the second failure information to the management computer when storage of the second failure information in the third storage unit fails, and
the management computer stores the second failure information in the second storage unit.

7. A failure information management method for a computer system including a first computer connected to a network and a management computer connected to the network, the method comprising:
generating, by the first computer, failure information indicating the content of a hardware failure when the hardware failure occurs in the first computer;
issuing, by the first computer, a first instruction signal when the failure information is generated;
storing, by a system management controller provided in the first computer, the failure information in a first storage unit provided in the first computer in response to receiving the first instruction signal;
sending, by the system management controller, the failure information to the management computer when storing the failure information in the first storage unit fails; and
storing, by the management computer, the failure information in a second storage unit provided in the management computer.