JP2009252006A

JP2009252006A - Log management system and method in computer system

Info

Publication number: JP2009252006A
Application number: JP2008100202A
Authority: JP
Inventors: Hisashi Saito; 寿齋藤
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2008-04-08
Filing date: 2008-04-08
Publication date: 2009-10-29

Abstract

PROBLEM TO BE SOLVED: To provide a method for managing, as a single integrated log, the logs collected and stored by a plurality of cell nodes as a result of a single fault in a computer system.

SOLUTION: In a computer system with a plurality of cell nodes, when a fault is detected in a component of any cell node, the fault is reported to the management controller in that cell node, and the fault detection is forwarded to the other cell nodes in the computer system. Each cell node stores, as local log data, the faults detected in its own cell node and the fault detections forwarded from other nodes. The local log data is transferred to an integrated management controller in the computer system, which estimates the fault factor of local log data subsequently transferred from each cell node on the basis of the local log data already transferred, accumulates the local log data according to that estimate, and stores it in memory as global log data.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an apparatus, a method, and a program for collecting and managing failure logs in a computer system composed of a plurality of cell nodes.

Some highly expandable computer servers use, as their basic unit, an assembly called a cell node, which carries hardware components such as microprocessors, memory, I/O devices, and an interconnect controller together with a management controller that manages and controls those components. One cell node, or several integrated together, can form a single computer system. In the operation and management of such a computer server, a failure occurring in any cell node must be detected and its cause identified quickly.

In such a computer server, when a computer system is formed from a plurality of cell nodes and a fatal failure occurs in that computer system, the management controllers in the cell nodes each collect and hold, independently of one another, the state (log) of the hardware components mounted on their own cell node. However, for some failures the cause cannot be identified by analyzing any one of these independently collected logs in isolation. Examples are a failure in which the computer system stalls and a failure of the interface connecting the cell nodes.

In this document, a computer system is defined as an aggregate that operates under one operating system (OS), and a computer server is defined as the aggregate of all physically interconnected cell nodes. A single computer server can therefore contain a plurality of computer systems.

One form of such a computer server is the blade server. FIG. 9 shows an example of a blade server and its relationship to computer systems.

The blade server 300 is a computer server that groups together a plurality of blades 301-1 to 301-4. Each blade is the basic unit: an assembly carrying hardware components such as processors, memory, and I/O devices together with a management controller that manages and controls those components.

In addition, Japanese Patent Laid-Open No. 2005-28452 (Patent Document 1), Japanese Patent Laid-Open No. 02-2749 (Patent Document 2), and Japanese Patent Laid-Open No. 11-143738 (Patent Document 3) disclose techniques for centrally managing or simultaneously monitoring failure information across a plurality of processors or computers.

JP 2005-284520 A
JP 02-27449 A
Japanese Patent Application Laid-Open No. 11-143738

A blade server, however, forms each computer system from a single blade and does not integrate multiple blades into one computer system. When a fatal failure occurs in a computer system, only the log of that single blade is collected, and there is no need to collect logs from multiple blades. Moreover, the logs collected on different blades were triggered by different failures and are unrelated to one another.

Furthermore, although the techniques disclosed in the above patent documents centrally manage failure information from a plurality of processors, they do not show any function for classifying and managing that information according to the specific type of failure. It therefore remains difficult to quickly identify the problem location in a computer system composed of a plurality of cell nodes and to take appropriate action.

The present invention has been made in view of the problems described above. Its object is to provide a log management system and a log management method that, in such a computer server, can manage as a single integrated log the logs collected and held in a plurality of cell nodes as a result of a single failure in a computer system.

A log management system according to the present invention manages log data when a failure occurs in a computer system having a plurality of cell nodes. The system comprises: means for detecting a failure in a component in each cell node; means for notifying the management controller in a cell node of the detection when a failure is detected in a component of that cell node; means for forwarding the failure detection to the other cell nodes in the computer system when the management controller has been notified of it; means for holding in local memory, in each cell node, the failures detected in its own cell node and the failure detections forwarded from other cell nodes as local log data; and means for transferring the local log data held in the local memory of each cell node to an overall management controller in the computer system. The overall management controller includes means for estimating, on the basis of the local log data transferred from each cell node, the cause of failure in the local log data subsequently transferred from each cell node, and means for aggregating the local log data according to the estimated failure cause and holding it in memory as global log data.

A log management method according to the present invention manages log data when a failure occurs in a computer system having a plurality of cell nodes. The method comprises the steps of: detecting a failure in a component in each cell node; notifying the management controller in a cell node of the detection when a failure is detected in a component of that cell node; forwarding the failure detection to the other cell nodes in the computer system when the management controller has been notified of it; holding in local memory, in each cell node, the failures detected in its own cell node and the failure detections forwarded from other cell nodes as local log data; and transferring the local log data held in the local memory of each cell node to an overall management controller in the computer system. In the overall management controller, the method further comprises the steps of estimating, on the basis of the local log data transferred from each cell node, the cause of failure in the local log data subsequently transferred from each cell node, and aggregating the local log data according to the estimated failure cause and holding it in memory as global log data.

The present invention makes it possible, in a computer system composed of a plurality of cell nodes, to identify with high accuracy the cause of failures whose cause could not previously be identified.

The reason is that when a failure occurs in any cell node, the local log data recorded and held in each cell node is transferred to the overall management controller, where log data at the same failure level is presumed to stem from a single cause and is recorded and held as such. As a result, identifying the cause of a failure becomes far easier in the maintenance and management of the computer system.

The best mode for carrying out the present invention is described below with reference to the drawings.

(Configuration of the embodiment)
FIG. 1 shows an example of a computer server and an example of a computer system according to an embodiment of the present invention, together with the relationship between them.

The computer server 100 is composed of four cell nodes 101-1 to 101-4. Data transmission and reception across cell nodes, whether between processors, between I/O devices, or between processors and I/O devices, takes place via the transmission path 150. Communication between management controllers across cell nodes takes place via the transmission path 151.

The computer server 100 is partitioned into computer system A and computer system B. Computer system A is formed from three cell nodes 101-1, 101-2, and 101-3, and computer system B from one cell node 101-4. Because no data is exchanged across computer systems between processors, between I/O devices, or between processors and I/O devices, the cell node 101-4 does not exchange data with other cell nodes via the transmission path 150.

Because computer system A is formed from three cell nodes 101-1, 101-2, and 101-3, one management controller must be designated to oversee the whole of computer system A. In this example, the management controller 111-1 in the cell node 101-1 serves as the overall management controller of computer system A.

FIG. 2 shows an example of a cell node, the main building block of the computer server according to the present invention.

The processors 200-n, the memories 201-n, and the I/O devices 202-n are the main components of the computer. When one of these components detects a failure, a failure detection notification is issued to the management controller 111 via the transmission path 260-n. The notification includes a failure level, which expresses the degree of impact the failure has on the computer system. One example of failure level classification is given below.

[Fatal failure] A failure that brings the computer system down. Because this failure level affects the entire computer system, logs must be collected from the entire computer system.
[Warning failure] A failure level at which some hardware components are in an abnormal state but the computer system can continue operating. Logs need to be collected only from the cell node that detected the failure.
[Correctable failure] A failure level for correctable errors such as a single-bit memory error. Logs need to be collected only from the cell node that detected the failure.
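
As a non-limiting illustration, the three failure levels and the log collection scope each one implies could be sketched as follows. The names `FailureLevel` and `log_collection_scope` and the string form of the cell node identifiers are assumptions made for this sketch and do not appear in the embodiment.

```python
from enum import Enum


class FailureLevel(Enum):
    """Failure levels from the classification above (names are illustrative)."""
    FATAL = "fatal"              # computer system goes down
    WARNING = "warning"          # some hardware abnormal, operation continues
    CORRECTABLE = "correctable"  # e.g. single-bit memory error


def log_collection_scope(level, detecting_node, system_nodes):
    """Return the cell nodes from which logs must be collected.

    A fatal failure requires logs from every cell node in the computer
    system; the other levels require only the detecting cell node's log.
    """
    if level is FailureLevel.FATAL:
        return list(system_nodes)
    return [detecting_node]


system_a = ["101-1", "101-2", "101-3"]
fatal_scope = log_collection_scope(FailureLevel.FATAL, "101-3", system_a)
warning_scope = log_collection_scope(FailureLevel.WARNING, "101-3", system_a)
```

Here `fatal_scope` covers all three cell nodes of computer system A, while `warning_scope` contains only the detecting cell node.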

The interconnect controller 110 controls data transmission and reception within the same cell node between the processors 200-n, between the I/O devices 202-n, and between the processors 200-n and the I/O devices 202-n, as well as data transmission and reception between cell nodes in the same computer system. When the interconnect controller 110 itself detects a failure, it issues a failure detection notification to the management controller via the transmission path 260-5. This notification also includes a failure level.

The management controller 111 is the main component for realizing the present invention. It manages and controls the hardware components in its own cell node, and is therefore connected by the transmission paths 260-n to the hardware components in the cell node, such as the processors 200-n, the memories 201-n, the I/O devices 202-n, and the interconnect controller 110.

It is also connected to a nonvolatile memory 204 that holds the logs of its own cell node and to an environment monitoring device 203 that monitors the environment (temperature, power supply, and so on) of its own cell node. In addition, so that the computer system can be managed as a whole, it is connected to the management controllers in the other cell nodes by the transmission path 151.

When the management controller 111 receives a failure detection notification via a transmission path 260-n from a hardware component in the cell node, such as a processor 200-n, a memory 201-n, an I/O device 202-n, or the interconnect controller 110, and the failure level in the notification is fatal failure, it forwards the notification via the transmission path 151 to the management controllers in all cell nodes of the same computer system.

When the management controller 111 receives a failure detection notification, whether from a hardware component in its own cell node or from another cell node, it collects the logs of the hardware components in its own cell node and holds them in the nonvolatile memory 204. Together with the log, it stores in the nonvolatile memory a local log ID that identifies the log, the log collection time, and the failure level embedded in the received notification. FIG. 3 shows an example of the log structure held in the nonvolatile memory 204; it is described in detail later.
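
The record kept for each collected log can be pictured with the following minimal sketch. It assumes that local log IDs are assigned sequentially from 0 within each cell node, which is consistent with the example values given for FIG. 3 but is not stated as a rule; the class and field names (`LocalLogRecord`, `LocalLogStore`, `hardware_log`) are hypothetical, and an in-memory list stands in for the nonvolatile memory 204.

```python
from dataclasses import dataclass, field


@dataclass
class LocalLogRecord:
    """One log entry held by a cell node (field names are illustrative)."""
    local_log_id: int     # identifies the log within this cell node
    collected_at: float   # log collection time
    failure_level: str    # level embedded in the failure detection notification
    hardware_log: bytes   # collected hardware component state


@dataclass
class LocalLogStore:
    """Stand-in for the nonvolatile memory 204 of one cell node."""
    records: list = field(default_factory=list)

    def save(self, collected_at, failure_level, hardware_log):
        # Assumption: the next free sequential number becomes the local log ID.
        record = LocalLogRecord(len(self.records), collected_at,
                                failure_level, hardware_log)
        self.records.append(record)
        return record


# Cell node 101-1 already holds two logs, so the next one gets local log ID 2.
store = LocalLogStore()
store.save(10.0, "warning", b"earlier log 0")
store.save(20.0, "correctable", b"earlier log 1")
latest = store.save(30.0, "fatal", b"state at fatal failure")
```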

After holding the logs of the hardware components in its own cell node in the nonvolatile memory 204, the management controller 111 issues a log collection notification, accompanied by the local log ID, the log collection time, and the failure level, to the overall management controller 111.

When the management controller 111 acting as the overall management controller receives, from a management controller 111 in the same computer system, a log collection notification whose failure level is fatal failure, it judges that any log collection notifications with a fatal failure level received from other management controllers 111 in the same computer system within a fixed time thereafter are logs caused by the same failure. It then manages the log collection notifications from those different management controllers as a single group.

FIG. 4 shows an example of this management method. The management controller 111 acting as the overall management controller has a log management table, like the one shown in FIG. 4, that allows unified management of the hardware component logs collected and held within its own computer system.

The entry with global log ID = 3 in FIG. 4 indicates that the logs collected and held as a result of the same fatal failure are local log ID = 2 in the nonvolatile memory 204 of the cell node 101-1, local log ID = 1 in the nonvolatile memory 204 of the cell node 101-2, and local log ID = 0 in the nonvolatile memory 204 of the cell node 101-3.

(Operation of the embodiment)
The operation of this example is described below with reference to FIGS. 5 to 8 and FIGS. 3 and 4.

FIG. 5 is a flowchart of the processing in the distributed log management method of the present invention, from the occurrence of a fatal failure to the management of the logs collected and held in distributed fashion; FIGS. 6 and 7 illustrate that operation. This example describes the operation when a fatal failure is detected in the memory 201-1 in the cell node 101-3.

When the memory 201-1 in the cell node 101-3 detects a fatal failure (step S501 in FIG. 5), it issues a failure detection notification with the failure level set to fatal failure to the management controller 111-3 in its own cell node via the transmission path 260-1 (step S502; 1 in FIG. 7).

On receiving a failure notification whose failure level is fatal failure, the management controller 111-3 in the cell node 101-3 forwards the failure detection notification to the management controllers 111-1 and 111-2 of the other cell nodes in the same computer system (step S503; 2 in FIG. 7). When the failure level is warning failure or correctable failure, the notification is not forwarded to the management controllers of other cell nodes, because for those levels the log in the detecting cell node alone is sufficient to identify the failure location.
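
The forwarding decision of step S503 can be summarized by the short sketch below. The function name `forward_targets`, the string failure levels, and the use of management controller reference numerals as identifiers are assumptions made for illustration only.

```python
def forward_targets(failure_level, own_controller, system_controllers):
    """Return the management controllers to which a notification is forwarded.

    Only a fatal failure is forwarded, and only to the other management
    controllers in the same computer system; warning and correctable
    failures stay within the detecting cell node.
    """
    if failure_level != "fatal":
        return []
    return [c for c in system_controllers if c != own_controller]


controllers_a = ["111-1", "111-2", "111-3"]
fatal_targets = forward_targets("fatal", "111-3", controllers_a)
warning_targets = forward_targets("warning", "111-3", controllers_a)
```

With the fatal failure detected in the cell node 101-3, `fatal_targets` names the management controllers 111-1 and 111-2, while `warning_targets` is empty.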

On receiving the failure detection notification, the management controller 111-3 in the cell node 101-3 collects the logs of the hardware components in its own cell node and holds them in the nonvolatile memory 204 attached to it (step S504-1; 3 in FIG. 6).

In parallel, the management controllers 111-1 and 111-2 in all the other cell nodes of computer system A likewise, on receiving the failure detection notification, collect the logs of the hardware components in their own cell nodes and hold them in the nonvolatile memory 204 attached to each controller (step S504-2; 3 in FIG. 6). Each management controller 111-n also stores, together with the log, a local log ID that is unique within its own cell node, the time at which the log was collected, and the failure level (fatal failure) embedded in the failure detection notification.

FIG. 3 shows an example of the information held in the nonvolatile memory 204 of each cell node at this point. In the cell node 101-1, two logs are already held under local log IDs 0 and 1, so the log collected and held as a result of the fatal failure detected by the memory 201-1 in the cell node 101-3 is stored under local log ID = 2.

In the cell node 101-2, one log is already held under local log ID = 0, so the log collected and held as a result of the fatal failure detected by the memory 201-1 in the cell node 101-3 is stored under local log ID = 1. In the cell node 101-3, no logs have been held so far, so the log collected and held as a result of that fatal failure is stored under local log ID = 0.

When the management controller 111-3 in the cell node 101-3 has finished collecting and holding the logs of its own cell node, it issues a log collection notification, accompanied by the local log ID, the log collection time, and the failure level, to the management controller 111-1, the overall management controller, via the transmission path 151 (step S505-1; 4 in FIG. 6).

In parallel, the management controllers 111-1 and 111-2 in all the other cell nodes of computer system A likewise, once they have finished collecting and holding the logs of their own cell nodes, issue log collection notifications, accompanied by the local log ID, the log collection time, and the failure level, to the management controller 111-1, the overall management controller, via the transmission path 151 (step S505-2; 4 in FIG. 6).

That is, the management controller 111-1 of the cell node 101-1 reports local log ID = 2, the management controller 111-2 of the cell node 101-2 reports local log ID = 1, and the management controller 111-3 of the cell node 101-3 reports local log ID = 0.

When the management controller 111-1, the overall management controller, receives a log collection notification whose failure level is fatal failure from any cell node in computer system A, it starts a timer, which times out after a fixed period. The overall management controller judges that the log collection notifications with a fatal failure level received before the timer times out are logs collected and held as a result of the fatal failure detected by the memory 201-1 in the cell node 101-3 (step S506; 5 in FIG. 6).
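
A minimal sketch of the timer-based judgment in step S506 is given below. It replays already time-stamped notifications instead of running a real timer, treats a notification arriving exactly at the timeout as belonging to a new failure, and ignores non-fatal notifications; these choices, the function name `group_fatal_notifications`, and the tuple layout are assumptions, since the embodiment does not specify them.

```python
def group_fatal_notifications(notifications, window):
    """Group fatal log collection notifications by arrival time.

    notifications: iterable of (received_at, cell_node, local_log_id, level).
    window: timer period; the timer starts at the first fatal notification
            of a group, and later fatal notifications received before it
            expires are judged to stem from the same failure.
    Returns a list of groups, each a list of (cell_node, local_log_id).
    """
    groups = []
    deadline = None
    for received_at, node, local_log_id, level in sorted(notifications):
        if level != "fatal":
            continue  # handling of other levels is outside this sketch
        if deadline is None or received_at >= deadline:
            groups.append([])                 # first notification: new group
            deadline = received_at + window   # start the timer
        groups[-1].append((node, local_log_id))
    return groups


received = [
    (0.0, "101-3", 0, "fatal"),
    (0.2, "101-2", 1, "fatal"),
    (0.5, "101-1", 2, "fatal"),
    (10.0, "101-2", 2, "fatal"),  # arrives after the timer has expired
]
groups = group_fatal_notifications(received, window=1.0)
```

The first three notifications fall inside one timer period and form a single group; the fourth starts a new group.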

The management controller 111-1, the overall management controller, holds the log management table shown in FIG. 4 and stores in it the result of the judgment made in step S506 (step S507; 6 in FIG. 6). The entry with global log ID = 3 in FIG. 4 represents the logs collected and held as a result of the fatal failure detected by the memory 201-1 in the cell node 101-3.

FIG. 8 shows how the log management table held by the overall management controller corresponds to the logs each cell node holds in its nonvolatile memory 204. Global log ID = 0 corresponds to local log ID = 0 held by the cell node 101-2; global log ID = 1 corresponds to local log ID = 0 held by the cell node 101-1; global log ID = 2 corresponds to local log ID = 1 held by the cell node 101-1; and global log ID = 3 corresponds to local log ID = 2 held by the cell node 101-1, local log ID = 1 held by the cell node 101-2, and local log ID = 0 held by the cell node 101-3. In other words, a new global log ID is added to the log management table for each failure detection, and the table shows which logs were collected and held as a result of that detection and where they reside.
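
The correspondence described for FIG. 8 can be reproduced with the following sketch of a log management table. The class name `LogManagementTable`, its methods, and the dictionary layout are hypothetical; only the ID values are taken from the description above.

```python
class LogManagementTable:
    """Maps each global log ID to the local logs collected for one failure."""

    def __init__(self):
        self._entries = []  # list index doubles as the global log ID

    def register(self, local_logs):
        """Add one failure; local_logs maps cell node -> local log ID."""
        global_log_id = len(self._entries)
        self._entries.append(dict(local_logs))
        return global_log_id

    def locate(self, global_log_id):
        """Return which cell nodes hold logs for this failure, and under which IDs."""
        return self._entries[global_log_id]


table = LogManagementTable()
table.register({"101-2": 0})   # global log ID 0
table.register({"101-1": 0})   # global log ID 1
table.register({"101-1": 1})   # global log ID 2
fatal_id = table.register({"101-1": 2, "101-2": 1, "101-3": 0})  # global log ID 3
```

Calling `table.locate(fatal_id)` then returns the three local logs that belong to the fatal failure, which is the lookup a maintainer would need in order to analyze the logs across cell nodes.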

This completes the series of processes from the occurrence of a failure in a cell node to the registration of the logs.

(Effect of the embodiment)
As described above, according to the present embodiment, the logs collected and held in a plurality of cell nodes as a result of a single failure in the computer system can be managed as a single log. The logs collected in those cell nodes can therefore be analyzed across nodes, with the effect that the cause can be identified even for failures whose cause could not previously be identified.

The management controller of each cell node according to the embodiment of the present invention can of course be implemented in hardware. It can also be implemented in software by storing a log management program that provides its functions in an auxiliary storage unit such as a hard disk device or ROM and executing that program on a processor.

The present invention has been described above with reference to a preferred embodiment, but it is not necessarily limited to that embodiment and can be implemented with various modifications within the scope of its technical idea.

For example, in the above embodiment the management controller of one selected cell node is used as the overall management controller, but the overall management controller may instead be implemented on a management controller outside the cell nodes.

The functions of the management controller described above may also be given to the interconnect controller. Data transmission between interconnect controllers is generally faster than data transmission between management controllers, so this variation is preferable for reducing the differences in log data transfer time between cell nodes.

FIG. 1 is a block diagram showing the configuration of a log management system in a computer system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a cell node in the embodiment of FIG. 1.
FIG. 3 is a diagram showing an example of the information held in the nonvolatile memory of each cell node.
FIG. 4 is a diagram showing an example of the log management table, held in the nonvolatile memory of the cell node that has the overall management controller, for unified management of hardware component logs.
FIG. 5 is a flowchart of the processing in the log management method according to an embodiment of the present invention, from the occurrence of a fatal failure to the management of the logs collected and held in distributed fashion.
FIG. 6 is a block diagram showing the configuration of the log management system according to an embodiment of the present invention, annotated with the processing steps of FIG. 5.
FIG. 7 is a block diagram showing the configuration of a cell node, annotated with the processing steps of FIG. 5.
FIG. 8 is a diagram showing how the log management table held by the overall management controller corresponds to the logs each cell node holds in its nonvolatile memory.
FIG. 9 is a block diagram showing the configuration of a blade server according to the related art.

Explanation of symbols

100: Computer server
101-n: Cell node
110-n: Interconnect controller
111-n: Management controller
111-1: Overall management controller
150: Transmission path
151: Transmission path
200-n: Processor
201-n: Memory
202-n: I/O device
203: Environmental monitoring device
204: Nonvolatile memory
260-n: Transmission path

Claims

1. A log management system for managing log data when a failure occurs in a computer system having a plurality of cell nodes, the log management system comprising:
means for detecting a failure in a component in each cell node;
means for notifying, when a failure is detected in a component in any of the cell nodes, the management controller in that cell node of the detection of the failure;
means for transferring the detection of the failure to the other cell nodes in the computer system when the management controller is notified of the detection;
means for holding, in each cell node, the failure detected in its own cell node and the failure detections transferred from the other cell nodes in a local memory as local log data; and
means for transferring the local log data held in the local memory of each cell node to an overall management controller in the computer system,
wherein the overall management controller comprises:
means for estimating, based on the local log data transferred from each cell node, the cause of failure in the local log data transferred from each cell node; and
means for aggregating the local log data according to the estimated cause of failure and holding it in a memory as global log data.

2. The log management system according to claim 1, wherein the means for notifying the management controller in the cell node of the detection of the failure notifies the management controller of information identifying the component in which the failure occurred and information identifying the level of the failure.

3. The log management system according to claim 2, wherein the information identifying the level of the failure includes at least information indicating whether or not the failure is a fatal failure affecting the entire computer system.

4. The log management system according to claim 3, wherein the means for transferring the detection of the failure to the other cell nodes in the computer system transfers the detection only when the information identifying the level of the failure indicates a fatal failure.

5. The log management system according to claim 1, wherein the means for holding the failure in the local memory as local log data holds, for each failure, information specifying a unique log ID, a log data collection time, and a failure level.

6. The log management system according to any one of claims 1 to 5, wherein, when the failure level in first local log data is a fatal failure, the means of the overall management controller for estimating the cause of failure in the local log data transferred from each cell node estimates that local log data at the fatal failure level transferred within a predetermined time from the transfer of the first local log data is caused by the same factor as the first failure.

7. The log management system according to claim 1, wherein the management controller is an interconnect controller in each cell node.

8. The log management system according to claim 1, wherein a management controller selected from among the management controllers in the cell nodes serves as the overall management controller.

9. A log management method for managing log data when a failure occurs in a computer system having a plurality of cell nodes, the method comprising the steps of:
detecting a failure in a component in each cell node;
notifying, when a failure is detected in a component in any of the cell nodes, a management controller in that cell node of the detection of the failure;
transferring the detection of the failure to the other cell nodes in the computer system when the management controller is notified of the detection;
holding, in each cell node, the failure detected in its own cell node and the failure detections transferred from the other cell nodes in a local memory as local log data;
transferring the local log data held in the local memory of each cell node to an overall management controller in the computer system;
estimating, in the overall management controller and based on the local log data transferred from each cell node, the cause of failure in local log data subsequently transferred from each cell node; and
aggregating the local log data according to the estimated cause of failure and holding it in a memory as global log data.

10. The log management method according to claim 9, wherein, in the step of notifying the management controller in the cell node of the detection of the failure, the management controller is notified of information identifying the failed component and information identifying the level of the failure.

11. The log management method according to claim 10, wherein the information identifying the level of the failure includes at least information indicating whether or not the failure is a fatal failure affecting the entire computer system.

12. The log management method according to claim 11, wherein, in the step of transferring the detection of the failure to the other cell nodes in the computer system, the detection is transferred only when the information identifying the level of the failure indicates a fatal failure.

13. The log management method according to any one of claims 9 to 12, wherein, in the step of holding the failure in the local memory as local log data, information specifying a unique log ID, a log data collection time, and a failure level is held for each failure.

14. The log management method according to any one of claims 9 to 13, wherein, in the step of estimating, in the overall management controller, the cause of failure in the local log data transferred from each cell node, when the failure level in first local log data is a fatal failure, local log data at the fatal failure level transferred within a predetermined time from the transfer of the first local log data is estimated to be caused by the same factor as the first failure.

15. The log management method according to any one of claims 9 to 14, wherein the management controller is an interconnect controller in each cell node.

16. The log management method according to claim 9, wherein a management controller selected from among the management controllers in the cell nodes serves as the overall management controller.
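
The time-window estimation and aggregation recited in the claims above can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `LocalLog`, `GlobalLog`, `aggregate`, and `window`, the `transferred_at` field, and the choice to keep each non-fatal entry as its own global entry are all assumptions made for illustration. Only the log ID, collection time, failure level, and the rule that fatal entries transferred within a predetermined time of the first fatal entry share its cause come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FATAL = "fatal"
NON_FATAL = "non-fatal"


@dataclass
class LocalLog:
    node_id: int           # cell node that held the entry (hypothetical field)
    log_id: int            # unique log ID within that node
    collected_at: float    # log data collection time, in seconds
    transferred_at: float  # arrival time at the overall controller (assumed)
    level: str             # FATAL or NON_FATAL


@dataclass
class GlobalLog:
    first: LocalLog                                   # entry that opened the group
    same_cause: List[LocalLog] = field(default_factory=list)


def aggregate(local_logs: List[LocalLog], window: float) -> List[GlobalLog]:
    """Group transferred local log entries into global log entries.

    A fatal entry opens a group. Later fatal entries transferred within
    `window` seconds of that first transfer are estimated to share its cause.
    """
    global_logs: List[GlobalLog] = []
    open_group: Optional[GlobalLog] = None
    for log in sorted(local_logs, key=lambda entry: entry.transferred_at):
        if log.level != FATAL:
            # Assumption: non-fatal entries are stored on their own.
            global_logs.append(GlobalLog(first=log))
            continue
        if (open_group is not None
                and log.transferred_at - open_group.first.transferred_at <= window):
            open_group.same_cause.append(log)
        else:
            open_group = GlobalLog(first=log)
            global_logs.append(open_group)
    return global_logs


# Three cell nodes report fatal failures; the first two arrive close together.
logs = [
    LocalLog(node_id=1, log_id=1, collected_at=10.0, transferred_at=10.2, level=FATAL),
    LocalLog(node_id=2, log_id=1, collected_at=10.1, transferred_at=10.4, level=FATAL),
    LocalLog(node_id=3, log_id=1, collected_at=30.0, transferred_at=30.3, level=FATAL),
]
groups = aggregate(logs, window=1.0)
```

With a one-second window, the entries from nodes 1 and 2 fall into a single global entry, while the entry from node 3 opens a new one. A real system would need to choose the window to cover the transfer-time differences between cell nodes.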