JP2004302731A

JP2004302731A - Information processor and method for trouble diagnosis

Info

Publication number: JP2004302731A
Application number: JP2003093171A
Authority: JP
Inventors: Yuji Fujiwara; 勇治藤原
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-03-31
Filing date: 2003-03-31
Publication date: 2004-10-28

Abstract

PROBLEM TO BE SOLVED: To perform fault diagnosis in a multiprocessor system without requiring any special hardware for fault diagnosis.
SOLUTION: A POST task table for fault diagnosis is loaded into a memory 5, and one CPU of a processor group 1 executes the tasks in order while referring to the task table. In addition, a command format table is loaded into the memory 5, and the processor group 1 successively updates and refers to this table to determine whether execution of each POST task has completed.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
[Technical Field of the Invention]
The present invention relates to an information processing apparatus using a plurality of processors, and more particularly, to a PC server apparatus having a function of automatically recovering from a failure.
[0002]
[Prior art]
Conventionally, to improve reliability, PC server devices have been equipped with a function (hereinafter referred to as POST (Power On Self Test)) that diagnoses the failure and automatically recovers from it when a failure such as a hard disk device failure or a network failure occurs.
[0003]
When POST is executed, its tasks are executed in order according to a task table held in the system BIOS. An ordinary BIOS outputs the task number to a specific input/output port before executing each task, to indicate externally which task is being executed. For example, by connecting an LED or the like to the I/O port that carries the task number assigned to each task, the task can be identified from the LED's blinking pattern. Thus, if the system hangs, looking at the LED shows which task was being executed when the hang occurred.
[0004]
A recent PC server BIOS uses a specific command set called IPMI (Intelligent Platform Management Interface) to notify a processor called the BMC (Baseboard Management Controller) of the task number. At the same time, a BMC timer is set so that the system is rebooted if the timer is not reset before the set time elapses. If the system hangs during POST, it is rebooted after the set time has elapsed and thereby recovered. Since the BMC holds the task number at the time of the reboot, this number is recorded as log data.
[0005]
[Patent Document 1]
Japanese Patent Application Laid-Open No. H10-143387 (FIG. 1)
[0006]
[Problems to be solved by the invention]
As described above, the conventional fault diagnosis method always requires a dedicated processor such as a BMC. Some systems have no BMC; such a system can reboot when it hangs, but its functions are limited: for example, it cannot record the task number at which the hang occurred.
[0007]
Accordingly, an object of the present invention is to provide a PC server that can monitor a POST startup failure without having a specific processor such as a BMC.
[0008]
[Means for Solving the Problems]
To solve the above problem, the present invention provides an information processing apparatus having a plurality of processors, comprising: a first CPU that sequentially executes a plurality of tasks for failure diagnosis; a second CPU that monitors the execution status of the task being executed by the first CPU; and a memory into which the contents of the sequentially executed tasks and monitoring data corresponding to the tasks are loaded, wherein the first CPU updates the monitoring data for each task it executes, and the second CPU determines whether a failure has occurred by referring to the update status of the monitoring data.
[0009]
Further, in the present invention, the monitoring data includes a task number for specifying the task processing content, a timeout value corresponding to the task number, and status data indicating the task execution status.
[0010]
Further, in the present invention, the second CPU refers to the timeout value and determines that a failure has occurred if the first CPU has not updated the status data within the time indicated by the timeout value.
[0011]
The present invention also provides a failure diagnosis method for an information processing apparatus having a first CPU that sequentially executes a plurality of tasks for failure diagnosis, a second CPU that monitors the execution status of the task being executed by the first CPU, and a memory into which the contents of the sequentially executed tasks and monitoring data corresponding to the tasks are loaded, wherein the first CPU updates the monitoring data for each task it executes, and the second CPU determines whether a failure has occurred by referring to the update status of the monitoring data.
[0012]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing a system configuration of a PC server according to an embodiment of the present invention.
The PC server according to the present embodiment has a processor group 1 made up of a plurality of processors. The processor group 1 consists of M CPUs, CPU0 to CPUm: CPU0 is the BSP (Bootstrap Processor), which executes the system startup processing, and CPU1 to CPUm are APs (Application Processors), which execute application programs after the system has started.
[0013]
The processor group 1 is connected to a north bridge 3 via a host bus 2 and controls the operation of a memory 5 connected via a memory bus 4 and of an I/O device group 6 connected via a PCI bus 5. A task table for POST processing, as shown in FIG. 2, is loaded in the memory 5, and one CPU of the processor group 1 executes the tasks in order while referring to this task table.
[0014]
Further, a command format table as shown in FIG. 3 is loaded in the memory 5, and the processor group 1 successively updates and refers to this table to determine whether execution of each POST task has completed.
[0015]
The memory 5 is specifically a DIMM (Dual Inline Memory Module). The devices of the device group 6 include a hard disk device storing an application program, a ROM storing a BIOS program, and the like.
[0016]
The south bridge 7 is a bridge circuit through which the processor group 1 controls the operation of other I/O devices connected via an ISA bus (not shown) or the SM bus 8.
Next, the contents of the task table of FIG. 2 will be described.
The task table is a list describing the processing content of each task. After system startup begins, the tasks are executed in order starting from task 1 to diagnose the failure status of the system. For example, the first task, task 1, initializes the chipset; if the initialization completes normally, processing moves on to the next task.
[0017]
When every task has been carried out without problems, BIOS setup is performed last (task N-1), OS boot processing (task N) is entered, and the POST operation is complete.
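The sequential execution of the task table can be sketched as follows. This is a minimal illustration rather than the BIOS implementation: `run_post`, the handler callables, and the three-entry table are hypothetical names introduced here, and a task that fails is modelled as a handler returning False. Only the ordering (chipset initialization first, BIOS setup as task N-1, OS boot as task N) comes from the description.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical task table entry: (task number, description, handler).
# A handler returns True when its task completes normally.
Task = Tuple[int, str, Callable[[], bool]]


def run_post(task_table: List[Task]) -> Optional[int]:
    """Execute tasks in table order.

    Returns None when every task completes, otherwise the number of the
    first task that did not complete normally.
    """
    for number, _description, handler in task_table:
        if not handler():
            return number  # stop here: later tasks are never reached
    return None


# Illustrative table with N = 3 (the real table of FIG. 2 is not reproduced).
example_table: List[Task] = [
    (1, "initialize chipset", lambda: True),
    (2, "BIOS setup (task N-1)", lambda: True),
    (3, "start OS boot (task N)", lambda: True),
]
```

Calling `run_post(example_table)` walks all three entries in order; replacing any handler with one that returns False makes the function stop at that task number.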
[0018]
In this embodiment, CPU0 sequentially executes each task written in the task table. CPU1 then monitors whether CPU0's execution of each task has succeeded, using the command format of FIG. 3.
[0019]
The command format shown in FIG. 3 is used for executing and monitoring this task processing. Its contents are as follows.
The command format is a command table list in which an offset number, a command size, and a data field are associated with one another.
The offset number is used as a list number. The size is the size of the data written in the data field. The data field holds data that is updated for each task. Offset 1 holds the number of the task being executed. Offset 2 holds a timeout value, which is the time information used to detect that a task executed by CPU0 has not completed within a predetermined time. The timeout value differs from task to task and is written in units of 100 ms.
[0020]
Offset 3 holds reference data for detecting whether execution of the task has completed. When CPU0 completes the current task normally, bit 1 (1b) is written. While the task is still running, or when it cannot complete normally, the field remains at bit 0 (0b). By monitoring offset 3, CPU1 can determine whether the current task has completed or is still running. CPU1 refers to the timeout value written at offset 2; if the data at offset 3 has not been updated within that time, CPU1 judges that the task did not complete normally and executes the timeout action shown at offset 4 in FIG. 4 to carry out failure recovery processing. The timeout action is one of No Action (remain stopped without recovery processing), Hard Reset (forced reset), Power Down (shutdown processing), or Power Cycle (restart processing), and CPU0 writes it into the command format loaded in the memory 5 for each task.
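The four offsets and CPU1's decision can be sketched as a small record plus a check function. This is an illustration under stated assumptions: the names `CommandFormat`, `TimeoutAction`, and `monitor` are hypothetical, field sizes are omitted because the description does not give them, and a timeout is assumed to occur once the elapsed time reaches the written value.

```python
from dataclasses import dataclass
from enum import Enum


class TimeoutAction(Enum):
    # The four actions named in the description (offset 4).
    NO_ACTION = "No Action"
    HARD_RESET = "Hard Reset"
    POWER_DOWN = "Power Down"
    POWER_CYCLE = "Power Cycle"


TIMEOUT_UNIT_MS = 100  # timeout values are written in units of 100 ms


@dataclass
class CommandFormat:
    task_number: int      # offset 1: task currently being executed
    timeout_units: int    # offset 2: per-task timeout, in 100 ms units
    completed: int = 0    # offset 3: 1 (1b) on normal completion, else 0 (0b)
    action: TimeoutAction = TimeoutAction.NO_ACTION  # offset 4: timeout action


def monitor(cmd: CommandFormat, elapsed_ms: int):
    """CPU1's decision for one observation of the table.

    Returns "completed", "running", or the TimeoutAction to carry out.
    Assumption: the timeout fires when elapsed_ms reaches the limit.
    """
    if cmd.completed == 1:
        return "completed"
    if elapsed_ms < cmd.timeout_units * TIMEOUT_UNIT_MS:
        return "running"
    return cmd.action
```

Keeping the action in the same record as the timeout reflects the statement that CPU0 writes the action per task, so different tasks can request different recovery behaviour.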
[0021]
As described above, the command format shown in FIG. 3 is updated task by task in response to write instructions from CPU0. That is, CPU0 executes the tasks in order starting from task 1 of the task table shown in FIG. 2, and when execution of one task completes, the data is changed to that for the next task.
[0022]
The operation of the present embodiment in such a configuration will be described.
First, at the same time as CPU0 sends the first task number, 1, shown in FIG. 2 to a specific port, the BIOS read from the device group 6 performs the following processing.
[0023]
(1) Wake CPU1 (issue a wake interrupt to CPU1 so that it starts running).
(2) Notify CPU1 of the task number.
(3) Notify CPU1 of the timeout time.
(Notifications (2) and (3) are carried out by CPU1 referring to the command format written in the memory 5. CPU0 writes the command format shown in FIG. 3 to a specific address in the memory 5, and CPU1 operates according to that command format.)
[0024]
(4) Instruct CPU1 to start the timer; CPU1 then monitors the time CPU0 takes to complete the task.
In operation (4), if CPU1 does not receive a stop or start/restart command within the time written at offset 2 of the command format, it executes the timeout action processing shown in FIG. 4. The task number at the time of the timeout is also recorded in a nonvolatile memory such as a flash memory so that the failure can be analyzed later.
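The timer behaviour in operation (4) can be sketched with a simulated clock. This is a hypothetical model, not the patented firmware: `WatchdogTimer`, its method names, and the list standing in for nonvolatile memory are all assumptions made for illustration. A start/restart command rearms the timer, a stop command disarms it, and on expiry the current task number is logged.

```python
from typing import List


class WatchdogTimer:
    """Simulated timer kept by the monitoring CPU (CPU1)."""

    def __init__(self, nonvolatile_log: List[int]) -> None:
        self.log = nonvolatile_log  # stands in for flash memory
        self.task_number = 0
        self.remaining_ms = 0
        self.running = False

    def start(self, task_number: int, timeout_ms: int) -> None:
        """start/restart command: (re)arm the timer for the given task."""
        self.task_number = task_number
        self.remaining_ms = timeout_ms
        self.running = True

    def stop(self) -> None:
        """stop command: disarm the timer."""
        self.running = False

    def tick(self, elapsed_ms: int) -> bool:
        """Advance simulated time; return True if the timer expired."""
        if not self.running:
            return False
        self.remaining_ms -= elapsed_ms
        if self.remaining_ms > 0:
            return False
        self.running = False
        self.log.append(self.task_number)  # record the task that timed out
        return True
```

When `tick` returns True, the caller would go on to carry out the timeout action written for that task; the log entry is what later failure analysis would read back.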
[0025]
Tasks from task number 2 onward are then executed in order. For these tasks only the following notifications are made.
(5) Notify CPU1 of the task number.
(6) Notify CPU1 of the timeout time.
(7) Instruct CPU1 to start the timer.
(Operations (5) to (7) are carried out in the same specific manner as (2) to (4).)
Immediately before OS startup, which is the last task shown in FIG. 2, the following processing is executed.
(8) Instruct CPU1 to stop timer measurement.
(9) Shift CPU1 to the HLT state (halted state).
When processing has completed through operation (9), it is judged that no failure occurred, the OS is started, and the application programs of the PC server begin running.
As described above, according to the present embodiment, it is possible to execute a fault diagnosis operation in a multiprocessor system without requiring special hardware for fault diagnosis.
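The order of operations (1) to (9) can be summarized as a trace of the interactions between CPU0 and CPU1. The sketch below is illustrative only: `post_monitoring_steps` and the step strings are hypothetical, and it assumes that the tasks passed in are those that precede OS startup, so the stop and halt steps follow the last of them.

```python
from typing import List, Tuple


def post_monitoring_steps(tasks: List[Tuple[int, int]]) -> List[str]:
    """Return the ordered CPU0 -> CPU1 interactions for the given tasks.

    tasks: (task number, timeout in 100 ms units) pairs in execution order.
    """
    steps = ["wake CPU1"]  # operation (1), issued once before the first task
    for number, timeout_units in tasks:
        steps.append(f"notify task number {number}")              # (2) / (5)
        steps.append(f"notify timeout {timeout_units * 100} ms")  # (3) / (6)
        steps.append("start timer")                               # (4) / (7)
    steps.append("stop timer")  # operation (8), just before OS startup
    steps.append("halt CPU1")   # operation (9), HLT state
    return steps


if __name__ == "__main__":
    for step in post_monitoring_steps([(1, 5), (2, 20)]):
        print(step)
```

The wake step appears only once because, per the description, tasks from number 2 onward need only the three notifications.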
[0026]
[Effect of the Invention]
According to the present invention, a fault diagnosis operation can be performed in a multiprocessor system without requiring special hardware for fault diagnosis.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a system configuration of a PC server according to an embodiment of the present invention.
FIG. 2 is a diagram showing contents of a task table in the embodiment.
FIG. 3 is a diagram illustrating a command format during execution of a task in the embodiment.
FIG. 4 is a diagram illustrating a command format indicating a recovery operation when a failure occurs in the embodiment.
[Explanation of symbols]
1 ... Processor group, 2 ... Host bus, 3 ... North bridge, 4 ... Memory bus, 5 ... Memory, 6 ... Device group, 7 ... South bridge.

Claims

1. An information processing apparatus having a plurality of processors, comprising:
a first CPU that sequentially executes a plurality of tasks for failure diagnosis;
a second CPU that monitors the execution status of the task being executed by the first CPU; and
a memory into which the contents of the sequentially executed tasks and monitoring data corresponding to the tasks are loaded,
wherein the first CPU updates the monitoring data for each task it executes, and the second CPU determines whether a failure has occurred by referring to the update status of the monitoring data.

2. The information processing apparatus according to claim 1, wherein the monitoring data includes a task number for specifying the task processing content, a timeout value corresponding to the task number, and status data indicating the task execution status.

3. The information processing apparatus according to claim 2, wherein the second CPU refers to the timeout value and determines that a failure has occurred if the first CPU has not updated the status data within the time indicated by the timeout value.

4. A failure diagnosis method for an information processing apparatus having a first CPU that sequentially executes a plurality of tasks for failure diagnosis, a second CPU that monitors the execution status of the task being executed by the first CPU, and a memory into which the contents of the sequentially executed tasks and monitoring data corresponding to the tasks are loaded,
wherein the first CPU updates the monitoring data for each task it executes, and the second CPU determines whether a failure has occurred by referring to the update status of the monitoring data.

5. The failure diagnosis method according to claim 4, wherein the monitoring data includes a task number for specifying the task content, a timeout value corresponding to the task number, and status data indicating the task execution status.

6. The failure diagnosis method according to claim 5, wherein the second CPU refers to the timeout value and determines that a failure has occurred if the first CPU has not updated the status data within the time indicated by the timeout value.