JP5293412B2 - Computer system and computer system failure processing method

Info

Publication number: JP5293412B2
Application number: JP2009132140A
Authority: JP
Inventors: 尚希安達
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-06-01
Filing date: 2009-06-01
Publication date: 2013-09-18
Anticipated expiration: 2029-06-01
Also published as: JP2010277514A

Abstract

PROBLEM TO BE SOLVED: To provide a fault recovery method that can shorten the time required to initialize expansion devices before starting an OS.

SOLUTION: A computer system includes a main apparatus, a plurality of expansion devices, and a service processor for controlling the operation of the main apparatus. When power is restored to the main apparatus after a fault, the service processor initializes only a first device group of devices out of the plurality of expansion devices which is necessary to boot an operating system, and after initializing the first device group, starts the operating system. After the operating system is started, a CPU executes fault handling on the operating system. After the fault handling is started by the CPU, the service processor initializes a second device group of devices out of the plurality of expansion devices, which is other than the first device group.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a computer system in which expansion devices are mounted, and to a failure processing method for such a computer system.

A computer system is known in which a plurality of expansion devices (for example, PCI devices) for extending functionality are mounted on a main system apparatus that includes a CPU, a north bridge, a south bridge, and the like. In such a computer system, a restart (system reboot) may be performed when a failure occurs.

FIG. 1 is a flowchart showing the operation at a restart caused by a failure. As shown in FIG. 1, when a failure occurs in the computer system (S101), a DC OFF command is executed in the main system apparatus and the operating system (hereinafter, OS) is shut down (S102). When execution of the DC OFF command completes, the power is turned on again (S103). Next, HandOFF is executed (S104). When HandOFF completes, devices such as the plurality of expansion devices are initialized (S105). After all the expansion devices have been initialized, OS startup begins. After OS startup completes and the OS Ready state is reached, failure recovery processing starts (S106).

The computer system cannot be used until it has recovered from the failure. The time required for failure recovery should therefore be as short as possible.

One technique that refines the operation at failure recovery is the information processing apparatus described in Patent Document 1. That apparatus includes a failure detection circuit that detects failures, a failure table that stores a plurality of failures divided into groups, and a control device that determines to which of the groups stored in the failure information table a failure detected by the failure detection circuit belongs and initializes only the hardware related to the determined group. Patent Document 1 states that this allows the minimum necessary hardware to be initialized, and that the performance of retry processing improves because hardware unrelated to the failure need not be initialized.

Patent Document 2 describes duplicating the core I/O card so that, even when the core I/O card in use is disconnected because of a failure or the like, the system can be rebooted using the other core I/O card.

Patent Document 3 discloses an information processing apparatus characterized in that the initialization process of the operating system includes processing means that determines that the drives required to run the system being booted have been found and then ends the drive search. Patent Document 3 states that this keeps the number of drives searched at startup to the necessary minimum and shortens the processing time required for system startup.

Patent Document 4 describes providing a first switch and a second switch on the main body of an information processing apparatus that includes a storage device storing an operating system. When the main body is powered on by operating the first switch, the operating system is started after a first startup process is executed; the first startup process includes initializing a plurality of devices, including the storage device. When the main body is powered on by operating the second switch, the operating system is started after a second startup process is executed, in which a predetermined part of the first startup process is omitted. Patent Document 4 states that, when the main body is powered on by operating the second switch, initialization of each device not required for starting the operating system is skipped, so the time from power-on to operating system startup can be greatly shortened.

Patent Document 1: JP 2000-200199 A
Patent Document 2: JP 2005-266948 A
Patent Document 3: JP 2006-236058 A
Patent Document 4: JP 2006-252329 A

Initializing the expansion devices at restart is indispensable for making them usable, but it takes a long time. For example, when a PCI device is mounted in a PCI slot, initialization checks whether an Option ROM exists in that PCI slot. If an Option ROM exists, its code is loaded into memory. Loading the code takes a long time.

Accordingly, an object of the present invention is to provide a computer system, and a failure recovery method for a computer system, that can shorten the time required to initialize expansion devices before the OS starts.

The means for solving the problem are described below using the parenthesized reference signs used in [Mode for Carrying Out the Invention]. These reference signs are added to clarify the correspondence between the description in [Claims] and the description in [Mode for Carrying Out the Invention], but they must not be used to interpret the technical scope of the invention described in [Claims].

The computer system of the present invention includes a main system apparatus (10) having a CPU (11), a plurality of expansion devices (17 to 19) mounted on the main system apparatus (10) to extend its functions, and a service processor (20) that controls the operation of the main system apparatus (10) independently of the CPU (11). When the power of the main system apparatus (10) is turned on again because of a failure, the service processor (20) initializes only a first device group (17), which consists of the devices among the plurality of expansion devices (17 to 19) that are necessary for starting up the operating system, and starts the operating system after initializing the first device group (17). After the operating system has started, the CPU (11) executes failure processing on the operating system. After execution of the failure processing by the CPU (11) has started, the service processor (20) initializes a second device group (18, 19), which consists of the devices among the plurality of expansion devices (17 to 19) other than the first device group (17).

According to the present invention, a computer system and a failure recovery method for a computer system are provided that can shorten the time required to initialize expansion devices before the OS starts.

FIG. 1 is a flowchart showing a failure processing method for a computer system.
FIG. 2 is a block diagram schematically showing the configuration of the computer system of the embodiment.
FIG. 3 is a flowchart showing the failure processing method for the computer system of the embodiment.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 2 is a schematic block diagram showing the configuration of the computer system of this embodiment. In this embodiment, a server is used as an example of the computer system. The computer system includes a main system apparatus 10, a service processor 20, an SNMP (Simple Network Management Protocol) manager 30, and a card information storage unit 40.

The main system apparatus 10 includes a plurality of (four) CPUs (11-1 to 11-4), a north bridge 12, a plurality of (two) south bridges (13-1, 13-2), a main storage device 14, a processor bus 15, a plurality of (two) PCI buses (16-1, 16-2), and a plurality of PCI slots 17. The four CPUs (11-1 to 11-4) are connected to the north bridge 12 via the processor bus 15. The two south bridges (13-1, 13-2) are connected to the north bridge 12. The PCI buses (16-1, 16-2) are connected under the south bridges (13-1, 13-2), respectively. A plurality of PCI slots are connected under each of the PCI buses (16-1, 16-2).

The north bridge 12 is a system controller that includes a host-PCI bridge. In addition to the host-PCI bridge, the north bridge 12 incorporates components such as the memory controller for the main storage device 14.

Each south bridge 13 functions as the interface (PCI bus controller) for the PCI bus 16 connected under it.

The plurality of PCI slots 17 are provided for mounting PCI boards (expansion devices). In this embodiment, eight PCI slots are provided under each of the PCI buses 16-1 and 16-2. A SCSI (Small Computer System Interface) board 17-1, an FC (Fibre Channel) board 18-1, and a NIC (Network Interface Card) 19-1 are connected as PCI boards to PCI slots under the PCI bus 16-1. Similarly, a SCSI board (17-2), an FC board (18-2), and a NIC (19-2) are connected to PCI slots under the PCI bus 16-2. Although this embodiment uses three kinds of interface (NIC, SCSI, and FC) as examples of PCI boards, various other PCI board interfaces may be mounted in the remaining PCI slots.

In this embodiment, Boot Disks (1-1, 1-2) used when starting up the OS are assumed to be connected under the SCSI boards (17-1, 17-2), respectively. Disks (2-1, 2-2) are assumed to be connected under the FC boards (18-1, 18-2), respectively.

The service processor 20 is a processor that controls the operation of the main system apparatus 10 independently of the CPUs (11-1 to 11-4) of the main system apparatus 10. As described in detail later, the service processor 20 controls the operation of the main system apparatus 10 when the computer system is restarted because of a failure.

The SNMP manager 30 is connected to the service processor 20. At a restart caused by a failure, the service processor 20 controls the operation of the main system apparatus 10 according to instructions from the SNMP manager 30. The SNMP manager 30 may be connected to the service processor 20 via a network.

The card information storage unit 40 stores, in advance, information for identifying which of the plurality of PCI devices are necessary for starting up the operating system (the first device group). In this embodiment, the devices necessary for starting up the operating system are the Boot Disks (1-1, 1-2) connected to the SCSI boards (17-1, 17-2). The card information storage unit 40 therefore stores information for identifying the PCI slots to which the SCSI boards (17-1, 17-2) are connected. The card information storage unit 40 is connected to the service processor 20. At a restart caused by a failure, the service processor 20 accesses the card information storage unit 40 and controls the operation of the main system apparatus 10 based on the information stored there. The card information storage unit 40 can be implemented with, for example, a hard disk.
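The role of the card information storage unit 40 can be pictured with a small sketch. The record layout below (`SlotRecord`, its field names, and the `needed_for_boot` flag) is purely an assumption made for illustration; the patent only states that information identifying the first device group is stored in advance, not how it is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRecord:
    """Hypothetical entry in the card information storage unit 40."""
    bus: str               # PCI bus the slot hangs under, e.g. "16-1"
    device: str            # board mounted in the slot, e.g. "SCSI 17-1"
    needed_for_boot: bool  # True if the board is required to start the OS


# Assumed contents mirroring the embodiment: the SCSI boards carry the Boot Disks.
CARD_INFO = [
    SlotRecord("16-1", "SCSI 17-1", True),
    SlotRecord("16-1", "FC 18-1", False),
    SlotRecord("16-1", "NIC 19-1", False),
    SlotRecord("16-2", "SCSI 17-2", True),
    SlotRecord("16-2", "FC 18-2", False),
    SlotRecord("16-2", "NIC 19-2", False),
]


def split_device_groups(records):
    """Split the stored records into the first and second device groups."""
    first = [r.device for r in records if r.needed_for_boot]
    second = [r.device for r in records if not r.needed_for_boot]
    return first, second


first_group, second_group = split_device_groups(CARD_INFO)
print("first device group:", first_group)
print("second device group:", second_group)
```

Keeping the boot-relevance flag in stored data rather than in firmware logic is one way the service processor could identify the first device group without probing every slot.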

Next, the failure processing method for the computer system according to this embodiment is described. FIG. 3 is a flowchart showing the operation of that failure processing method.

Suppose that, while the computer system is in operation, a failure occurs that prevents the system from continuing to operate (step S10). In this case, the part where the failure occurred notifies the service processor 20 of the failure.

The service processor 20 issues a DC OFF command to a power supply circuit (not shown) and cuts off the power supply to the main system apparatus 10 (step S20). The service processor 20 also sends a Reset pending trap to the SNMP manager 30, together with information indicating which part has failed. Based on the information it receives, the SNMP manager 30 determines whether operation can continue if part of the main system apparatus 10 is logically isolated and the system is restarted. If a restart is possible, the SNMP manager 30 instructs the service processor 20 to isolate the failed part, and the service processor 20 logically isolates it. For example, if the failed part is the PCI bus 16-1 under the south bridge 13-1, the south bridge 13-1 is logically isolated from the main system apparatus 10.
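The isolation decision in step S20 might be sketched as follows. The topology table and both function names are hypothetical, and the restart criterion used here (at least one south bridge with a boot-capable SCSI board must remain) is an assumption; the patent does not say how the SNMP manager 30 reaches its decision.

```python
# Assumed topology of the embodiment: each south bridge, its PCI bus, its boards.
TOPOLOGY = {
    "13-1": {"bus": "16-1", "boards": ["SCSI 17-1", "FC 18-1", "NIC 19-1"]},
    "13-2": {"bus": "16-2", "boards": ["SCSI 17-2", "FC 18-2", "NIC 19-2"]},
}


def bridge_to_isolate(fault_site):
    """Return the south bridge to isolate for a fault on a bridge or its PCI bus."""
    for bridge, info in TOPOLOGY.items():
        if fault_site in (bridge, info["bus"]):
            return bridge
    return None  # fault is outside the parts this sketch knows how to isolate


def can_restart_after_isolation(fault_site):
    """Assumed criterion: a SCSI board must survive on a remaining bridge."""
    bridge = bridge_to_isolate(fault_site)
    if bridge is None:
        return False
    remaining = [b for b in TOPOLOGY if b != bridge]
    return any(
        board.startswith("SCSI")
        for b in remaining
        for board in TOPOLOGY[b]["boards"]
    )


print(bridge_to_isolate("16-1"))
print(can_restart_after_isolation("16-1"))
```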

After the power supply to the main system apparatus 10 has been stopped and the failed part has been isolated, the SNMP manager 30 instructs the service processor 20 to restart the computer system (step S30). On receiving the restart instruction, the service processor 20 controls the power supply circuit so that power is supplied to the main system apparatus 10.

Next, an SD test of each device, a test of the MEM 14, and other tests are performed, followed by Handoff (step S40).

Next, the PCI devices are initialized. The service processor 20 refers to the card information storage unit 40 to identify the PCI devices necessary for starting up the OS, and first initializes only those PCI devices (step S50). In this embodiment, only the SCSI board (17-1) to which the Boot Disk is connected is initialized. If the south bridge 13-1 has been logically isolated, the SCSI board (17-2) under the south bridge 13-2 is initialized instead.
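The choice made in step S50 can be illustrated with a short sketch. The candidate list and the function name `select_boot_adapter` are assumptions for illustration; the patent describes only the outcome (SCSI 17-1 normally, SCSI 17-2 when the south bridge 13-1 is isolated).

```python
# Boot-capable SCSI boards in preference order, paired with their south bridge.
BOOT_CANDIDATES = [("13-1", "SCSI 17-1"), ("13-2", "SCSI 17-2")]


def select_boot_adapter(isolated_bridges):
    """Pick the first boot-capable board whose south bridge is not isolated."""
    for bridge, adapter in BOOT_CANDIDATES:
        if bridge not in isolated_bridges:
            return adapter
    return None  # no usable boot path remains


print(select_boot_adapter(set()))
print(select_boot_adapter({"13-1"}))
```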

When initialization of the PCI device necessary for OS startup (SCSI 17-1) is complete, the service processor 20 reads the OS boot loader from the Boot Disk into the MEM 14 and begins starting the OS (step S60).

When OS startup completes and the OS Ready state is reached (step S70), the CPUs (11-1 to 11-4) begin failure processing on the OS (step S80).

After the failure processing has started in step S80, the service processor 20 initializes the other PCI devices that were not initialized in step S50 (step S90).

When initialization of all the PCI devices is complete, the series of operations in this embodiment ends (step S100).
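The ordering of steps S20 through S100 can be summarized in a minimal sketch. The function `recovery_sequence` and the step strings are hypothetical; the sketch models only the order of events, not real hardware control, and it lists the S90 initializations sequentially even though in the embodiment they proceed while the CPUs are already running failure processing.

```python
def recovery_sequence(devices, boot_devices):
    """Return the ordered steps of the staged recovery flow as strings."""
    first_group = [d for d in devices if d in boot_devices]
    second_group = [d for d in devices if d not in boot_devices]
    steps = [
        "S20 power off and isolate faulty part",
        "S30 power on",
        "S40 self tests and Handoff",
    ]
    # Only devices needed to boot the OS are initialized before OS startup.
    steps += [f"S50 initialize {d}" for d in first_group]
    steps += [
        "S60 start OS boot",
        "S70 OS Ready",
        "S80 CPU starts failure processing",
    ]
    # Remaining devices are initialized only after failure processing has begun.
    steps += [f"S90 initialize {d}" for d in second_group]
    steps.append("S100 all PCI devices initialized")
    return steps


DEVICES = ["SCSI 17-1", "FC 18-1", "NIC 19-1", "SCSI 17-2", "FC 18-2", "NIC 19-2"]
for step in recovery_sequence(DEVICES, {"SCSI 17-1"}):
    print(step)
```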

Initializing a PCI device is indispensable for bringing the device into a usable state. However, the process takes a long time because it involves operations such as checking whether an Option ROM exists and, if one exists, loading the Option ROM code into memory.

Consequently, if the OS is started only after all the PCI devices have been initialized, the start of failure processing for the computer system is delayed.

In this embodiment, by contrast, only the PCI devices necessary for OS startup are initialized first, and the other PCI devices are initialized after the OS has started and failure processing has begun. The time previously spent, before failure processing could begin, on initializing devices not needed for OS startup is therefore eliminated. The more PCI devices there are, the earlier failure processing can start relative to the conventional flow.
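A small worked example makes the comparison concrete. Every duration below is invented purely for illustration (the patent gives no timing figures), and the function name is hypothetical; the sketch only shows that the time before failure processing begins depends on the first device group alone when the remaining initialization is deferred.

```python
def time_to_failure_processing(init_seconds, boot_devices, os_boot_seconds, deferred):
    """Seconds from the start of device initialization until failure processing begins."""
    if deferred:
        # Staged flow: only the boot devices are initialized before the OS starts.
        init = sum(t for d, t in init_seconds.items() if d in boot_devices)
    else:
        # Conventional flow: every device is initialized before the OS starts.
        init = sum(init_seconds.values())
    return init + os_boot_seconds


# Hypothetical per-device initialization times in seconds (not from the patent).
INIT_SECONDS = {
    "SCSI 17-1": 20, "FC 18-1": 20, "NIC 19-1": 10,
    "SCSI 17-2": 20, "FC 18-2": 20, "NIC 19-2": 10,
}
OS_BOOT_SECONDS = 60  # also hypothetical

conventional = time_to_failure_processing(INIT_SECONDS, {"SCSI 17-1"}, OS_BOOT_SECONDS, deferred=False)
staged = time_to_failure_processing(INIT_SECONDS, {"SCSI 17-1"}, OS_BOOT_SECONDS, deferred=True)
print(conventional, staged)  # 160 80 with the assumed numbers above
```

With these assumed values the saving equals the summed initialization time of the deferred devices, which is why the benefit grows with the number of PCI devices.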

In other words, according to this embodiment, when a failure that requires a system reboot occurs, only the minimum set of PCI devices necessary for OS startup is initialized, so the time required for OS startup is shortened and failure processing can begin quickly.

1 Boot disk
2 Disk
11 CPU
12 North bridge
13 South bridge
14 Main memory
15 Processor bus
16 PCI bus
17 SCSI
18 FC
19 NIC
10 Main system apparatus
20 Service processor
30 SNMP manager
40 Card information storage unit

Claims

1. A computer system comprising:
a main system apparatus including a CPU;
a plurality of expansion devices mounted on the main system apparatus to extend the functions of the main system apparatus; and
a service processor that controls the operation of the main system apparatus independently of the CPU,
wherein the service processor, when the power of the main system apparatus is turned on again due to the occurrence of a failure, initializes only a first device group, which consists of the devices among the plurality of expansion devices that are necessary for starting up an operating system, and starts the operating system after initializing the first device group,
the CPU executes failure processing on the operating system after the operating system is started, and
the service processor initializes a second device group, which consists of the devices among the plurality of expansion devices other than the first device group, after execution of the failure processing by the CPU is started.

2. The computer system according to claim 1, further comprising:
a device information storage unit in which information for specifying the first device group and the second device group is stored in advance,
wherein the service processor refers to the device information storage unit, specifies the first device group, and initializes the first device group.

3. The computer system according to claim 1 or 2,
wherein the expansion devices are PCI (Peripheral Component Interconnect) cards.

4. A failure processing method for a computer system, comprising:
a power-on step of turning on again, when a failure occurs, the power of a main system apparatus including a CPU;
a first device group initialization step of initializing a first device group after the power-on step;
an OS startup step of starting an operating system after the first device group initialization step;
a step of executing failure processing on the operating system by the CPU after the operating system is started; and
a second device group initialization step of initializing a second device group after execution of the failure processing is started,
wherein the first device group is a device group, among a plurality of expansion devices mounted on the main system apparatus to extend the functions of the main system apparatus, that is necessary for starting up the operating system,
the second device group is a device group, among the plurality of expansion devices, other than the first device group, and
the first device group initialization step, the OS startup step, and the second device group initialization step are performed by a service processor that controls the operation of the main system apparatus independently of the CPU.

5. The failure processing method for a computer system according to claim 4, further comprising:
a storage step of storing, in advance, information for specifying the first device group and the second device group,
wherein, in the first device group initialization step, the first device group is specified based on the information stored in the storage step, and the first device group is initialized.

6. The failure processing method for a computer system according to claim 4 or 5,
wherein the expansion devices are PCI (Peripheral Component Interconnect) cards.